Quality vs Data/Domain Understanding
Understanding your data and domain is the key to Quality Assurance!
I often marvel at how much AI-driven decision-making has improved our lives across diverse fields. That is inarguably a huge milestone we humans have achieved, and we are still progressing and making an impact. Data is the key to understanding the end-user's needs and expectations, and understanding its significance, purpose, source, and target is even more important. While we are witnessing rapid advancements in the digital world, we definitely can't ignore the human effort that has enabled these transformations. Only quality input can bring quality outcomes, and Quality Assurance Engineers are playing a vital part in this digitisation.
“What counts in life is not the mere fact that we have lived; it is what difference we have made to the lives of others that will determine the significance of the life we lead.” — Nelson Mandela
Lack of information leads to disastrous decision-making!
Proper domain knowledge, and an in-depth understanding of the data you are dealing with or will test, are the keys to quality assurance on a production-ready, bug-free, high-quality platform. Without proper domain knowledge, you can't uncover the hidden anomalies lurking within the data.
From personal experience, I learned over time how critical it is to understand the domain from different perspectives before designing and implementing the negative test cases that uncover discrepancies in the data. For example, a few years ago, during my Master's thesis, I was given a large genomic dataset for human obesity prediction. Before anything else, I had to make sure the data was correct, consistent, and complete; in other words, I had to ensure its quality.
Rather than digging further into the data analytics itself, I'll keep the focus on the findings that matter from a testing point of view, and on why a prior, in-depth understanding of the data is important.
As soon as I started filtering the data, I made a shocking discovery: although the dataset was supposed to contain only human genomic data, with over 700,000 records, it had already been manipulated and held unexpected data points that didn't belong to humans at all. No one would have guessed that this dataset could contain rice DNA. How that happened is, luckily, not our concern here, but imagine predicting human obesity from contaminated data. Undoubtedly, that would be a disaster in medical treatment!
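The kind of negative check that catches this can be sketched in a few lines of plain Python. The records below, the `organism` field name, and its values are hypothetical stand-ins for the real 700,000-record dataset, chosen only to illustrate the idea of flagging out-of-domain data:

```python
# Hypothetical records standing in for the real genomic dataset;
# the "organism" field name and all values are assumptions for illustration.
records = [
    {"sample_id": "S1", "organism": "Homo sapiens", "bmi": 31.2},
    {"sample_id": "S2", "organism": "Homo sapiens", "bmi": 24.8},
    {"sample_id": "S3", "organism": "Oryza sativa", "bmi": None},  # rice DNA!
    {"sample_id": "S4", "organism": "Homo sapiens", "bmi": 29.5},
]

EXPECTED_ORGANISM = "Homo sapiens"

# Negative check: collect every record that falls outside the expected domain.
out_of_domain = [r for r in records if r["organism"] != EXPECTED_ORGANISM]
for r in out_of_domain:
    print(f"Out-of-domain record {r['sample_id']}: {r['organism']}")

# Keep only in-domain records for downstream analysis.
clean = [r for r in records if r["organism"] == EXPECTED_ORGANISM]
```

The point is not the code but the mindset: knowing the domain tells you which assumptions (here, "every record is human") are worth testing in the first place.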
Two important points about domain understanding can be taken from the above findings:
This is human genomic medical data. (This shows why there shouldn't be any irrelevant data in it.)
The dataset will be used to predict human obesity. (This establishes the worth of the data and why it has to be critically analysed and cleaned.)
Only when you know how important the data is, how critical the output will be, and why it has to be carefully prepared for a certain domain can you think outside the box, identify potential risk elements, and ensure the data's credibility.
Conclusion
Regardless of the domain, machine learning algorithms are trained to perform specific tasks accurately, and their training data is the critical part. Here we can state a simple rule for trained models: 'Garbage in, garbage out!' If we need accurate and reliable results, then the data has to be clean, consistent, and relevant.
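As a closing sketch, the 'garbage in, garbage out' rule can be turned into a simple gate that admits a record into a training set only if it is complete, relevant, and consistent. All field names and value ranges here are hypothetical, chosen purely to illustrate the three checks:

```python
def is_training_ready(record: dict) -> bool:
    """Admit a record only if it is complete, relevant, and consistent.
    Field names and ranges are hypothetical, for illustration only."""
    required = ("sample_id", "organism", "bmi")
    # Complete: every required field is present and non-empty.
    if any(record.get(field) in (None, "") for field in required):
        return False
    # Relevant: the record belongs to the target domain.
    if record["organism"] != "Homo sapiens":
        return False
    # Consistent: values fall within a plausible range.
    if not (10.0 <= record["bmi"] <= 100.0):
        return False
    return True

samples = [
    {"sample_id": "S1", "organism": "Homo sapiens", "bmi": 31.2},
    {"sample_id": "S2", "organism": "Oryza sativa", "bmi": None},   # irrelevant & incomplete
    {"sample_id": "S3", "organism": "Homo sapiens", "bmi": 250.0},  # implausible value
]
training_set = [s for s in samples if is_training_ready(s)]
```

Whatever the real checks look like for your domain, making them explicit and automated is what keeps the garbage out.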
If you’re interested in more content about Software QA & Data QA, be sure to Subscribe and follow.
Medium: https://medium.com/@ahsan924
LinkedIn: https://www.linkedin.com/in/ahsanbilal/
Substack (Data QA): https://dataqa.substack.com/
Substack (QA Warrior): https://qawarrior.substack.com/