Know Your Data Quality
Message of the Day
Data quality is the degree to which data meets the purposes and requirements of its use. Depending on the use, good-quality data may mean data that is complete, accurate, credible, consistent, or simply “good enough”.
Things to consider
What is data quality and how can we distinguish between good and bad data? How are the issues of data quality being addressed in various disciplines?
- The most straightforward definition of data quality is the quality of the content (values) in one’s dataset. For example, if a dataset contains the names and addresses of customers, all names and addresses have to be recorded (the data is complete), they have to correspond to the actual names and addresses (the data is accurate), and all records have to be up to date (the data is current).
- The most common characteristics of data quality are completeness, validity, consistency, timeliness and accuracy. Additionally, data has to be useful (fit for purpose), documented, and reproducible / verifiable.
- At least four activities impact the quality of data: modeling the world (deciding what to collect and how), collecting or generating data, storage / access, and formatting / transformation.
- Assessing data quality requires disciplinary knowledge and is time-consuming.
- Open data quality issues include how to measure quality, how to track the lineage of data (provenance), how to decide when data is “good enough”, what happens when data is mixed and triangulated (especially high-quality with low-quality data), and crowdsourcing for quality.
- Data quality is the responsibility of both data providers and data curators: data providers ensure the quality of their individual datasets, while curators help the community with consistency, coverage and metadata.
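Some of the characteristics listed above can be turned into simple, repeatable checks. The following is a minimal sketch in Python, not a definitive method. It assumes a hypothetical customer table with `name`, `address` and `updated` fields, and a hypothetical one-year threshold for what counts as current. All records, field names and thresholds are made up for illustration.

```python
from datetime import date

# Hypothetical customer records; field names and values are illustrative only.
records = [
    {"name": "Ada Park", "address": "1 Main St", "updated": date(2024, 5, 1)},
    {"name": "", "address": "22 Oak Ave", "updated": date(2024, 3, 15)},
    {"name": "Lee Chen", "address": None, "updated": date(2019, 1, 10)},
]


def completeness(rows, field):
    """Share of rows in which `field` is present and non-empty."""
    filled = sum(1 for r in rows if r.get(field) not in (None, ""))
    return filled / len(rows)


def currency(rows, field, as_of, max_age_days=365):
    """Share of rows whose `field` date is at most max_age_days before as_of."""
    fresh = sum(1 for r in rows if (as_of - r[field]).days <= max_age_days)
    return fresh / len(rows)


as_of = date(2024, 6, 1)  # assumed reference date for the currency check
report = {
    "name_completeness": completeness(records, "name"),
    "address_completeness": completeness(records, "address"),
    "currency": currency(records, "updated", as_of),
}
print(report)
```

Accuracy is deliberately left out of this sketch: checking whether a recorded address matches the real one requires an external reference source, which is one reason assessing data quality takes disciplinary knowledge and time.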
“Care and Quality are internal and external aspects of the same thing. A person who sees Quality and feels it as he works is a person who cares. A person who cares about what he sees and does is a person who’s bound to have some characteristic of quality.”
― Robert M. Pirsig, Zen and the Art of Motorcycle Maintenance: An Inquiry Into Values
- Bad Data Costs the U.S. $3 Trillion Per Year https://hbr.org/2016/09/bad-data-costs-the-u-s-3-trillion-per-year
- Data Quality and Curation http://datascience.codata.org/articles/abstract/10.2481/dsj.GRDI-011/
- Good data are not enough http://www.nature.com/news/good-data-are-not-enough-1.20906
- Bad data issues guide https://github.com/Quartz/bad-data-guide
- Examples of how not to prepare or provide data http://okfnlabs.org/bad-data/
- Data quality assessment (provides a table of various quality dimensions and their definitions): Pipino, L. L., Lee, Y. W., & Wang, R. Y. (2002). Data quality assessment. Communications of the ACM, 45(4), 211. http://doi.org/10.1145/505248.506010
- Procedures for improved quality and reliability of documentation in water sample data http://www.go-ship.org/Manual/Swift_DataEval.pdf
- How Do We Define Clinical Trial Data Quality if No Guidelines Exist? http://www.appliedclinicaltrialsonline.com/how-do-we-define-clinical-trial-data-quality-if-no-guidelines-exist
- CDISC (Clinical Data Interchange Standards Consortium) Standards: https://www.cdisc.org/standards
- Show your most recent dataset (or part of it) to your colleague and ask their opinion of its quality (exchanging datasets with a colleague makes this activity more fun).
- Use criteria for good data (e.g., completeness, accuracy, fitness for use, documentation) to assess where your data stands.
- Discuss your approaches to data collection and measures you took / could take to ensure integrity and completeness of your data.
- Discuss steps to address missing or incomplete data in the context of your research. Does it matter? How much does missing data affect the validity, reliability or trustworthiness of your conclusions?
- Check out the Calling Bullshit Syllabus (e.g., Food Stamp Fraud or the Musician Mortality Case Study). What can we learn about data quality from these stories?
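For the missing-data discussion above, a useful first step is to quantify how much is actually missing. The following is a minimal sketch, assuming a hypothetical survey table in which missing values are stored as `None`. The rows and field names are invented for illustration.

```python
# Hypothetical survey responses; None marks a missing value.
rows = [
    {"age": 34, "income": 52000, "region": "north"},
    {"age": None, "income": 48000, "region": "south"},
    {"age": 29, "income": None, "region": "south"},
    {"age": 41, "income": None, "region": None},
]
fields = ["age", "income", "region"]


def missing_rate(rows, field):
    """Fraction of rows in which `field` is missing (None)."""
    return sum(1 for r in rows if r[field] is None) / len(rows)


# Per-field missingness shows which variables are most affected.
rates = {f: missing_rate(rows, f) for f in fields}

# Complete cases are rows with no missing value in any field; their share
# shows how much data would remain if incomplete rows were simply dropped.
complete_cases = [r for r in rows if all(r[f] is not None for f in fields)]
complete_share = len(complete_cases) / len(rows)

print(rates)
print(complete_share)
```

Comparing the per-field rates with the share of complete cases shows how quickly a dataset shrinks when incomplete rows are dropped. Whether that loss matters for your conclusions depends on why the values are missing, which the numbers alone cannot tell you.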