Lessons from Archives: Strategies for Collecting Sociocultural Data in Machine Learning

What can data scientists learn from the archival sciences about data collection?

Publisher: Conference on Fairness, Accountability, and Transparency (FAT* ’20), January 27–30, 2020, Barcelona, Spain. ACM, New York, NY, USA, 11 pages. https://doi.org/10.1145/3351095.3372829


Reader, you are likely already aware that haphazardly collected data can propagate social biases. Most ML research focused on devising state-of-the-art models and benchmarks does not critically examine the data sources used to develop those models and benchmarks. Data curation as a whole is not a part of most ML research portfolios. The authors propose an alternative approach to interrogating data, informed by archival research best practices. Further, they advocate for the creation of an interdisciplinary field whose mission is “data gathering, sharing, annotation, ethics monitoring, and record-keeping processes.”

The authors focus on archives because archival practice has many features that parallel the needs of data curation in ML, including full-time curators, community participation, standardized documentation methods, and professional codes of ethics. Fundamentally, archival science offers an extensive body of study on the collection of human information. Archives represent the other extreme of the data collection spectrum: interventionist where ML is laissez-faire. The authors argue that adopting a more interventionist approach to data collection can help researchers avoid replicating historical and representational biases.

Proposed integrations from archival science that stood out to me include:

  1. Data consortia, which would set up institutional frameworks for ethical data collection to benefit both small and large firms.
  2. Resources that ensure transparency, such as Datasheets for Datasets, which record not only the contents of data, but also the process of data collection.
  3. A code of ethics for data collection with appropriate incentive measures to ensure compliance.
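
To make the second item concrete for myself, here is a minimal sketch of what a machine-readable datasheet record might look like. This is my own illustration, not something from the paper: the `Datasheet` class and its `missing_sections` helper are hypothetical, and the field names only loosely follow the section headings used in Datasheets for Datasets (motivation, composition, collection process, and so on).

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass
class Datasheet:
    """Hypothetical minimal datasheet record (illustrative only).

    Field names loosely mirror the section headings of
    Datasheets for Datasets; free text stands in for real answers.
    """
    dataset_name: str
    motivation: str = ""
    composition: str = ""
    collection_process: str = ""
    preprocessing: str = ""
    uses: str = ""
    distribution: str = ""
    maintenance: str = ""

    def missing_sections(self) -> List[str]:
        # Report every section that is still blank, so gaps in the
        # documentation are visible before the dataset is shared.
        return [f.name for f in fields(self)
                if not getattr(self, f.name).strip()]


sheet = Datasheet(
    dataset_name="example-corpus",
    composition="Short text posts",
    collection_process="Gathered from a public forum",
)
print(sheet.missing_sections())
```

The point of the sketch is that the collection process sits alongside the contents as a first-class field, and an undocumented section shows up as an explicit gap rather than going unnoticed.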

The authors conclude with a macro-list of action items:

  1. Congregate and develop data consortia
  2. Establish professional organizations that work by membership to enforce adherence to ethical guidelines
  3. Support community archives
  4. Develop a subfield dedicated to the data collection and annotation process

My Takeaways

The idea of treating data collection carefully is not new to me; in economics research, an entire section of a paper is generally devoted to data collection methods and the overarching statistical properties of the data. I also have experience with archival research through a research project I completed in college. However, this paper has made me realize that looking only at superficial metadata like summary statistics is woefully inadequate. The process for data curation outlined here is so thorough and thoughtful, and I genuinely want to live in a world where data collection is treated with the kind of reverence described in this research.

I’m particularly struck by this recommendation to the data science community: “form or integrate existing global/national organizations in instituting standardized codes of ethics/conduct and procedures to review violations.” So often, when I think about ML ethics, it’s hard for me to envision a sufficient incentive structure. I think the authors succeed in depicting the archival model, or something more like it, as a means of safeguarding ML models.