Fairness Issues as a Consequence of Technical Debt
I recently transitioned off a client project involving the development of a large machine learning prediction system. Unfortunately for me, we didn't reach the long-coveted deployment stage before project funding ran out. While reflecting on how the project went and on ways I could improve my next machine learning project, I thought about the hurdles the client might encounter if they tried to deploy the system at some point in the future when I am no longer there. Although I documented my work thoroughly, the client will likely still encounter gaps in information transfer, broken commands or links, and unexpected results from data processing scripts. The accumulation of these challenges, which fall under a concept called technical debt, is important to manage during any machine learning project.
I began to realize that defects outside of traditional model deployment can also be classified under technical debt: things like machine bias and model opaqueness. This realization sparked my curiosity: how might these risks differ from traditional software development risks, or even risks specific to production machine learning systems, and how can technical debt management reframe the way we mitigate fairness and transparency risks?
First, let's start by defining technical debt. Technical debt, described as "not quite right code which we postpone making it right," is a concept introduced by Ward Cunningham in 1992.[1] Its original purpose was to provide a rationale for what we now call code refactoring, and it captures the long-term costs incurred by short-term shortcuts in software development. It can be thought of as analogous to financial debt, where the extra effort required to add new features to a software program is like interest on a debt.[2] As with financial debt, a high-level technical debt management strategy is to pay down the "principal" before the "interest" accumulates: address software modules with internal quality deficiencies, spending the most time on the areas of code that are modified most frequently, to lessen the cost (the "interest") of modifying those modules in the future. Software quality improvement includes actions like refactoring code, improving unit tests, and improving documentation.[3] The ultimate goal of technical debt management is to keep systems maintainable over time without significant interruptions in service. The concept of technical debt has been expanded over the years with the help of new tools[5] and taxonomies[4].
A 2012 IEEE Software article asserts that technical debt may not involve code at all, but may instead result from architectural or structural choices or technological gaps.[6] This is especially true of machine learning systems, which present risks distinct from those of traditional software engineering. In a machine learning system, technical debt can manifest in a number of ways. Below, I list the technical debt risks most relevant to FATE (fairness, accountability, transparency, and explainability). These risks were proposed by Sculley et al. (2015).[3]
- Unstable data dependency: If input data is unstable, then the behavior of a machine learning system using that data can change over time in unpredictable ways. Examples of unstable data dependencies are mappings whose key-value pairs change over time and miscalibrated data encodings that are later corrected. One method of mitigating this risk is to use data versioning to encapsulate stable data extracts (a sketch follows this list).
- Underutilized data dependency: Features that are not needed in the input data, such as legacy features no longer in use or correlated features, can impact a model's predictive power. The authors recommend leave-one-feature-out evaluation to identify unneeded features (a sketch follows this list).
- Feedback loops: Feedback loops occur when machine learning systems influence their own behavior as they update over time. They can be direct or hidden. In the hidden case, two systems influence each other out in the world (think of a machine learning system that predicts which products to show and another that predicts product reviews). Feedback loops are difficult to diagnose and remedy.
- Pipeline instability: A typical machine learning workflow contains code that manipulates data using operations like joins, sampling, filters, et cetera, sometimes generating intermediate file outputs. If these pipelines are created ad hoc, then they may generate bugs and failures that are difficult to properly diagnose and mitigate. Pipelines are a special case of what the authors call “glue code”, which is often opaque utility code crafted to facilitate data preparation tasks. To mitigate pipeline instability, the authors recommend that engineers and researchers work together to develop ML packages to reduce the risk that packages appear to be black boxes.
- Prototype smell: Prototypes can be useful for initial model development, but small-scale results may not accurately reflect the phenomena the system seeks to measure at full scale.
- Fixed thresholds in dynamic systems: If the distribution of new data drifts away from that of the original data used to develop an ML system, then previously established thresholds may no longer be valid for achieving the desired model performance.
- Prediction bias: In the supervised learning case, this bias occurs when the distribution of predicted labels does not match the distribution of observed labels; it can be detected via automated testing (a sketch follows this list).
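To make the data versioning idea from the first bullet concrete, below is a minimal sketch that pins a data extract by its content hash so that training runs fail loudly when an upstream extract silently changes. Dedicated tools such as DVC handle this more robustly; the file names and manifest format here are illustrative assumptions, not part of the original project.

```python
# Minimal sketch: pin a stable data extract by recording its content hash.
# File names and the manifest format are illustrative assumptions.
import hashlib
import json
from pathlib import Path

MANIFEST = Path("data_versions.json")

def file_sha256(path: Path) -> str:
    """Compute the SHA-256 digest of a data extract."""
    return hashlib.sha256(path.read_bytes()).hexdigest()

def pin_extract(path: Path) -> None:
    """Record the extract's current hash in a small manifest kept with the pipeline."""
    versions = json.loads(MANIFEST.read_text()) if MANIFEST.exists() else {}
    versions[path.name] = file_sha256(path)
    MANIFEST.write_text(json.dumps(versions, indent=2))

def assert_extract_unchanged(path: Path) -> None:
    """Fail loudly if the extract no longer matches its pinned version."""
    versions = json.loads(MANIFEST.read_text())
    if versions.get(path.name) != file_sha256(path):
        raise RuntimeError(f"Data extract {path.name} has changed since it was pinned.")

# Usage: pin_extract(Path("training_extract.csv")) once, then call
# assert_extract_unchanged(Path("training_extract.csv")) at the start of each training run.
```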
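The leave-one-feature-out evaluation recommended by Sculley et al. can be approximated with a simple loop: drop each feature in turn, retrain, and compare performance against a baseline. The sketch below assumes a scikit-learn-style tabular dataset; the column names, model, and metric are placeholders rather than a prescribed setup.

```python
# Minimal sketch: leave-one-feature-out (LOFO) evaluation to flag features
# that add little or no predictive power. Model and metric are assumptions.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def lofo_score_drops(X: pd.DataFrame, y, cv: int = 5) -> pd.Series:
    """Return the drop in cross-validated accuracy when each feature is removed."""
    model = LogisticRegression(max_iter=1000)
    baseline = cross_val_score(model, X, y, cv=cv).mean()
    drops = {}
    for feature in X.columns:
        score = cross_val_score(model, X.drop(columns=[feature]), y, cv=cv).mean()
        drops[feature] = baseline - score  # near zero or negative => candidate for removal
    return pd.Series(drops).sort_values()

# Usage on hypothetical training data:
# drops = lofo_score_drops(X_train, y_train)
# print(drops[drops <= 0])  # features whose removal does not hurt performance
```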
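Prediction bias lends itself to exactly the kind of automated testing the last bullet describes: compare the rate of predicted positives against the rate of observed positives and fail when they diverge. This is a minimal sketch for a binary classification setting; the tolerance value is an assumption to be tuned per system.

```python
# Minimal sketch: an automated prediction bias check for binary classification.
# The tolerance is an illustrative assumption.
import numpy as np

def check_prediction_bias(y_pred: np.ndarray, y_obs: np.ndarray, tol: float = 0.02) -> None:
    """Raise if the predicted positive rate drifts from the observed positive rate."""
    pred_rate = float(np.mean(y_pred))
    obs_rate = float(np.mean(y_obs))
    if abs(pred_rate - obs_rate) > tol:
        raise AssertionError(
            f"Prediction bias detected: predicted positive rate {pred_rate:.3f} "
            f"vs observed {obs_rate:.3f} (tolerance {tol})"
        )

# This check could run in a unit test suite before deployment or in a scheduled
# monitoring job on recent production traffic, e.g.:
# check_prediction_bias(model.predict(X_holdout), y_holdout)
```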
These risks map directly to FATE concerns. On the transparency and explainability side, pipeline instability can make it difficult or impossible for anyone outside the engineering team to successfully run ML packages. In an ad hoc data preparation pipeline, it may not be clear what intermediate outputs to expect or in what order steps should be run. This opacity reduces the pipeline's reproducibility and renders it less useful over time as the ML package is maintained by different engineers on the team and encounters distributional shifts in the data. Additionally, hidden feedback loops may make it difficult to parse the ground-truth data-generating process and to understand how the ML system should behave versus how it actually behaves.
The technical debt risks above also present fairness risks. Unstable data dependencies can mean that if the data for a minority group changes more frequently than for the majority group, the ML system may have lower or more unstable predictive power for minority groups over time. Underutilized data dependencies, like legacy features, may represent historical patterns of discrimination that should no longer be reflected in the data. An example of a legacy feature that may impose fairness-related harms is outdated racial classifications containing terms for minority groups that reflect historical social prejudice. Outdated classifications may introduce unwanted signals of societal bias into the data. Correlated features are of significant concern in mitigating fairness-related harms, as they may encapsulate the signals of minority group data excluded from the input dataset (a famous example of a correlated but presumed innocuous feature is zip code, which can encapsulate societal racism due to the history of redlining). Feedback loops also can present fairness concerns, as it may not be clear how an indirectly-related system impacts a model’s predictive power for minority groups.
Prototype development, fixed thresholds, and prediction bias are all impacted by the dynamic nature of machine learning systems. A prototype dataset may not be representative of all demographic groups due to missingness, which may translate to poorer performance for those groups when the prototype model is deployed to production and has access to the full-scale data. Fixed thresholds and the discrepancy between predicted and observed labels may also be more variable for minority groups than for the majority group if those groups are not properly accounted for during data preparation. Dynamic system risks require diligent, automated model and data monitoring, data versioning, and careful design of models and datasets to anticipate and address potential fairness-related harms.
How might we use the framework of technical debt to tackle fairness and explainability issues? A popular management strategy is improving the integrity and quality of the original software architecture and code base to make modifications and additions easier in the future. For mitigating FATE concerns, this strategy includes ensuring the original ML system itself is fair, transparent, and secure, and that these factors take precedence when considering additions or modifications to the system. FATE analogues for common technical debt resolution strategies might include:
- Refactoring code: Refactoring is the process of restructuring existing code to improve its design or structure while preserving its external behavior. Reducing code complexity, including by prioritizing glass-box or explainable models during development, can make a system's logic easier to follow and thus facilitate transparency and explainability.
- Improving tests: Improving testing across the board is one way to catch disparities in outcomes both before models are deployed and after they are running in a production environment. As well as unit testing small-scale results (such as asserting that predictions for a minority group are within some threshold of predictions for the majority group), ML monitoring systems should also include fairness-specific evaluation metrics. In a monitoring scenario, engineers would set acceptable thresholds for metrics like demographic parity and equality of odds, and configure alerts for when metric values fall outside those thresholds (a sketch follows this list).
- Reducing dependencies: Data quality has a significant impact on the fidelity of ML systems, and reducing data dependencies is one way to bolster data quality. Prioritizing data with stable input labels and versioning data during the data manipulation process can mitigate instability in model results. As discussed above, legacy and correlated features may reflect patterns of social bias that ML models should not capture. Taking care to identify and remove these fields reduces the chance that embedded social biases end up in downstream model results (a screening sketch follows this list).
- Improving documentation: Robust documentation makes ML systems easier to maintain over time. Documentation frameworks like Model Cards for Model Reporting[7] and Datasheets for Datasets[8] give data science teams transparency into the model development process that persists across staff turnover.
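To illustrate the monitoring scenario in the "Improving tests" bullet, here is a minimal sketch of a demographic parity check with a configurable threshold and an alert hook. The threshold, the binary group encoding, and the logging-based alert are all illustrative assumptions; a production system would likely use a dedicated fairness or monitoring library.

```python
# Minimal sketch: demographic parity monitoring with a configurable threshold.
# Threshold, group encoding, and alerting mechanism are assumptions.
import logging
import numpy as np

logger = logging.getLogger("fairness_monitor")

def demographic_parity_gap(y_pred: np.ndarray, group: np.ndarray) -> float:
    """Absolute difference in positive prediction rates between two groups coded 0/1."""
    return abs(float(y_pred[group == 0].mean()) - float(y_pred[group == 1].mean()))

def check_demographic_parity(y_pred: np.ndarray, group: np.ndarray, threshold: float = 0.1) -> None:
    """Emit an alert when the parity gap exceeds the configured threshold."""
    gap = demographic_parity_gap(y_pred, group)
    if gap > threshold:
        # In a real deployment this might page an engineer or open a ticket.
        logger.warning("Demographic parity gap %.3f exceeds threshold %.3f", gap, threshold)

# Usage on a scored batch with a hypothetical protected-group column:
# check_demographic_parity(model.predict(X_batch), batch_df["protected_group"].to_numpy())
```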
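For the correlated-feature concern in the "Reducing dependencies" bullet, a simple pre-training screen can flag features that are strongly associated with a protected attribute and may act as proxies for it (zip code being the classic example). The sketch below assumes numerically encoded features and uses Pearson correlation; categorical features would need an association measure such as Cramér's V, and the cutoff is an arbitrary illustrative value.

```python
# Minimal sketch: flag numeric features that correlate strongly with a protected
# attribute and may act as proxies for it. The cutoff is an illustrative assumption.
import pandas as pd

def flag_proxy_features(X: pd.DataFrame, protected: pd.Series, cutoff: float = 0.5) -> pd.Series:
    """Return features whose absolute Pearson correlation with the protected attribute exceeds the cutoff."""
    correlations = X.corrwith(protected).abs().sort_values(ascending=False)
    return correlations[correlations > cutoff]

# Usage with hypothetical column names:
# proxies = flag_proxy_features(X_train, train_df["protected_attribute"])
# print(proxies)  # candidates to investigate or drop before training
```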
I'd like to pause here to note that fairness issues are fundamentally socio-technical challenges, and technical approaches to mitigation may be insufficient for a given use case. In general, software tools can mitigate harms, but they cannot eliminate them. Before ML software development begins, the most useful harms-mitigation strategy may be to ask whether ML is appropriate for solving the problem at hand. An ML system that is inappropriately applied to a social problem presents significant risks beyond technical debt. Technical debt management takes a holistic approach to keeping software maintainable over time. When machine learning is a core part of a software system, that holistic approach must include asking whether the effort of developing a potentially high-risk system and deploying it in a dynamic environment is worth the potential harms it may cause to the system's stakeholders.
Thanks for reading this post!
Citations
- W. Cunningham, “The WyCash Portfolio Management System,” Proc. OOPSLA, ACM, 1992; http://c2.com/doc/oopsla92.html.
- M. Fowler, “Technical Debt,” blog, 2009; http://martinfowler.com/bliki/TechnicalDebt.html.
- D. Sculley et al., "Hidden Technical Debt in Machine Learning Systems," NeurIPS, 2015.
- I. Gat, ed., “Special Issue on Technical Debt,” Cutter IT J., vol. 23, no. 10, 2010.
- M. Fowler, “TechnicalDebtQuadrant,” blog, 2009; https://martinfowler.com/bliki/TechnicalDebtQuadrant.html.
- P. Kruchten, R. Nord, and I. Ozkaya, "Technical Debt: From Metaphor to Theory and Practice," IEEE Software, vol. 29, no. 6, 2012.
- M. Mitchell et al., "Model Cards for Model Reporting," FAT*, 2019; https://www.seas.upenn.edu/~cis399/files/lecture/l22/reading2.pdf.
- T. Gebru et al., "Datasheets for Datasets," Proceedings of the 5th Workshop on Fairness, Accountability, and Transparency in Machine Learning, 2018; https://www.microsoft.com/en-us/research/uploads/prod/2019/01/1803.09010.pdf.