Critical Data Modeling: Tools to Trace Oppression in Information Systems

Karen M. Wickett

Scholars of library and information studies are called to create new methods and to apply our analytic techniques to study how information systems harm our neighbors and communities. The social impacts of information systems have been studied by scholars in economics (Eubanks, 2018), medicine (Obermeyer et al., 2019), criminal justice (Angwin et al., 2016; Jefferson, 2020), and critical information studies (Noble, 2018). However, there is a gap between critiques that describe the social impacts of information systems and the technical aspects of how those systems are designed and built. This gap is significant because the technicalities of information systems obscure the oppressive nature of those systems and their role in perpetuating inequality (Benjamin, 2019). Critical data modeling uses tools from data modeling and systems analysis to analyze information systems, connecting social critiques to the structure of digital information systems and data objects (Wickett, 2023).

Every information system or digital object uses data models. A data model lists the attributes of the real-world things that will be described by a database or dataset. For example, the data model for a class roster might tell us that each student listed on the roster should have a name, an ID number, and the number of credits they are enrolled for. The data model will also give rules for data that go into each attribute. It might require that names should be strings of letters, ID numbers should be 10-digit strings of numbers, and credit hours should be whole numbers between 2 and 4.
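The class roster example can be written out as a small sketch. The names used here (`RosterEntry`, `student_id`, `validate`) and the reading of "strings of letters" as unaccented Latin letters separated by single spaces are assumptions made for illustration; they are not taken from any real roster system.

```python
import re
from dataclasses import dataclass


@dataclass
class RosterEntry:
    """One row of a hypothetical class roster: the three attributes the data model keeps."""
    name: str
    student_id: str
    credits: int


def validate(entry: RosterEntry) -> list[str]:
    """Return the list of data model rules that an entry breaks (empty if none)."""
    problems = []
    # Assumed rule: unaccented Latin letters, words separated by single spaces.
    if not re.fullmatch(r"[A-Za-z]+(?: [A-Za-z]+)*", entry.name):
        problems.append("name must be a string of letters")
    # Rule from the example: ID numbers are 10-digit strings.
    if not re.fullmatch(r"[0-9]{10}", entry.student_id):
        problems.append("ID must be a 10-digit string")
    # Rule from the example: credit hours are whole numbers between 2 and 4.
    if not (isinstance(entry.credits, int) and 2 <= entry.credits <= 4):
        problems.append("credits must be a whole number between 2 and 4")
    return problems


print(validate(RosterEntry("Ada Lovelace", "0123456789", 3)))
print(validate(RosterEntry("Ada", "123", 5)))
```

Under this narrow reading of the name rule, an entry such as "O'Neil" or "José" would be rejected, which is one small way a rule in a data model takes a position on what counts as valid data.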

Data models create a simplified version of the world so that we can have useful databases and information systems. Trying to record every possible fact about each student in a class would make a hopelessly long dataset full of unnecessary information. But this also means that every data model takes a position on what information is essential and what information can be discarded for the purposes at hand. These choices influence what we can learn from datasets and how information systems shape our lives. The goal of critical data modeling is to make these choices clear and connect them to the impacts of information systems on people and communities.

A digital object like a class roster file is a complex object that is built up of many layers of expression and encoding. If I download my class roster as a CSV (comma-separated values) file, the information has been arranged as a table and encoded according to the standards for the file format. The data values for each attribute have been expressed with numbers and letters, which are encoded following another standard called UTF (Unicode Transformation Format). Those UTF characters are encoded as binary bitstreams (sequences of 1s and 0s) since our computers use binary representations to record and process information. The Basic Representation Model offers a conceptual model for these levels of representation and encoding, which gives us a path for analyzing datasets and critiquing data models, formats, and encodings in terms of their impacts on people and communities (Wickett, 2023).
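The layers described above can be made visible with a short sketch. It assumes UTF-8 as the specific Unicode encoding and uses a made-up one-row roster; the variable names are illustrative only, and the sketch is not an implementation of the Basic Representation Model itself.

```python
import csv
import io

# Layer 1: the information, arranged as a table (a header row and one hypothetical student).
rows = [["name", "id", "credits"], ["Ada Lovelace", "0123456789", "3"]]

# Layer 2: the table expressed as characters, following CSV conventions.
buffer = io.StringIO()
csv.writer(buffer, lineterminator="\n").writerows(rows)
text = buffer.getvalue()

# Layer 3: the characters encoded as bytes (UTF-8 is assumed here).
encoded = text.encode("utf-8")

# Layer 4: the bytes written out as a bitstream of 1s and 0s.
bits = "".join(f"{byte:08b}" for byte in encoded)

print(repr(text))
print(len(encoded), "bytes,", len(bits), "bits")
print(bits[:8])  # the bits that encode the first character, "n"
```

Reading the bitstream back eight bits at a time and decoding the resulting bytes recovers the same text, which shows that each layer is a reversible encoding of the one above it.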

This model supports close readings of datasets that influence communities through public perception and policymaking. The City of Los Angeles publishes an “Arrest Data from 2020 to Present” dataset through its open data portal that lists information about arrests made by the Los Angeles Police Department (LAPD, 2022). The dataset is accompanied by a detailed description of the 25 columns that are used to describe an event where a person was arrested. Beyond the dataset documentation, the tags and the formats available for download are evidence of how the dataset is positioned and intended to be used. This arrest dataset is tagged as “public safety,” and the available data formats foreground geographic data formats (KML and Shapefile) that can be imported into GIS software to create maps.

The dataset also emphasizes geography through the data model. Geographic information appears at varying levels of granularity in 10 of the 25 attributes, and every row in the dataset includes a value for Location (“The location where the crime incident occurred. Actual address is omitted for confidentiality. XY coordinates reflect the nearest 100 block” (LAPD, 2022)). In contrast, there is no information about the police officers involved in an arrest. While anonymity is a reasonable concern for a publicly available dataset, the people who have been arrested are present in the data model in terms of their age, gender (labeled ‘Sex Code’), and race (labeled ‘Descent Code’). The imbalance is striking.

The dataset documentation states that “Each row represents an arrest”. There is an assumed correspondence between a row in the dataset and a criminal incident, which is driven home by the example visualizations for the dataset on the open data portal. The portal positions us to map out these arrests and draw conclusions about criminal activities. However, examining the actual data through structured queries shows how problematic it is to make inferences about crime from this dataset. While every row includes Location, rows with missing location data use (0,0) instead of a missing-data indicator, a choice that places conformance with the geographic data format over the accuracy of the dataset. Moreover, many rows in the dataset are about children taken into custody when a parent was arrested, but those rows often still list a location of a criminal incident. This critical reading of the dataset supports arguments about the transformation of information about policing and crime into geographic information, which has significant impacts on how we understand places and people in our communities (Jefferson, 2020).
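One kind of structured query described above can be sketched as follows. The rows are invented for illustration and are not drawn from the LAPD dataset, and the column names (`report_id`, `lat`, `lon`) are hypothetical stand-ins for whatever the published file uses.

```python
# Invented sample rows; none of these values come from the real arrest dataset.
sample_rows = [
    {"report_id": "A1", "lat": "34.05", "lon": "-118.24"},
    {"report_id": "A2", "lat": "0.0", "lon": "0.0"},
    {"report_id": "A3", "lat": "0", "lon": "0"},
]


def has_placeholder_location(row: dict) -> bool:
    """True when a row records (0,0), i.e. a coordinate standing in for missing data."""
    return float(row["lat"]) == 0.0 and float(row["lon"]) == 0.0


# Collect the rows whose location is a placeholder rather than a real place.
flagged = [row["report_id"] for row in sample_rows if has_placeholder_location(row)]

print(flagged)
print(f"{len(flagged)} of {len(sample_rows)} rows use (0,0) as a location")
```

A map built directly from such a file would plot the flagged rows at a real point on the globe, so separating them out first is a precondition for any geographic reading of the data.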

Information systems are an essential part of our lives, and the ways they are structured have impacts on people and communities. Critical data modeling is an approach to creating novel close technical readings of information systems and digital objects. By applying data modeling and systems analysis tools in conjunction with studies of the social impacts of information systems, we can reveal the ways that data modeling choices, data format requirements, and information encodings shape our lives.

References

Benjamin, R. (2019). Race after technology: Abolitionist tools for the new Jim code. Polity Press.

Eubanks, V. (2018). Automating inequality: How high-tech tools profile, police, and punish the poor. St. Martin’s Press.

Jefferson, B. (2020). Digitize and punish: Racial criminalization in the digital age. University of Minnesota Press.

Los Angeles Police Department. (2022). Arrest data from 2020 to present. https://data.lacity.org/Public-Safety/Arrest-Data-from-2020-to-Present/amvf-fr72

Noble, S. U. (2018). Algorithms of oppression. New York University Press.

Wickett, K. M. (2023). Critical data modeling and the basic representation model. Journal of the Association for Information Science and Technology, 1–11. https://doi.org/10.1002/asi.24745

Cite this article in APA as: Wickett, K. M. (2023, April 20). Critical data modeling: Tools to trace oppression in information systems. Information Matters, Vol. 3, Issue 4. https://informationmatters.org/2023/04/critical-data-modeling-tools-to-trace-oppression-in-information-systems/

Karen Wickett

Karen M. Wickett is an Assistant Professor in the School of Information Sciences at the University of Illinois. Her research areas include information organization, metadata, knowledge organization, and data modeling. Wickett is most interested in the analysis of common concepts and data models in information systems. Examining the assumptions and models behind these systems and artifacts can reveal bias and help us understand the role of information systems in societal oppression.