Rethinking Reuse in Data Lifecycle in the Age of Large Language Models
Jaihyun Park, Kai Li
The recent surge in AI models can largely be attributed to growing computational power and the unprecedented availability of data. This is why we call them “large” language models (LLMs), even though “large” hardly captures the true scale of these systems. As data is generated at an ever-accelerating pace, we are already swamped and disoriented. In the digital world we live in, some data slips past our awareness, but very little data ever truly disappears. Because we, as information scientists, are concerned with the reproducibility and responsibility of research, data lifecycle models have been developed to manage this complexity. To foster open, transparent, and collaborative science, such models typically prescribe archiving data in a repository at the end of a project. This is often followed by the final step of the lifecycle: data reuse. Traditionally, the model is cyclical, with reused data prompting new questions and fueling subsequent rounds of research.
—Information scientists need to rethink the lifecycle of research data beyond the conventional notion of “reuse” as the final stage—
But as we enter the age of AI, it’s worth asking: does this traditional data lifecycle model still hold up? And how might it be misaligned with the ways data is created, consumed, and reused in the context of large-scale and complex AI systems?
While there are many variations of the data lifecycle model, a typical model follows a roughly linear process. It usually begins with data collection or creation. After acquiring the data, researchers clean out noisy, redundant, and/or irrelevant records to prepare it for analysis aligned with their research questions. After analysis, a crucial step is to deposit the dataset in an open data repository so that it is preserved and can be found and reused in the future, a practice increasingly mandated by research funders and journals.
While the lifecycle may appear to end with archiving, it actually extends into potential future reuse. Enabling data reuse is a key objective of responsible data management. However, a range of socio-technical factors can affect how easily data can be (re)used. Many of these factors are addressed by the FAIR principles (i.e., Findability, Accessibility, Interoperability, and Reusability). Many data policies inspired by FAIR strive to promote data reuse through better documentation, richer metadata, and greater visibility for data objects.
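As a concrete, deliberately simplified illustration, a FAIR-inspired metadata record deposited alongside a dataset might look something like the sketch below. Every field name and value here is hypothetical rather than drawn from any particular repository schema; real deposits would follow a standard such as the DataCite metadata schema.

```python
# A minimal, hypothetical sketch of FAIR-inspired dataset documentation.
# Field names are illustrative only, not a real repository schema.
dataset_record = {
    "identifier": "doi:10.xxxx/example",  # Findable: a persistent identifier (placeholder DOI)
    "title": "Example survey responses, 2024",
    "access_url": "https://repository.example.org/datasets/1234",  # Accessible: a stable access point
    "format": "text/csv",                 # Interoperable: an open, standard format
    "license": "CC-BY-4.0",               # Reusable: explicit usage terms
    "provenance": {                       # Reusable: how the data came to be
        "collected_by": "Example Lab",
        "collection_method": "online survey",
        "processing_steps": ["deduplication", "anonymization"],
    },
}

print(sorted(dataset_record))
```

Even a sketch like this makes visible the kind of provenance information that, as discussed below, tends to get lost when datasets are absorbed into large-scale AI training.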
Especially in the context of AI model training, the origins and collection methods of datasets are often overshadowed by the sheer volume of data. This focus on quantity over quality creates opportunities for biases and inaccuracies to propagate through the models being developed. This is why the data lifecycle model needs renewed attention: we must think beyond the FAIR principles and consider how to categorize and itemize the societal impact of reusing datasets. The life of data does not end at the point of reuse; rather, it continues into various “afterlives,” particularly in model training scenarios. A recent example underscores this concern: researchers found that DeepSeek, a Chinese LLM, mistakenly identifies itself as ChatGPT, raising the possibility that it was trained on ChatGPT-generated content. LLMs are, at their core, trained to predict which words come next in a sentence, as in a writing scenario; if DeepSeek reproduces ChatGPT’s self-description, we can reasonably infer that it reused data generated by ChatGPT.
One major caveat of reusing data from another LLM is that the new model can amplify the biases of the old one. It has also been found that AI models can experience “model collapse,” in which the quality and diversity of generated outputs degrade over time when AI-generated text is reused for training. To avoid these unwanted consequences, creating and maintaining documentation of how data is reused becomes a critical step within the data lifecycle framework in the new technological landscape.
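The model-collapse dynamic can be sketched with a toy simulation: repeatedly refit a simple model (here, a Gaussian) on samples drawn from the previous generation’s fit, and the fitted spread drifts toward zero. This is only a minimal statistical analogue of the effect Shumailov et al. (2024) report for LLMs, not a simulation of LLM training itself; all parameters below are illustrative.

```python
import random
import statistics

def collapse_demo(generations=1000, n_samples=20, seed=0):
    """Toy 'model collapse' sketch: each generation fits a Gaussian to
    samples generated by the previous generation's fit. With finite
    samples, the estimated spread drifts downward, so the diversity of
    generated outputs collapses over time."""
    rng = random.Random(seed)
    mean, stdev = 0.0, 1.0  # generation 0: the "real" data distribution
    for _ in range(generations):
        samples = [rng.gauss(mean, stdev) for _ in range(n_samples)]
        mean = statistics.fmean(samples)   # "retrain" on generated data
        stdev = statistics.stdev(samples)
    return stdev

# Spread remaining after many generations of training on self-generated data:
print(collapse_demo())
```

In this toy setting, drawing more samples per generation slows the collapse but does not remove the downward drift, which is why documentation of whether training data is AI-generated matters in the first place.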
Information scientists, therefore, need to rethink the lifecycle of research data beyond the conventional notion of “reuse” as the final stage. The field of AI, however, is evolving rapidly, and companies and researchers release new models at an ever-faster pace. Because of heated competition among technology companies, many models look much the same from an end-user’s perspective, yet many are built on data sources that are poorly documented or entirely opaque. As the number of models grows, it becomes nearly impossible to trace where their training data came from.
This challenge calls for urgent action: we need to revisit and redesign existing data lifecycle models to explicitly address the complexities and consequences of data reuse in AI development. Only through such proactive adaptation can we ensure that data stewardship keeps pace with technological advancement.
Further Readings
Park, J., & Cordell, R. (2023). The ripple effect of dataset reuse: Contextualising the data lifecycle for machine learning data sets and social impact. Journal of Information Science, 01655515231212977. https://doi.org/10.1177/01655515231212977
Park, J., & Jeoung, S. (2022, May). Raison d’être of the benchmark dataset: A survey of current practices of benchmark dataset sharing platforms. In Proceedings of NLP Power! The First Workshop on Efficient Benchmarking in NLP (pp. 1-10). https://doi.org/10.18653/v1/2022.nlppower-1.1
Shumailov, I., Shumaylov, Z., Zhao, Y., Papernot, N., Anderson, R., & Gal, Y. (2024). AI models collapse when trained on recursively generated data. Nature, 631(8022), 755-759. https://doi.org/10.1038/s41586-024-07566-y
Cite this article in APA as: Park, J., & Li, K. (2025, April 16). Rethinking reuse in data lifecycle in the age of large language models. Information Matters. https://informationmatters.org/2025/04/can-ai-help-to-predict-the-scholarly-impact-of-new-scientific-papers/
Authors
Dr. Kai Li (https://orcid.org/0000-0002-7264-365X) is an Assistant Professor at the University of Tennessee, Knoxville. He is interested in scholarly communication, science of science, and open science.