LLMs, AI, and the Future of Research Evaluation: A Conversation with Mike Thelwall on Informetrics and Research Impact
Shalini Urs
AI as a Muse and a Maverick Co-Scientist?
2016 marked the year Artificial Intelligence (AI) came of age, as noted by The Guardian. In 2025, it would be an understatement to say that AI has profoundly impacted organizations, societies, and individuals. Initially, AI relied on traditional algorithms such as neural networks, decision trees, and support vector machines, which required structured data and had limited capabilities. Over time, advancements in deep learning and reinforcement learning introduced more sophisticated models like convolutional and recurrent neural networks, enabling the processing of unstructured data such as text, images, and audio. The rise of natural language processing (NLP) led to the development of models like Word2Vec and BERT, revolutionizing how machines represent and understand text.
A significant milestone in AI’s evolution was the emergence of Large Language Models (LLMs), which transformed the way machines understood and generated human language. Early LLMs, such as OpenAI’s GPT-2 and Google’s BERT, demonstrated the potential of transformer-based architectures in processing vast amounts of text data. These models leveraged deep learning techniques to analyze language context, improving applications in machine translation, text summarization, and conversational AI. However, despite their advancements, these models had limitations in coherence, contextual awareness, and generating truly human-like responses.
—Technological breakthroughs have always been disruptive, but what sets ChatGPT apart is the speed and intensity of its impact—
AI’s evolution has also fueled significant advancements in chatbots. Traditionally, chatbots relied on NLP to map user queries to predefined responses. However, with deep learning and language models, they have evolved to provide real-time, human-like interactions. OpenAI’s ChatGPT, built on the Generative Pre-trained Transformer (GPT) architecture, exemplifies this shift, leveraging transformer-based neural networks to predict and generate coherent text. With GPT-3 and its successors, AI now has vast applications, ranging from software development to creative writing and business communication. However, its widespread adoption has also raised concerns, particularly in academia and education, where AI-generated content challenges the distinction between human and machine authorship. While current models rely heavily on deep learning, future advancements may increasingly integrate reinforcement learning to further enhance AI’s capabilities.
Technological breakthroughs have always been disruptive, but what sets ChatGPT apart is the speed and intensity of its impact. The viral Shoggoth meme aptly captures the emergent zeitgeist, embodying both the fear and fascination surrounding this powerful heuristic tool. While ChatGPT is remarkably accurate, its unpredictability remains a challenge, and even its creators do not fully understand how LLMs acquire new abilities or why they behave unexpectedly.
Beyond transforming communication and automation, generative AI is reshaping science and research workflows. AI is already accelerating scientific progress—generating ideas, summarizing literature, analyzing data, predicting outcomes, and more. In this evolving landscape, research evaluation itself must adapt to measure a new republic of science, increasingly co-created by AI.
Dan Atkins, who chaired an Expert Committee of the National Academies of Sciences, Engineering, and Medicine on AI and automated research workflow technologies (ARWs), discusses how these innovations can propel research and scientific discoveries into new frontiers at an accelerated pace in an episode of InfoFire. Physicist Mario Krenn describes AI as a muse, inspiring novel scientific discoveries, while others see it as a maverick co-scientist. Generative AI is not just assisting science—it is redefining it.
A recent opinion paper, “So What if ChatGPT Wrote It?”, presents multidisciplinary perspectives on the opportunities, challenges, and implications of generative conversational AI for research, practice, and policy.
LLMs, Research Evaluation, and SciMetrics
Scott W. Cunningham, in his keynote on “Scientometrics in the Era of Large Language Models,” highlighted the profound impact these technologies may have on scientometric research. He expressed concerns about the fundamental shifts they could bring, discussing both the significance and the challenges of these changes. Cunningham also explored potential strategies for integrating these advancements into scientometric research practices.
SciMetrics, or Research Metrics, encompassing Scientometrics, Bibliometrics, Informetrics, and Altmetrics, is dedicated to measuring scientific progress through publications and research outputs. With the rapid advancement of Generative AI, capable of producing vast and diverse scientific content, the question arises: how must SciMetrics evolve to ensure the integrity, relevance, and fairness of research evaluation? What challenges and opportunities do AI-generated scientific contributions pose for traditional research metrics?
These are critical questions that researchers and research evaluators are grappling with today.
In research evaluation, LLMs are transforming how we assess scholarly impact and institutional performance. Traditional methods, such as citation analysis and h-index calculations, provide quantitative indicators but often miss contextual nuances. LLMs can bridge this gap by analyzing full-text content, identifying emerging research trends, assessing sentiment in peer reviews, and detecting interdisciplinary linkages that conventional metrics might overlook. Additionally, LLMs can assist in evaluating research beyond bibliometric indicators—such as policy influence, media outreach, and societal impact—by mining diverse data sources. However, their integration into research evaluation raises critical questions about biases, transparency, and the interpretability of AI-driven insights.
In this episode of InfoFire, I sit down with Professor Mike Thelwall, an accomplished scholar of Informetrics, to explore the intersections of Large Language Models (LLMs) and research evaluation. We delve into how LLMs are reshaping the landscape of research assessment, examining the promises they hold and the challenges they present in ensuring fair, meaningful, and context-aware evaluations.
In keeping with InfoFire's aim of weaving personal and professional journeys into the evolution of a specific topic or domain, our conversation begins with a question about Professor Thelwall's shift from pure mathematics to informetrics research.
A Serendipitous Journey into Informetrics
The term Informetrics was introduced in 1979 by Otto Nacke, a German documentalist and medical information specialist, and popularized by the British information scientist Bertie Brookes. The International Society for Scientometrics and Informetrics (ISSI), an international association of scholars and professionals active in the interdisciplinary study of the science of science, science communication, and science policy, was founded at the International Conference on Bibliometrics, Informetrics and Scientometrics held in Berlin in September 1993. The first conference in this series had been held in Belgium in 1987, organized by Leo Egghe and Ronald Rousseau.
Informetrics is the study of the quantitative aspects of information, offering insights into patterns of information production and usage. It employs bibliometric and scientometric methods to address issues related to literature information management and the evaluation of science and technology. In one of the seminal works on the subject, Egghe and Rousseau (1990) provide a comprehensive introduction, outlining the key laws and methods of informetrics.
When asked about his own academic path, Mike explains that he began as a pure mathematician, only to discover that pursuing advanced mathematical problems didn’t quite fit his interests or skill set. After a few false starts, he found himself writing computer programs for automated student assessment, especially valuable for large classes. This detour rekindled his research ambitions, albeit outside the realm of pure mathematics.
A twist of fate in the university library led him to the Journal of Documentation and a paper on “Web Impact Factors” by Peter Ingwersen. Intrigued by the idea of capturing hyperlink data from the web, Mike realized that his programming background positioned him perfectly to collect and analyze web-based metrics. He wrote a follow-up study, got it published, and inadvertently stepped into the world of informetrics—a happy accident, as he puts it, thanks to a shelving coincidence that placed library and information science journals right next to computer science.
In retrospect, Mike describes this as an ironic turn: “I became an information scientist by making a library mistake.” Yet this so-called mistake proved fortuitous, allowing him to carve out a niche in webometrics by writing programs to gather and analyze online data.
From Webometrics to Altmetrics
I introduce the term SciMetrics to encompass Bibliometrics, Informetrics, Altmetrics, and Scientometrics (BIAS)—four widely used terms over the past decades. At its core, this domain focuses on measuring science through research outputs. Fassin and Rousseau (2023) analyzed the rise of these terms, comparing their usage across three major databases. Their study highlights a rapid increase in metrics-related research, underscoring its growing significance in science. While web(o)metrics and informetrics have plateaued, bibliometrics and scientometrics remain dominant, with bibliometrics now used five times more frequently than scientometrics.
Evolution of Metrics in Science
- Bibliometrics, rooted in publication and citation analysis, laid the foundation for research impact assessment. Eugene Garfield’s Science Citation Index played a pivotal role in institutionalizing citation-based metrics, establishing it as a fundamental infrastructure and data resource.
- Scientometrics extends bibliometric principles to evaluate and predict trends in scientific activity. The field was shaped by Derek de Solla Price, whose works Little Science, Big Science (1963) and Networks of Scientific Papers (1965) provided its empirical and conceptual tools.
- Webometrics emerged with the advent of the internet, expanding bibliometric influence beyond academia while challenging traditional paradigms (the “Web turn”). Björneborn and Ingwersen (2004) define webometrics as the study of quantitative aspects of web-based information structures.
- Altmetrics (alternative metrics), introduced in 2010, supplement traditional citation-based impact measures by incorporating online engagement—social media mentions, downloads, and article-level metrics. The Public Library of Science (PLOS) pioneered Article-Level Metrics (ALM) in 2009, offering insights beyond citations.
The Wikipedia article on Bibliometrics traces the evolution of the field, beginning with Paul Otlet’s 1934 term bibliométrie, which he defined as ‘the measurement of all aspects related to the publication and reading of books and documents.’ The anglicized term bibliometrics was first introduced by Alan Pritchard in 1969. Over time, the field has expanded, eventually evolving into what is now referred to as Quantitative Science Studies (QSS).
Reflecting on the evolution of informetrics, Mike notes that citation analysis once stood as the near-exclusive source of quantitative indicators for evaluating academic research impact. Over time, however, research evaluators recognized the need to assess non-academic impacts, prompting the search for alternative data sources. Early webometrics approaches—collecting hyperlinks or web citations—were promising but cumbersome to automate and thus never fully commercialized.
The real breakthrough arrived with the rise of social media and the advent of Application Programming Interfaces (APIs). Platforms like Twitter allowed programmatic access to data on public engagement, providing a treasure trove of new metrics. This shift paved the way for altmetrics—a term coined by Jason Priem—encompassing a broad range of online indicators that capture how research resonates beyond traditional scholarly circles.
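To make that shift concrete, here is a minimal sketch of how API-based harvesting of altmetric-style events might look today, assuming the Crossref Event Data service; the endpoint, parameters, and response fields reflect its public documentation and may change, and the DOI is hypothetical:

```python
# Minimal sketch: pulling altmetric-style events for one DOI from an
# aggregated events API (Crossref Event Data is assumed here; endpoint,
# parameters, and response field names may differ from current docs).
import requests
from collections import Counter

DOI = "10.1234/example.doi"  # hypothetical DOI for illustration
URL = "https://api.eventdata.crossref.org/v1/events"

resp = requests.get(URL, params={"obj-id": DOI, "mailto": "you@example.org"}, timeout=30)
resp.raise_for_status()
events = resp.json().get("message", {}).get("events", [])

# Count mentions by source (e.g. social media, Wikipedia, news feeds)
by_source = Counter(e.get("source_id", "unknown") for e in events)
for source, n in by_source.most_common():
    print(f"{source}: {n}")
```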
Companies such as Altmetric.com and Plum Analytics (later acquired by Elsevier) rapidly commercialized these methods, offering data dashboards that track everything from policy citations to social media mentions. For Mike, this signaled a move from largely theoretical exercises to more practical, real-world applications. Suddenly, measuring “non-academic” impacts was not only feasible but also scalable.
Beyond the Ivory Tower: Social Media, Scholarly Communication, and Measuring Research Impact
The rise of digital technologies has profoundly reshaped research, collaboration, and scholarly publishing. Opening Science: The Evolving Guide on How the Internet is Changing Research, Collaboration, and Scholarly Publishing (2014), edited by Sönke Bartling and Sascha Friesike, explores this transformation, emphasizing social media’s role in Open Science. By enabling real-time knowledge exchange, fostering collaboration, and increasing research visibility, social media has become a powerful tool in scholarly communication. The book also examines altmetrics as alternative impact indicators, highlighting how social media accelerates dissemination and fosters a more open, accessible scientific ecosystem.
Web 2.0 has transformed research publication, discovery, and sharing both within and beyond academia, while also introducing innovative frameworks for measuring the broader scientific impact of scholarly work.
Oliver et al. (2023) investigate social media’s role in enhancing public engagement with science. They explore communicative observations as an emerging analytical technique, offering deeper insights into the interactions between scientists and the public. By examining these dynamics, they provide a framework for understanding how digital platforms shape the discourse around scientific research.
Historically, science policy was guided by the assumption that research impact should be assessed solely through scientific standards. Over time, this perspective has shifted, recognizing the need for science to demonstrate its relevance and value to society. While peer review and bibliometrics remain the primary methods for assessing research impact within academia, no universally accepted framework exists for measuring its societal impact. In response, altmetrics have emerged as a promising alternative, capturing public engagement through indicators such as social media mentions, policy citations, and news coverage. Bornmann (2014) provides a comprehensive overview of altmetrics, discussing their potential to measure societal impact while also acknowledging their limitations.
Mike argues that while citation-based indicators continue to dominate research evaluation, altmetrics have broadened the scope to include evidence of non-academic impact. The ability to measure public engagement makes it harder to ignore research that resonates beyond academia. However, he concedes that altmetrics remain supplementary rather than decisive, typically informing expert judgment alongside traditional metrics rather than replacing them.
Despite the initial enthusiasm for social media in scholarly communication, Mike notes a growing fatigue among researchers. With numerous platforms—Twitter, Facebook, LinkedIn, and emerging alternatives—navigating where and how to engage has become increasingly complex. Many scholars are also reassessing the extent of their public sharing, leading to a potential decline in social media’s role in academic discourse. As research communication continues to evolve, striking a balance between visibility, engagement, and scholarly rigor remains a key challenge.
Beyond Citations: Empirical Insights into Research Evaluation and Impact
Research evaluation is a systematic process for assessing the quality, impact, and effectiveness of research activities and outputs. It plays a crucial role in academia, science, and policymaking by ensuring efficient resource allocation, achieving research objectives, and maximizing societal benefits. Research evaluation serves multiple purposes, including assessing individual researchers, institutions, funding programs, and the broader research ecosystem.
Various methods are employed in research evaluation, ranging from qualitative assessments to quantitative metrics. Quantitative evaluation often utilizes SciMetrics, which encompasses BIAS. Among these, citation analysis is one of the most widely used methods.
Two notable books in this field are Citation Analysis in Research Evaluation by Henk F. Moed (2005) and Beyond Bibliometrics: Harnessing Multidimensional Indicators of Scholarly Impact, edited by Blaise Cronin and Cassidy R. Sugimoto (2014). Moed (2005) provides an in-depth exploration of citation analysis as a tool for evaluating scientific research, explaining the theory and methodology behind citation metrics and their application in assessing the impact of scientific work, researchers, journals, and institutions.
Cronin and Sugimoto (2014) offer a comprehensive examination of the evolving methods used to assess scholarly performance and research impact. Featuring contributions from leading experts, the book explores the limitations of traditional bibliometric indicators and discusses the development of alternative metrics, often referred to as “altmetrics.” It addresses the history, critiques, methods, tools, and ethical considerations associated with these new approaches, providing insights into how multidimensional indicators can offer a more nuanced understanding of scholarly influence.
Mike Thelwall’s book, Quantitative Methods in Research Evaluation: Citation Indicators, Altmetrics, and Artificial Intelligence, available on arXiv, critically examines the role of citation data, altmetrics, and AI in assessing research impact. It explores indicators used to evaluate articles, scholars, institutions, and funders, analyzing their strengths, limitations, and the broader challenges of using metrics in research assessment.
Responding to my question about his book, Mike Thelwall acknowledged that it serves as a follow-up to Henk Moed’s seminal work and, in some ways, as a tribute to it. In positioning his book as a continuation, Thelwall emphasized its unique empirical grounding. Unlike Moed’s book, which was written at a time when empirical validation of citation-based indicators was limited, Thelwall’s work is more evidence-based. He explained that he was granted access to a vast dataset of expert-scored journal articles from the UK’s Research Excellence Framework (REF)—a national evaluation exercise that allocates substantial funding based on expert assessments of research quality. Leveraging this dataset, he systematically compared expert judgments with citation-based indicators and altmetrics across 34 distinct fields. The book presents a series of graphs illustrating these relationships, offering empirical insights into how citation-based indicators and altmetrics align with expert evaluations.
The result is a clearer, evidence-based picture of how well these quantitative metrics correlate with human assessments of research quality. According to Thelwall, citation indicators can be valuable but have limitations, especially when evaluating impact beyond academia. While altmetrics show promise, they are not yet widely adopted. By openly sharing his analyses and making the final version of his book open access, Thelwall aims to democratize discussions around research assessment and foster a more transparent approach to evaluating academic impact.
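The analysis behind those comparisons boils down to rank-correlating expert scores with indicator values within each field. The sketch below illustrates the method only, with synthetic data standing in for the confidential REF scores; the field names and effect sizes are invented:

```python
# Minimal sketch: correlating expert quality scores with a citation-based
# indicator, per field. The data are synthetic, since REF scores are not
# public at article level; this only illustrates the method.
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(42)

def synthetic_field(n_articles=500, signal=0.4):
    """Generate REF-style expert scores (1-4) and citation counts that
    share only a partial underlying 'quality' signal."""
    quality = rng.normal(size=n_articles)
    expert = np.clip(np.round(2.5 + quality).astype(int), 1, 4)
    citations = rng.poisson(np.exp(1.0 + signal * quality))
    return expert, citations

for field, signal in [("Chemistry", 0.6), ("History", 0.1)]:
    expert, citations = synthetic_field(signal=signal)
    rho, p = spearmanr(expert, citations)
    print(f"{field}: Spearman rho = {rho:.2f} (p = {p:.3g})")
```

Spearman's rank correlation is a natural choice here because REF-style scores are ordinal and citation counts are heavily skewed.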
LLMs, Research Evaluation, and the REF: Surprising Insights and Emerging Trends
National research evaluation systems are structured frameworks used by governments and institutions to assess the quality, impact, and efficiency of scientific research. These systems influence research priorities, funding allocations, and policy decisions, combining both qualitative and quantitative methods such as peer review, citation analysis, and altmetrics. While some countries focus on performance-based funding, others emphasize broader societal impact, reflecting variations in national priorities, institutional structures, and policy goals. Chris L. S. Coryn’s Models for Evaluating Scientific Research: A Comparative Analysis of National Systems (2008) systematically compares different national approaches to research evaluation.
However, national research evaluation exercises come with significant costs, both direct (millions to hundreds of millions of dollars) and indirect (staff time and institutional resources), raising questions about their cost-effectiveness and impact. Critics question the efficiency and administrative burden of these systems, highlighting the limitations of both qualitative measures like peer review and quantitative metrics. With the rise of Large Language Models (LLMs), such as ChatGPT, there is growing interest in automating aspects of peer review, prompting studies to assess their feasibility and effectiveness.
The Research Excellence Framework (REF), introduced in 2014, evaluates research quality in UK higher education institutions and informs the allocation of £2 billion in public funding annually. Managed by Research England and other UK funding bodies, the REF ensures accountability, benchmarks research performance, and guides funding distribution.
In this evolving landscape, many research evaluation scholars are exploring the potential of LLMs as efficient and effective tools for assessing research quality and impact. Mike Thelwall, a pioneer in this area, has made significant contributions to the field. In his recent work using ChatGPT to assess academic article quality, he uncovered three key findings:
- It Works at All: ChatGPT’s quality scores align positively with expert judgments across multiple fields.
- It Beats Citations: ChatGPT’s evaluations surpass citation-based indicators in predicting research quality.
- Abstract-Only is Enough: ChatGPT’s assessments based solely on an article’s title and abstract outperform citation metrics, offering cost-saving and copyright benefits (a minimal scoring sketch follows this list).
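As a rough illustration of the abstract-only setup, the sketch below asks an LLM to score a title and abstract against a REF-like rubric. It is not the prompt, model, or scale used in Thelwall's studies; the rubric wording, model name, and helper function are assumptions for illustration:

```python
# Minimal sketch of abstract-only quality scoring with an LLM.
# NOT the exact prompt or configuration from Thelwall's studies; the rubric,
# model name, and score scale are illustrative assumptions.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

RUBRIC = (
    "You are an expert research assessor. Score the following article on a "
    "1-4 scale for originality, significance, and rigour (4 = world-leading). "
    "Reply with the score on the first line, then a one-paragraph justification."
)

def score_article(title: str, abstract: str, model: str = "gpt-4o-mini") -> str:
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": RUBRIC},
            {"role": "user", "content": f"Title: {title}\n\nAbstract: {abstract}"},
        ],
        temperature=0,  # reduce run-to-run variation; repeated runs can still differ
    )
    return response.choices[0].message.content

print(score_article("A hypothetical title", "A hypothetical abstract ..."))
```

In practice, such scores vary between runs, so averaging repeated queries is a common way to stabilise them before comparing with expert judgments.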
Mike clarifies that ChatGPT doesn’t “look up” articles online but relies strictly on the text provided by users. While it may sometimes generate inaccuracies or “hallucinations,” its overall predictive power remains strong. As I suggest in our conversation, these occasional inaccuracies can spark creativity, much like science fiction inspiring real-world innovations.
Mike envisions ChatGPT playing a role akin to citations in major research evaluation systems like the UK’s REF. Rather than replacing expert review, LLM-based assessments could serve as supplementary data points, particularly when expert opinions differ. ChatGPT’s flexibility could also extend its relevance to fields where citation metrics are less applicable. However, its use raises questions about transparency, reliability, and ethics, underscoring that human judgment remains essential.
Transparency and Trust in LLM-Based Assessments
Transparency and trust are core principles of Responsible AI, a framework designed to guide the development, deployment, and use of AI in a way that builds confidence among organizations and stakeholders. Many governments and agencies are leading the charge in promoting Responsible AI across industries. The Responsible AI Institute (RAI) collaborates with policymakers, industry leaders, and technologists to develop AI benchmarks and governance frameworks. One of our previous InfoFire episodes with Carolyn Watters delved into Responsible AI.
Large Language Models (LLMs) have reached a critical juncture in ensuring their responsible development and deployment. Establishing transparency, a central tenet of Responsible AI, is essential for LLMs and their applications. You and Chon (2024) conducted a systematic review exploring the current research landscape on trust and safety in LLMs, with a particular focus on LLMs’ novel applications in the Trust and Safety field. Huang et al. (2024) introduced TrustLLM, a comprehensive study on the trustworthiness of LLMs, which sets out principles for evaluating trustworthiness across multiple dimensions, establishes benchmarks, and analyzes mainstream LLMs, while also discussing open challenges and future directions.
Mike’s experiences—ranging from early webometrics to cutting-edge ChatGPT applications—highlight a recurring theme: new data sources and technologies can reveal aspects of research impact previously overlooked.
Given that LLMs can provide plausible evaluations, yet their decision-making processes remain opaque, I asked: How should researchers integrate these tools to ensure transparency and trustworthiness in assessments?
Mike explains that when he prompts ChatGPT to assign a quality score to a research article, he also asks for an explanation. While ChatGPT offers a reasoning response, it is not definitive. In theory, it could provide an equally plausible rationale for a different score, reflecting the inherent ambiguity in both AI- and human-generated evaluations.
Mike notes that this is similar to peer review, where two experts may disagree on an article’s merit, each offering equally credible justifications. He concludes that complete transparency is elusive, whether the evaluator is an LLM or a human. However, acknowledging the limitations of the technology—being transparent about what it can and cannot do—promotes responsible usage.
When asked about whether ChatGPT’s “reasoning” feature improves transparency, Mike explains that while it can be helpful for some tasks, he hasn’t found it directly beneficial for research quality scoring. He sees potential for future improvements, but for now, the tool’s explanations still require human scrutiny.
The Future of Informetrics: From Citations to Full-Text Analysis
With their capacity to process vast amounts of text in seconds, LLMs such as ChatGPT have become widely used tools for scholars, aiding in content creation and review. They help scholars refine their own work or evaluate that of others through extraction, summarization, and assessment of papers. The role of LLMs in scholarly communication is being extensively studied as fields like SciMetrics, informetrics, and bibliometrics shift from citation-based approaches to those focusing on content and context. Scholars such as Thelwall are exploring LLMs’ impact by examining their efficiency and effectiveness in research evaluation.
Given Mike’s background in mathematics, I inquired about the role of mathematics and computational models in shaping informetrics. He believes that computational power, rather than purely mathematical modeling, will drive the transformation. Full-text analysis, for example, could yield deeper insights than citation data alone—particularly if paywalls and copyright restrictions are addressed.
Mike envisions a future where AI-powered tools can analyze the full text of scholarly works, offering a nuanced understanding of how knowledge is transferred, how methods are shared or adapted, and whether citations reflect genuine engagement or merely acknowledgment. Sentiment analysis could also assess whether citations are positive, neutral, or critical. However, as Mike points out, many fields discourage overt criticism in published articles, complicating any automated sentiment analysis.
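As a hypothetical illustration of that citation-sentiment idea, the sketch below uses an off-the-shelf zero-shot classifier to label citation contexts as endorsing, neutral, or critical; the label phrasing and example sentences are invented, and a purpose-built model would be needed for real evaluations:

```python
# Minimal sketch of citation-context sentiment classification.
# A general-purpose zero-shot classifier stands in for a purpose-built model;
# the label set and example sentences are illustrative assumptions.
from transformers import pipeline

classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

LABELS = [
    "the citation endorses the cited work",
    "the citation is a neutral acknowledgment",
    "the citation criticises the cited work",
]

contexts = [
    "Building on the robust framework of Smith et al. (2020), we extend ...",
    "Data were normalised following Smith et al. (2020).",
    "However, the sample in Smith et al. (2020) is too small to support this claim.",
]

for text in contexts:
    result = classifier(text, candidate_labels=LABELS)
    best_label = result["labels"][0]  # highest-scoring label
    print(f"{best_label}  <-  {text[:60]}...")
```

Because, as Mike notes, overt criticism is rare in many fields, most contexts would land in the neutral or positive classes, which is precisely what makes automated citation sentiment analysis difficult.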
Will AI Take Over?
A common question in the AI era is whether AI will eventually render certain academic roles obsolete. Mike is skeptical. He sees LLMs as valuable tools that streamline tasks and enhance human decision-making. Like citation metrics, LLM-based scores should complement expert judgment rather than replace it.
Mike anticipates that informetricians will spend more time explaining the limitations of AI-based evaluations. He believes that LLMs will expand the field’s toolkit, raising new questions about methodology, bias, and ethics, ultimately enriching rather than diminishing human expertise. While LLMs can offer new data points and analytical capabilities, they cannot fully capture the complexities of human judgment. The future of research evaluation lies in leveraging AI’s strengths while maintaining critical oversight and interpretive nuance, so that technology enhances human insight.
Cite this article in APA as: Urs, S. (2025, March 13). LLMs, AI, and the future of research evaluation: A conversation with Mike Thelwall on informetrics and research impact. Information Matters, 5(3). https://informationmatters.org/2025/03/llms-ai-and-the-future-of-research-evaluation-a-conversation-with-mike-thelwall-on-informetrics-and-research-impact/
Author
Dr. Shalini Urs is an information scientist with a 360-degree view of information and has researched issues ranging from the theoretical foundations of information sciences to Informatics. She is an institution builder whose brainchild is the MYRA School of Business (www.myra.ac.in), founded in 2012. She also founded the International School of Information Management (www.isim.ac.in), the first Information School in India, as an autonomous constituent unit of the University of Mysore in 2005 with grants from the Ford Foundation and Informatics India Limited. She is currently involved with Gooru India Foundation as a Board member (https://gooru.org/about/team) and is actively involved in implementing Gooru’s Learning Navigator platform across schools. She is professor emerita at the Department of Library and Information Science of the University of Mysore, India. She conceptualized and developed the Vidyanidhi Digital Library and eScholarship portal in 2000 with funding from the Government of India, which became a national initiative with further funding from the Ford Foundation in 2002.