IR Meets AI: Looking Back, Looking Forward — A Conversation with Stephen Robertson, Pioneer of Information Retrieval

Shalini Urs

Right Info, Right Person, Right Time: The Eternal Quest of IR

While F.K.W. Drury articulated the high purpose of book selection as “to provide the right book for the right reader at the right time” in his 1930 work, Book Selection, S.R. Ranganathan codified this principle a year later in his Five Laws of Library Science (1931), specifically through the second law, “Every reader his book,” and the third law, “Every book its reader.” Decades later, Teri Lesesne revived and popularized this ideal in her 2003 book Making the Match: The Right Book for the Right Reader at the Right Time, aimed at young adult readers.

Information Retrieval (IR) is the science and engineering of searching for relevant information in large collections—especially unstructured text. IR involves indexing, querying, and ranking information objects, typically documents or web pages, to satisfy a user’s information need. It draws on algorithms, relevance models, and user behavior to return the most useful results. Over the past seven decades, IR has evolved from its origins in set-theoretic and logic-based approaches to probabilistic models that reckon with uncertainty, context, and user intent.

This journey has witnessed models shaped by ideas from mathematics, statistics, linguistics, and increasingly, machine learning and AI. From the Boolean model and vector space model to probabilistic relevance frameworks and language modeling, IR’s history is marked by shifts in both formalism and philosophy.

“TF-IDF is purely heuristic.”
“TF-IDF is not a model, it is a weighting scheme in the vector-space model.”
“LM is a clean probabilistic approach.”
“LM is full of hacks and holes.”

“LLMs are not search engines—they are storytellers.”

“AI is neither artificial nor intelligent—it’s engineered inference.”

Such comments, heard often at IR/AI conferences, reflect both the depth and the debate that characterize this dynamic field. IR has always been a landscape of competing intuitions, elegant formalisms, and practical hacks—all in pursuit of the same goal: relevance.

Stephen Robertson: Architect of Probabilistic IR

Among the luminaries who shaped modern IR, Stephen Robertson stands tall. He is a pioneer, having laid the foundations of the field—particularly the probabilistic models of retrieval.

Along with Karen Spärck Jones, Robertson developed the probabilistic relevance framework (PRF) and the BM25 ranking function, which remains a foundational baseline in both academic and industrial IR systems.

Yet, for someone whose influence spans decades, there is no canonical book by him on IR theory. Why?

“Why did you never write a book?”

“I started one, and even a second one; but I never finished them, because I found it impossible to keep the notation consistent.”

—Stephen Robertson, quoted in Information Retrieval Models: Foundations and Relationships by Thomas Roelleke (2013)

This quote is revealing in multiple ways. It captures not only Robertson’s characteristic humility and dry British wit, but also something essential about IR itself. Writing a book on IR is not merely a narrative challenge—it is a conceptual and notational juggling act. Over the years, retrieval models have sparked intense debate and persistent confusion, and presenting them with elegance, consistency, and clarity is no small feat.

In this episode of InfoFire, I speak with Dr. Stephen Robertson, a pioneering figure in the field of Information Retrieval (IR), who offers a reflective account of its conceptual and methodological evolution. The conversation explores:

  • The foundational models that shaped the early trajectory of IR
  • The development and enduring influence of evaluation methodologies, notably the Cranfield paradigm
  • The theoretical and practical implications of the probabilistic turn in IR
  • The increasing integration of machine learning techniques into IR systems
  • And future directions for the field considering recent advances in Retrieval-Augmented Generation (RAG), Large Language Models (LLMs), and AI-native approaches to search

In tracing these threads, we hope to honor the intellectual legacy of IR’s pioneers, recall some of the telling personal moments that illuminate bygone eras and the rivalries between legendary figures and between the British and American schools, and explore how some of these ideas are being reimagined in today’s AI-infused world.

Understanding IR Models: A Brief Taxonomy

Over the decades, researchers have proposed a wide variety of models to capture the elusive task of retrieving relevant information from large document collections. These models differ not only in mathematical underpinnings but also in how they conceptualize the relationship between documents and queries.

  • Set-theoretic models—such as the Standard Boolean, Extended Boolean, and Fuzzy retrieval models—treat documents and queries as sets of terms. Similarity is assessed through set operations like intersection or union, offering binary relevance decisions but lacking nuanced ranking.
  • Algebraic models represent documents and queries as vectors or matrices. The classic Vector Space Model (VSM) and its enhancements (like Latent Semantic Indexing) compute similarity via scalar measures (e.g., cosine similarity), allowing ranked results based on geometric interpretation; a brief sketch follows this taxonomy.
  • Probabilistic models bring statistical inference to IR. Pioneered by Maron and Kuhns and later refined by Stephen Robertson and others, these models—such as the Binary Independence Model, BM25, and Language Models—estimate the likelihood that a document is relevant to a given query using principles like Bayes’ Theorem.
  • Feature-based models, prevalent in modern learning-to-rank frameworks, treat documents as collections of feature values and optimize scoring functions using machine learning. They provide a flexible architecture that can integrate signals from various retrieval paradigms.

This taxonomy captures the evolving sophistication of IR, from symbolic operations on sets to probabilistic reasoning and learning from data. As IR matured, so did its models—each striving for greater expressiveness, interpretability, and performance.
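To make the vector-space idea concrete, here is a minimal sketch in Python under the simplest possible assumptions: whitespace tokenization, raw term counts as weights (no TF-IDF, stemming, or stop-word removal), and cosine similarity as the ranking measure. The documents and query are invented for illustration.

    import math
    from collections import Counter

    def cosine(a, b):
        # Cosine similarity between two sparse term-weight dictionaries.
        common = set(a) & set(b)
        dot = sum(a[t] * b[t] for t in common)
        norm_a = math.sqrt(sum(w * w for w in a.values()))
        norm_b = math.sqrt(sum(w * w for w in b.values()))
        return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

    docs = ["the right book for the right reader",
            "probabilistic models of information retrieval"]
    query = "probabilistic information retrieval"

    doc_vectors = [Counter(d.split()) for d in docs]   # raw term counts as weights
    query_vector = Counter(query.split())
    scores = [(i, round(cosine(query_vector, v), 3)) for i, v in enumerate(doc_vectors)]
    print(sorted(scores, key=lambda s: s[1], reverse=True))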

Before and Beyond BM25

Proponents of probabilistic models view the information retrieval problem as one of inference and prediction, advocating for the use of probabilistically weighted indexes and ranked outputs. This line of thinking was first formulated and written up by Maron and others in August 1958, with the first journal article proposing a probabilistic approach published in 1960 by Maron and Kuhns. Thompson (2008) reflects on the seminal article by Maron and Kuhns, highlighting its foundational role and lasting influence on the field of information retrieval.

Around the same time, Bill Cooper at Berkeley independently developed the Probability Ranking Principle (PRP), though he did not publish it then. Other researchers, such as Miller and Barkla, were also exploring similar methods involving term weighting. Meanwhile, Karen Spärck Jones introduced the influential concept of inverse document frequency (IDF), a heuristic that quickly demonstrated its practical effectiveness.

Karen Spärck Jones and the Birth of IDF: The Enduring Heuristic

In 1972, Karen Spärck Jones published a seminal paper titled “A statistical interpretation of term specificity and its application in retrieval”, introducing what would later be known as inverse document frequency (IDF). Based on counting the number of documents containing a given term, IDF provided a simple yet powerful heuristic: terms that appear in many documents are poor discriminators and should be down-weighted, while rarer terms are more valuable for retrieval.

This insight represented a major leap in IR. When combined with term frequency (TF)—which gives more importance to terms that appear more frequently in a document—it formed the foundation of the TF-IDF weighting scheme. Despite the development of more complex models, TF-IDF has remained remarkably robust and widely used. Its influence extends far beyond text retrieval, underpinning modern search engines and numerous applications in natural language processing (NLP).
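As an illustration of the TF-IDF idea, the following sketch uses one common textbook variant, idf(t) = log(N / df(t)), with raw within-document counts as TF. The toy documents are invented, and real systems differ in their exact formulas and preprocessing.

    import math
    from collections import Counter

    docs = [
        "information retrieval models",
        "probabilistic retrieval and relevance",
        "library classification and indexing",
    ]
    N = len(docs)
    tokenized = [d.split() for d in docs]
    df = Counter(t for doc in tokenized for t in set(doc))   # document frequency

    def tf_idf(term, doc_tokens):
        tf = doc_tokens.count(term)                          # term frequency in this document
        idf = math.log(N / df[term]) if df[term] else 0.0    # inverse document frequency
        return tf * idf

    # "retrieval" appears in two of three documents, so it scores lower than the
    # rarer term "probabilistic" within the same document.
    print(tf_idf("retrieval", tokenized[1]), tf_idf("probabilistic", tokenized[1]))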

RSJ and OKAPI: Toward Probabilistic Ranking

In 1976, Stephen Robertson and Karen Spärck Jones proposed a probabilistic framework for IR, known as the RSJ probabilistic model. This model laid the groundwork for ranking documents based on their estimated probability of relevance.

During the early 1980s, Gillian Venner, Nathalie Mitev, and Stephen Walker conducted pioneering research on online public access catalogs (OPACs) at the Polytechnic of Central London (PCL), culminating in the development of a prototype system named OKAPI—short for “Online Keyword Access to Public Information.” This early work on OPACs predated even the first internet search tools, such as Archie.

In July 1989, OKAPI moved to the Centre for Interactive Systems Research at City University, where further development continued under Robertson, Walker, and others. The team participated in the U.S. NIST TREC (Text REtrieval Conference) initiative to refine term-weighting algorithms and retrieval strategies.

OKAPI’s implementations included various “Best Match” models, which combined global and local weighting strategies—using RSJ weights globally and parameterized term frequencies locally. The best-known result of this evolution is BM25, a ranking function that remains a cornerstone of modern probabilistic IR systems.
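For readers who want to see the shape of the function, here is a minimal sketch of a textbook BM25 variant: an RSJ-style inverse document frequency weight combined with a saturating, document-length-normalized term-frequency component. The parameter values, the toy collection, and the non-negative form of the IDF (adding 1 inside the log) are illustrative choices, not a description of any particular Okapi implementation.

    import math
    from collections import Counter

    def bm25_score(query_terms, doc_tokens, df, N, avgdl, k1=1.2, b=0.75):
        tf = Counter(doc_tokens)
        score = 0.0
        for t in query_terms:
            if t not in tf:
                continue
            idf = math.log((N - df[t] + 0.5) / (df[t] + 0.5) + 1)   # non-negative RSJ-style weight
            norm = 1 - b + b * len(doc_tokens) / avgdl              # document-length normalization
            score += idf * tf[t] * (k1 + 1) / (tf[t] + k1 * norm)   # saturating term frequency
        return score

    docs = [d.split() for d in [
        "okapi best match weighting for text retrieval",
        "boolean retrieval with controlled vocabularies",
    ]]
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    df = Counter(t for d in docs for t in set(d))
    print([round(bm25_score("best match retrieval".split(), d, df, N, avgdl), 3) for d in docs])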

In a comprehensive 2009 report, Stephen Robertson and Hugo Zaragoza revisit and review the Probabilistic Relevance Framework (PRF). Rooted in research from the 1970s and 1980s, the PRF led to the development of BM25, a highly effective and widely adopted text retrieval model. Later, the framework evolved to incorporate document metadata—particularly structural and link-graph information—resulting in BM25F, a variant optimized for web and enterprise search. The report outlines the conceptual basis of PRF, including key models such as the binary independence model, relevance feedback mechanisms, and the transition from BM25 to BM25F. It also explores the connection between PRF and other statistical models for information retrieval, along with discussions on integrating non-textual features and optimizing parameters in models with free variables.

BM25: From TREC Challenge to Enduring Benchmark

I asked Dr. Robertson about BM25’s inherent strengths and why it has remained a benchmark in IR research and practice.

Dr. Robertson responded by saying that the development of BM25 involved a degree of luck. He recounted that the RSJ model was applied in various places, including the OKAPI library catalog developed by his colleague Stephen Walker in the late 1980s. Dr. Robertson then discussed how Steve Walker’s project was transferred to City University, where they continued to collaborate.

A significant turning point, according to Dr. Robertson, was the emergence of TREC (Text REtrieval Conference). While they had conducted testing with Cranfield and other test collections, TREC presented a new challenge with its unstructured text databases, unlike the library catalogs they were accustomed to. Dr. Robertson noted that the RSJ model did not perform well in TREC’s early stages. He identified a key reason for this: the model did not normalize term frequency by document length, something Salton’s SMART system did.

Dr. Robertson explained that incorporating term frequency into the probabilistic model was a challenge. He and Keith van Rijsbergen had previously explored this in another project using abstracts, which inherently have a term frequency component. Although they had some initial ideas, they hadn’t developed a robust method to integrate it with the probabilistic model. Dr. Robertson continued to work on this problem during the first year of TREC.

He noted that the size of the databases provided by TREC was a significant obstacle. While 2 gigabytes might seem small today, it was a substantial amount of data at the time. Technical difficulties hindered their progress in TREC 2, and they were unable to submit proper results, lagging behind the leaders.

However, Dr. Robertson was confident they could improve. This challenge pushed him to refine the model. They finally achieved success in TREC 3, performing extremely well. Dr. Robertson believes that this period of intense work led to the development of a strong ranking algorithm. He emphasized its enduring quality, noting that it has been used for 30 years and is often employed as a baseline for comparison in information retrieval research. While researchers often manage to slightly outperform it, he stated that substantially beating it is quite difficult. He concluded that BM25’s effectiveness stems from its development during the demanding early years of TREC.

If I Were to Design BM25 Today

In response to a question about whether he would change anything about BM25 if developing it in hindsight, Dr. Robertson stated:

Looking back, I don’t think I would fundamentally change BM25. At the time, we experimented with various refinements, including more complex models, but they often proved difficult to implement effectively. While I might describe BM25 differently today, its core form would remain largely the same.

One extension I worked on at Microsoft was a field-weighted variant of BM25. This is particularly useful when dealing with structured documents—such as library catalogs—with well-defined fields like titles, abstracts, full text, and keyword metadata. The idea was to assign different weights to different fields, for example, giving more weight to title words.

When I joined Microsoft, I noticed that some teams were applying BM25 to multi-field data in an ineffective way. My colleagues and I developed a proper field-weighted adaptation, which proved beneficial for those specific use cases. Ultimately, while BM25 is not designed for the complexities of modern web search, it remains a robust and widely respected algorithm in information retrieval, still holding its ground even after three decades.
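A sketch of that field-weighting idea follows, under common textbook assumptions about BM25F rather than the exact Microsoft formulation: per-field term frequencies are length-normalized, multiplied by field weights, summed into a single pseudo-frequency, and then passed through one BM25-style saturation step. The field weights, parameters, IDF values, and toy document are all illustrative.

    def bm25f_tf(term, doc_fields, field_weights, avg_len, b=0.75):
        # Weighted, length-normalized pseudo term frequency across fields.
        pseudo_tf = 0.0
        for field, tokens in doc_fields.items():
            tf = tokens.count(term)
            norm = 1 - b + b * len(tokens) / avg_len[field]
            pseudo_tf += field_weights[field] * tf / norm
        return pseudo_tf

    def bm25f_score(query_terms, doc_fields, field_weights, avg_len, idf, k1=1.2):
        score = 0.0
        for t in set(query_terms):
            tf = bm25f_tf(t, doc_fields, field_weights, avg_len)
            score += idf.get(t, 0.0) * tf / (k1 + tf)      # single saturation over combined tf
        return score

    doc = {"title": "okapi at trec".split(),
           "body": "best match weighting functions in the okapi system".split()}
    weights = {"title": 3.0, "body": 1.0}                  # title words count more
    avg_len = {"title": 4.0, "body": 10.0}
    idf = {"okapi": 2.0, "trec": 1.5, "weighting": 1.0}    # illustrative global weights
    print(round(bm25f_score("okapi weighting".split(), doc, weights, avg_len, idf), 3))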

Competition and Collaboration in the Evolution of Information Retrieval

IR has deep roots in both the UK and the US, with a rich history shaped by diverse research groups and communities. I asked: Could you take us through the journey of how key figures influenced and inspired one another in the development of IR? How did the interplay between competition and collaboration—across institutions and geographies—help propel the field forward?

Robertson responded with an expansive and characteristically witty reflection:

That’s a great question. I think both collaboration and competition played a vital role in how information retrieval developed as a field. When I first entered the domain, it was quite small—you could practically know everyone in the field, certainly within the UK, and even a good number globally.

I got to know some of the key people early on—Gerard Salton and his group in the US, for example, and of course, the UK community, including Cyril Cleverdon, who designed the Cranfield experiments. My own introduction to IR came during my master’s at City University, under Jason Farradane. I believe Nick Belkin also mentioned Farradane in your interview with him. Farradane was a fascinating character—he coined the term “information science” and worked extensively to build a professional identity for the field in the UK. He led the department at City where I studied, and while we didn’t always agree, working with him was intellectually stimulating.

One amusing and somewhat intense episode from those early days: my master’s thesis—eventually published as two papers in Journal of Documentation—criticized some aspects of the Cranfield methodology. Cleverdon, unhappy with my take, called up the director of Aslib (my employer at the time) and threatened to sue for libel! Thankfully, the director was a diplomatic figure and defused the situation. Interestingly, my relationship with Cleverdon improved over time. Despite our early clash, I always held deep respect for what he accomplished. The Cranfield experiments, for all their limitations, were groundbreaking. He was the first to seriously evaluate retrieval systems in a rigorous way, and that contribution remains foundational.

After my master’s, I began a part-time PhD and reached out to Karen Spärck Jones about re-analyzing Cranfield data. That meeting was the beginning of a long and fruitful collaboration. Around that time, she had just published her IDF (Inverse Document Frequency) paper, which really caught my imagination. It became the springboard for what would eventually evolve into the RSJ model—the probabilistic model we co-developed.

The development of RSJ itself is a story of inspiration and academic sparring. Karen had received a preprint from one of Salton’s students suggesting a way to reorder documents within levels of coordination. I remember saying, “Surely we can do better than that!” That kicked off my attempt to create a truly probabilistic model, drawing on ideas like Cooper’s Probability Ranking Principle and earlier work by Maron and Kuhns. I corresponded with both Cooper and Maron, who were based in Berkeley, and their insights shaped my thinking.

When Karen and I submitted our RSJ paper in 1976 to JASIS, it was sent to Salton for review. Today it’s unthinkable to know your referee’s identity, let alone engage with them directly, but in this case, Karen knew Salton and ended up in a direct correspondence with him, debating the paper’s merits. I fed her counterarguments while she did the writing. Salton had many criticisms, but eventually—grudgingly—he agreed to its publication. That paper, to my surprise, became my most cited work, and interestingly, Karen’s as well—despite her extensive contributions to areas like natural language processing.

I also had the opportunity to visit the US for the first time soon after. I met Cooper and Maron in Berkeley, and then visited Salton’s lab at Cornell, where I presented a seminar to his students. They seemed convinced by the RSJ model—perhaps more so than Salton himself, who was still skeptical at the time. But even that slightly uneasy relationship improved over time.

Other key figures in my journey included my PhD supervisor, B.C. Brookes, a statistician with a strong interest in bibliometrics. Though his focus wasn’t directly on IR, his statistical approach helped shape my thinking. Michael Keen was another important influence—he had worked on the original Cranfield project and spent time in Salton’s lab, making him something of a bridge between the UK and US schools of thought.

It’s also worth noting a disciplinary divide that shaped the field—the distinction between Library and Information Science (LIS) schools and Computer Science departments. Karen, though based in Cambridge’s Computer Lab, had an arts background that gave her sympathy for LIS perspectives. Cooper and Maron were in Berkeley’s LIS school, while Salton was firmly rooted in computer science. This divide sometimes meant people weren’t always talking to each other, but when they did, it could be incredibly productive.

The evolution of IR was driven by this dynamic interplay of rivalry and partnership, across both national and disciplinary boundaries. That blend of competition and collaboration was critical.

Cranfield and the Evolution of IR

The Cranfield project—the second one especially—was designed to adjudicate between competing library-science-based approaches to retrieval: controlled vocabularies, classification systems, relational indexing, and so on. Ironically, almost everyone who had a stake in those methods ended up disappointed by the results, which suggested that using natural language—just plain text—could be more effective than the more structured systems they championed.

The scale of Cranfield was modest by today’s standards, and you wouldn’t rely on its results in isolation now. But it was a landmark moment—it helped pivot the field away from traditional classification-based approaches and toward what became core IR techniques: text indexing, term weighting, evaluation metrics like recall and precision. That pivot was contentious, but ultimately, it pushed the field forward. While classification schemes and subject headings remain in use (e.g., in scientific abstracts), the scale of modern information access has rendered manual classification impractical.
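For readers newer to those evaluation metrics, a minimal worked example with invented numbers: precision is the fraction of retrieved documents that are relevant, and recall is the fraction of relevant documents that are retrieved.

    def precision_recall(retrieved, relevant):
        # retrieved: result set returned by the system; relevant: ground-truth judgments
        retrieved, relevant = set(retrieved), set(relevant)
        hits = len(retrieved & relevant)
        precision = hits / len(retrieved) if retrieved else 0.0   # how much of the output is useful
        recall = hits / len(relevant) if relevant else 0.0        # how much of the useful material was found
        return precision, recall

    # Hypothetical run: 10 documents retrieved, 8 judged relevant overall, 6 of them retrieved.
    print(precision_recall(range(1, 11), [1, 2, 3, 4, 5, 6, 20, 21]))   # (0.6, 0.75)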

A Tale of Two Models: Bridging RSJ and Maron-Kuhns in the Age of Machine Learning

Given the two competing models—Maron and Kuhns’ relevance weighting model and the Robertson–Spärck Jones (RSJ) probabilistic model—attempts were made in the early 1980s, and later, to unify them, though these efforts ultimately did not come to fruition.

I asked Robertson about the tension between the RSJ probabilistic model of IR and the earlier approach of Maron and Kuhns (1960), as well as his 1981 effort to unify the two, a piece of work he has described as interesting but, so far, not satisfactorily concluded. If he were actively conducting IR research today, would he revisit this unresolved conflict? And, given a time machine, might continuing that work in 1981 have altered the trajectory of IR, and if so, how?

That’s a fair question. Let me try to explain the conflict between the two models in simple terms.

In the Maron and Kuhns (1960) paper, they imagined a librarian sitting with a document, trying to classify it. The idea was to imagine the kinds of people who might find this book interesting, and what kinds of queries those users might submit. A good way to index the book, then, would be to use the kinds of words that such users might use in their queries. So, the book is the concrete object in front of you, and the users are an abstract class. You’re trying to index the book in a way that aligns with this class of potential users.

In contrast, with the Robertson–Spärck Jones (RSJ) model, we start with a user and their query. The user submits a query, gets back some documents, and may find some of them useful. Now the question is: how are these useful documents characterized? Can the system—or the user—reformulate the query to find more documents like these?

So, in the RSJ approach, we’re talking about an abstract class of documents like the few that were found useful, and a known, specific user. It’s the reverse of the Maron and Kuhns formulation. The RSJ principle is to adjust the query to find similar documents. Maron and Kuhns, on the other hand, focus on adjusting the indexing of the document to reach the right users.

Now, if you throw both models up in the air—neither fixed—you’ve got a problem. You can’t easily modify one to fit the other while simultaneously modifying the second to fit the first. Or at least, it’s not at all obvious how to do that. That’s the core of the conflict. These days, with developments in machine learning and statistics, one might approach this with a kind of Bayesian framework. You’d start with some initial assumptions—a sort of prior—about both the document and the query, and then adjust them gradually to bring them into alignment. Not too much, though; you don’t want to throw away the original.

I had a student who explored this line of thinking. The model he built was interesting and worked to some extent—it passed some tests. But it was far too complex. No one really took it up, and I couldn’t see how to simplify it. If I were starting again today, I would take that kind of approach, but I’d need help from strong machine learning experts to figure out how to do it systematically.

Interestingly, this kind of problem—how to bring users and items together—is very much what happens in the world of recommendation systems. Think about Amazon. They observe users and their preferences, and they have products that appeal to certain types of users. The recommendation engine tries to match these two.

Now, I personally find that world a bit alien and overly commercial. It’s not something I’ve wanted to get into. But that is precisely the space where the tension between the Maron–Kuhns and RSJ models, and possible ways to reconcile them, becomes meaningful.

So, if I had to revisit the question, I might have to “hold my nose,” so to speak, and work with recommendation data—because there’s a lot of it—and try to build a model that makes sense in that context.

Why BM25 Refuses to Die: BM25, Machine Learning, and the Trade-off Between Interpretability and Performance

In response to my query, “Would you say that classical ranking models such as BM25, TF-IDF, and even PageRank are grounded in explicit mathematical formulations? In contrast, machine learning approaches aren’t necessarily based on such formulations. Do you think they sacrifice interpretability for performance?”, Robertson said:

I do think interpretability is a challenge with machine learning-based retrieval—and with machine learning in many domains, in fact. Let me share an interaction I had with some machine learning folks while I was at Microsoft.

BM25 was being used in various contexts at Microsoft, including in Bing when it launched. What the machine learning people found was this: BM25 combines features in a specific way, based on a formal model—as you noted. But from their perspective, they preferred working in environments where you have many weak features, and the machine learning algorithm learns how to combine them.

They didn’t like that BM25 had a specific functional form that wasn’t accessible to their learner. While they could learn how to combine BM25 with other features, BM25 always emerged as a very strong signal on its own. They found it annoying that they couldn’t significantly outperform it.

Now, they could achieve similar performance without using BM25 by combining lots of weak signals—as is common on the web—but it took considerable effort. So, although I don’t know if BM25 is still used in today’s large-scale web search engines, given how much machine learning goes into them now, it may well have been phased out.

Still, that approach depends on having vast amounts of training data. In almost any other environment—where you don’t have web-scale data—it’s still hard to beat BM25. Its specific functional form has proven to be remarkably robust and difficult to replicate or improve upon using machine learning from scratch.

There are, of course, areas of IR that don’t use BM25—or even ranking at all. Legal retrieval is one such area. The legal domain still heavily relies on Boolean retrieval. We’ve largely abandoned Boolean retrieval since the 1990s in most contexts, but lawyers continue to use it.

They construct incredibly complex Boolean queries to search legal corpora or large datasets released for legal purposes—like email collections from major trials. I think it’s tied to interpretability. Legal professionals believe they understand Boolean logic—they think they know exactly what will retrieve a document and what won’t. I’m not entirely convinced they always do, especially when they construct page-long Boolean queries. But yes, the appeal is that it’s interpretable. That’s what kept Boolean search alive for so long, even when there were decent ranking models available. For some users, interpretability outweighed ranking performance.

From Cranfield to Clicks: Modeling Relevance in a Noisy, Subjective World

Relevance is not only difficult to define but also challenging to measure, as it is inherently complex, relative, and situational. What a document or piece of information means to an individual is fluid, continuously shaped by who the person is, where they are, and the context in which they engage with it. It is ultimately the user who defines relevance, making it a subjective and often ambiguous construct. In my doctoral thesis, I undertook an extensive examination of the concept of relevance, drawing particularly on Nicholas Belkin’s theory of “anomalous states of knowledge” and his model of structural change. This analysis highlighted the multiple interacting dimensions that shape relevance: the structure of the text, the cognitive structure of the recipient and the transformations it undergoes, as well as the structure and intent of the sender. Furthermore, individual cognitive styles play a significant role in shaping how relevance is perceived and judged. I concluded that, much like beauty, relevance lies in the eyes of the beholder.

This user-defined and dynamic nature of relevance has long been acknowledged in the field of library and information science. Notably, the OCLC’s initiative to reorder Ranganathan’s Five Laws retains the second law—“Every reader his or her book”—and provides a thoughtful elaboration: as Ranganathan foresaw, delivering “every person his or her book” is a demanding and exacting task. It requires not only an understanding of the current information needs and preferences of users within a community, but also the foresight to anticipate and meet their evolving future requirements. This principle resonates strongly with the contemporary emphasis on user-centered design, which reaffirms the primacy of the user—not only in shaping services and systems but also in defining the very notion of relevance itself.

In response to my query referencing Cooper’s reflections on the probabilistic ranking principle (PRP) as both a boon and a burden—and his concerns about its potential suboptimality—Robertson offered the following explanation:

Yes. And I think one aspect that might be worth revisiting here is relevance. I know that’s a topic of interest to you.

The TREC world has generally treated relevance as an absolute property—something a judge can determine definitively. But I’ve always aligned more with the Cranfield interpretation of relevance. In Cranfield, the searcher formulates a query and, ideally—though not always in practice—the searcher themselves judges relevance. In fact, Cranfield used a graded relevance scale: four levels of relevance plus a “non-relevant” category, making five levels in total. These ranged from “provides a complete solution to your problem” to “is of marginal use.” It was very much a judgment-based assessment made by the user.

So, for me, relevance has always been a subjective user judgment. It’s not a matter of logic or hard criteria—it depends on how the document resonates with the user’s cognitive state, their context, and even their willingness to make connections. For instance, whether I consider a document relevant might depend on my mood, or more rationally, on how much effort I’m willing to invest in relating that document to my problem. If it’s well written, I might engage with it and find relevance. If it’s poorly written, I might give up and deem it irrelevant.

So, relevance, in my view, is inherently noisy. It’s variable and situational. And that’s exactly why I think probabilistic models are so appropriate—they don’t claim to know what’s relevant. They say, “This may help you.”

Information Retrieval research sought to formalize the inherent subjectivity of relevance by developing relevance feedback mechanisms in retrieval systems—an effort that later evolved into tracking user clicks and behavior.

Robertson agreed and responded thus: Yes, exactly. But that also brings in inter-user variability. If another user submits the same query and clicks a particular document, that’s a signal. But as we know, a query—especially a two- or three-word query—is hardly a full representation of the user’s actual information need, or what Belkin called the anomalous state of knowledge.

So there’s no strong reason to assume that two different users issuing the same query will make the same relevance judgment on the next document that comes up. That variability across users comes on top of the inherent subjectivity within each individual’s judgment.

Is AI a Paradigm Shift in IR? Or Just an Extension?

With the rise of the Web, IR research shifted toward language models, treating retrieval as a generative task—estimating the probability that a document could produce a given query. Techniques like query likelihood with smoothing became standard. Models like Divergence from Randomness (DFR) ranked documents based on how their term distributions deviated from randomness. Learning to Rank (LTR) methods introduced machine learning to optimize ranking using query-document features. More recently, dense retrieval and semantic search bridged the lexical gap by matching meaning rather than exact terms, culminating in Retrieval-Augmented Generation (RAG), which integrates dense retrieval with large language models for more context-aware responses.
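To illustrate just the query-likelihood piece of that story, here is a sketch of a language-model ranker with Dirichlet smoothing; the toy collection and smoothing parameter are invented, and the learning-to-rank and dense-retrieval approaches mentioned above would look quite different.

    import math
    from collections import Counter

    docs = [d.split() for d in [
        "divergence from randomness ranks by term distribution",
        "query likelihood language models with smoothing",
    ]]
    collection = Counter(t for d in docs for t in d)   # background (collection) statistics
    C = sum(collection.values())

    def query_likelihood(query_terms, doc_tokens, mu=2000):
        tf = Counter(doc_tokens)
        dl = len(doc_tokens)
        log_p = 0.0
        for t in query_terms:
            p_coll = collection[t] / C                 # collection language model
            p = (tf[t] + mu * p_coll) / (dl + mu)      # Dirichlet-smoothed document model
            log_p += math.log(p) if p > 0 else float("-inf")
        return log_p                                   # log P(query | document)

    print([round(query_likelihood("language models".split(), d), 3) for d in docs])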

In response to my question, “Are neural and generative models a paradigm shift in IR or a natural evolution—and what’s next for retrieval?” Robertson responded:

I think machine learning has been influencing web search engines and other retrieval environments for quite a long time now—20 years at least. It’s built upon basic IR principles but moved beyond them in some ways. Still, that movement has only been possible in contexts like web search where there’s an abundance of training data. In other environments where such data is scarce, the core IR principles remain very relevant.

The web is a somewhat special case. Machine learning has advanced considerably, but when it comes to large language models and AI, we’re entering a different space. Even in the early years of TREC, we were already experimenting with machine learning models in IR—models that learned from relevance judgments. These were homegrown models developed within the IR community.

By the early 2000s, however, the broader machine learning field started contributing more directly, with general-purpose models being applied in IR. So, while IR had its own trajectory of learning models, there’s now a convergence happening with larger AI and ML trends.

The Role of LLMs and RAG in Today’s Search: Influence or Integration?

LLMs are powering what we now call artificial intelligence systems. They’ve been fascinating, but as far as I know, they haven’t yet made a deep impact on core retrieval processes.

Take Google’s AI Overviews, for example. When a user submits a query, the traditional search engine returns results, and then the AI overview engine looks at those results and generates a summary. It doesn’t appear to be involved in the actual search process itself. It’s more like post-retrieval summarization. That’s quite interesting and sometimes useful—though it occasionally produces very odd results!

So currently, LLMs aren’t interacting with the search process in any meaningful way. One could imagine an AI assistant that helps formulate better queries or even perform the search more effectively. In a Boolean retrieval world, we might’ve already seen AI-generated Boolean queries, but that’s not where we are now.

That brings us to Retrieval-Augmented Generation—RAG. It’s an architecture where the AI system is actually engaged with a retrieval component, and that could be a meaningful way for IR and AI to converge. I don’t know how far along RAG systems really are in practice, but the idea is promising.
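The RAG pattern Robertson refers to can be sketched schematically: retrieve passages for the query, then hand them to a generator as grounding context. This is not a description of any production system; the generate callable and the lexical-overlap scorer below are stand-ins for a real LLM interface and a real retriever such as BM25 or a dense model.

    def lexical_overlap(query, passage):
        # Placeholder scorer; a real system would use BM25 or a dense retriever.
        return len(set(query.split()) & set(passage.split()))

    def retrieve(query, index, k=3):
        # Return the top-k passages for the query from a simple in-memory index.
        return sorted(index, key=lambda p: lexical_overlap(query, p), reverse=True)[:k]

    def rag_answer(query, index, generate):
        passages = retrieve(query, index)
        prompt = ("Answer using only the context below.\n\nContext:\n"
                  + "\n".join(passages) + "\n\nQuestion: " + query)
        return generate(prompt)                        # generation grounded in retrieved text

    # Usage with a dummy generator standing in for an LLM:
    index = ["bm25 is a probabilistic ranking function",
             "okapi was an online catalogue system",
             "cranfield introduced recall and precision"]
    print(rag_answer("what is bm25", index, generate=lambda prompt: prompt[:120] + "..."))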

The Possibility of AI-Assisted Retrieval

For me, AI-assisted retrieval would mean more than just summarizing search results. It would involve an AI that actually helps perform the retrieval process—interacts with the user, understands gaps in the LLM’s knowledge, and turns to retrieval to fill those gaps.

Right now, an LLM is trained on as much data as possible and then distilled into a model that serves as its entire knowledge base. If the AI could somehow detect when that knowledge is incomplete and intelligently invoke a retrieval system to compensate, that would be transformative. I don’t think that’s happening yet—but it could be soon.

So yes, RAG may be a signpost toward this deeper integration—where IR meets AI in a more fundamental way. Until then, what we see is more like AI summarization rather than AI-assisted retrieval.

Looking Back and Looking Ahead: Transformative Moments in IR

In response to a question about the transformative moments in the evolution of Information Retrieval, Stephen Robertson said:

One thing I’ve always found interesting is how the IR field has often been energized by people and ideas coming from outside. Some disciplines can become echo chambers—closed circles of researchers reinforcing their own thinking. But IR has been quite open to new perspectives.

The early days of web search engines are a good example. The people building them had little to no background in IR research, yet their ideas transformed the field.

If you go even further back, the emergence of Boolean retrieval systems in the 1960s was triggered by something external—magnetic tape databases of scientific abstracts. These abstracts were generated by publishers using computer-assisted printing. That availability of structured data opened the door for the first generation of automated retrieval systems.

Then came TREC and the development of probabilistic models like BM25—innovations born within the field, but often responding to external challenges and datasets.

So, the field has grown both from within and from without. That openness to external ideas has led to many of IR’s most significant transformations. If AI continues its current path and becomes integrated into the retrieval process—not just bolted on for summarization—then we may be on the brink of another one.

Anticipatory Search vs. Human Inquiry: The Future of AI in Knowledge Discovery

Knowledge discovery carries different meanings across contexts. In Information Retrieval (IR), it typically refers to uncovering existing knowledge that is unknown to the user, though not necessarily original. However, in a series of influential papers, Don Swanson introduced the concept of undiscovered public knowledge—insights that are neither explicitly nor implicitly stated in the literature but can be inferred by connecting disjoint pieces of information. Davies (1989) reviewed earlier efforts to generate knowledge through retrieval and classification and highlighted techniques for surfacing hidden knowledge, such as serendipitous browsing, effective search strategies, and, potentially, future methods like Farradane’s relational indexing or artificial intelligence.

In parallel, the rise of big data spurred the evolution of the interdisciplinary field of Knowledge Discovery in Databases (KDD), which emerged to transcend the constraints of traditional statistical analysis. This approach combines deductive and inductive reasoning to extract meaningful patterns from massive, complex datasets. Data mining methods—automated or semi-automated—are capable of processing large numbers of interrelated variables to address causal heterogeneity and enhance predictive accuracy. Machine learning further advances this process by building models that iteratively learn from data, especially when explicit model structures are difficult to specify.

These advances ultimately gave rise to the broader domain of data science. Jim Gray captured this shift by coining the term The Fourth Paradigm, emphasizing data-intensive scientific discovery as a new mode of knowledge production. He advocated for investment in data infrastructure on par with traditional libraries. His vision was memorialized in the influential volume The Fourth Paradigm: Data-Intensive Scientific Discovery (Hey et al., 2009), which highlighted how the future of science increasingly depends on the ability to discover, retrieve, and synthesize knowledge—both new and previously overlooked.

When asked, “My final question to you concerns the blurring of boundaries in how we retrieve and create information. Traditionally, we retrieve information to create something new, but with the rise of generative AI, that process seems to be shifting. How do you see the future of this evolution? Where do you think it’s headed?” Robertson responded:

Yes, we’re creating information through that retrieval process, and it’s becoming more integrated with generative AI. The idea we discussed earlier, about AI systems helping with search, is something that will likely happen. But I think the vision some have right now is that, in the future, we might never have to actively ask for information at all. Instead, there would always be something anticipating what you need and providing that information even before you realize you need it—gathering sources and presenting them automatically.

There are already glimpses of that in the smartphone world today. But I do have a bit of nostalgia for the process of human research—the deliberate, conscious act of thinking, gathering, and integrating ideas ourselves. I’d be a bit sad if AI completely replaced that. Maybe that’s just nostalgia, and maybe it’s inevitable, but I do hope that doesn’t happen. I guess time will tell.

Cite this article in APA as: Urs, S. (2025, April 18). IR meets AI: Looking back, looking forward — A conversation with Stephen Robertson, pioneer of information retrieval. Information Matters. https://informationmatters.org/2025/04/ir-meets-ai-looking-back-looking-forward-a-conversation-with-stephen-robertson-pioneer-of-information-retrieval/

Shalini Urs

Dr. Shalini Urs is an information scientist with a 360-degree view of information and has researched issues ranging from the theoretical foundations of information sciences to Informatics. She is an institution builder whose brainchild is the MYRA School of Business (www.myra.ac.in), founded in 2012. She also founded the International School of Information Management (www.isim.ac.in), the first Information School in India, as an autonomous constituent unit of the University of Mysore in 2005 with grants from the Ford Foundation and Informatics India Limited. She is currently involved with Gooru India Foundation as a Board member (https://gooru.org/about/team) and is actively involved in implementing Gooru’s Learning Navigator platform across schools. She is professor emerita at the Department of Library and Information Science of the University of Mysore, India. She conceptualized and developed the Vidyanidhi Digital Library and eScholarship portal in 2000 with funding from the Government of India, which became a national initiative with further funding from the Ford Foundation in 2002.