Wikipedia Infoboxes: The Big Data Source for Knowledge Bases behind Alexa and Siri Virtual Assistants

Shalini Urs and Mohamed Minhaj

When you (or your kid) say “Hey Siri” or “Alexa,” ask a question, and get your right (or wrong) answer, did you know that a host of technologies and knowledge bases lies behind it? Virtual assistants like Alexa and Siri do their jobs better thanks to Wikidata, a not-so-well-known product of the Wikimedia Foundation. Knowledge bases such as Wikidata represent everything in the universe in a way computers can understand. An army of volunteers maintains the knowledge base, which serves an essential (though mostly unannounced) purpose as AI and voice recognition technologies expand to every corner of digital life. Denny Vrandečić, a programmer and regular Wikipedia editor (now with the Wikimedia Foundation), founded Wikidata in 2012. Vrandečić recognized the need for humans and bots to share knowledge on more equal terms, because computers lack the common sense on which language so heavily depends.

—Knowledge bases such as Wikidata represent everything in the universe in a way computers can understand.—

Semantic Web and AI

Sometimes big ideas catch attention, but it takes tremendous research and development (not to mention time) to make them happen. When Tim Berners-Lee proposed the Semantic Web idea to make Internet data machine-readable, it caught the world’s attention (Berners-Lee et al., 2001). For machines to interpret data, the data’s context and meaning, its semantics, ought to be embedded. Web technologies such as the Resource Description Framework (RDF) and the Web Ontology Language (OWL) were developed to formally represent metadata describing concepts, relationships between entities, and categories of things. Embedding semantics offers significant advantages, such as reasoning over data and operating across heterogeneous data sources.
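RDF’s core idea can be illustrated without any markup: every fact is a subject-predicate-object triple, and a pattern with wildcards retrieves matching facts. A minimal sketch in plain Python (the entity and property names are illustrative shorthand, not real RDF IRIs; production systems would use a library such as rdflib):

```python
# RDF models every fact as a (subject, predicate, object) triple.
# A tiny, illustrative triple store using plain tuples.
triples = [
    ("Thinking_Fast_and_Slow", "type", "Book"),
    ("Thinking_Fast_and_Slow", "author", "Daniel_Kahneman"),
    ("Daniel_Kahneman", "type", "Person"),
]

def query(subject=None, predicate=None, obj=None):
    """Return all triples matching the pattern; None acts as a wildcard."""
    return [
        (s, p, o)
        for s, p, o in triples
        if subject in (None, s) and predicate in (None, p) and obj in (None, o)
    ]

# Who is the author of anything in the store?
print(query(predicate="author"))
# [('Thinking_Fast_and_Slow', 'author', 'Daniel_Kahneman')]
```

Because the data is uniform triples rather than free text, a machine can answer such pattern queries mechanically, which is exactly what makes semantically embedded data machine-interpretable.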

Wikipedia: the go-to place for instant knowledge
Wikipedia is the go-to place for instant knowledge of everything under the sun. Its English edition alone has around 6 million articles. With its collaborative but curated content-creation model, it has become a valuable, universal, and hugely successful resource. However, Wikipedia’s limited search features and the form in which its data is stored pose many challenges for its use by humans and its direct interpretation by machines. Despite these shortcomings, Wikipedia attracts the information research community because its comprehensive data has a well-formed structure and hierarchical categorization. Several attempts have been made to extract Wikipedia’s unstructured and semi-structured data and transform it into structured, semantically enriched knowledge bases (KBs) that simplify effective use of the knowledge concealed in Wikipedia.

Wikipedia Infoboxes
A Wikipedia infobox is a fixed-format table, usually added to the top right-hand corner of an article, that consistently summarizes the article’s unifying aspect and facilitates navigation to related articles. For instance, books have information about the “Subject” and “Publisher”; adding an infobox to articles on books therefore makes it easier to quickly find such information and compare it with that of other articles (Figure 1). The use of infoboxes is neither required nor prohibited for any article. Instead, through discussion and consensus, Wikipedia editors decide whether to include an infobox, which infobox to include, and which parts of the infobox to use for each article. Infoboxes contain essential facts and statistics of a type common to related articles. Like fact sheets or sidebars in magazine articles, they quickly summarize essential points in an easy-to-read format. Infoboxes thus contain structured data enveloped in the textual content of Wikipedia articles. Given this structured nature and the ease of mapping its schema to many prominent metadata systems, the data in Wikipedia’s infoboxes is widely used in knowledge-based applications.

Wikipedia uses several infobox classes to store information about different kinds of entities, such as “Person” and “Place.” Each class has several sub-classes; “Person,” for example, has sub-classes for storing information about specific types of people, such as “Scientist.”

Wikipedia Infobox (Thinking, Fast and Slow)

{{Infobox book
| name = Thinking, Fast and Slow
| image = Thinking, Fast and Slow.jpg
| caption = Hardcover edition
| border = yes
| author = [[Daniel Kahneman]]
| cover_artist =
| country = United States
| language = [[English language]]
| series =
| subject = [[Psychology]]
| genre = [[Non-fiction]]
| publisher = [[Farrar, Straus and Giroux]]
| release_date = 2011
| media_type = Print ([[hardcover]], [[paperback]]), audio
| pages = 499
| isbn = 978-0374275631
| oclc = 706020998
| preceded_by =
| followed_by =
}}

Figure 1. Infobox Template used to display information about a Book in the Wikipedia article and its source code.
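Because infobox source follows a regular `| key = value` pattern, its fields can be extracted mechanically, which is what makes infoboxes such a convenient source of structured data. A minimal sketch (assuming a single well-formed infobox; real Wikipedia markup contains nested templates that need a proper parser, such as the mwparserfromhell library):

```python
import re

def parse_infobox(wikitext):
    """Extract | key = value fields from an infobox template into a dict."""
    fields = {}
    for line in wikitext.splitlines():
        match = re.match(r"\|\s*(\w+)\s*=\s*(.*)", line.strip())
        if match:
            key, value = match.groups()
            # Strip [[target|label]] wiki-link brackets, keeping the target text.
            value = re.sub(r"\[\[([^\]|]*)(?:\|[^\]]*)?\]\]", r"\1", value)
            fields[key] = value.strip()
    return fields

infobox = """{{Infobox book
| name = Thinking, Fast and Slow
| author = [[Daniel Kahneman]]
| publisher = [[Farrar, Straus and Giroux]]
| release_date = 2011
}}"""

print(parse_infobox(infobox)["author"])  # Daniel Kahneman
```

The resulting dictionary maps directly onto fields of metadata schemas (author, publisher, date), which is essentially how extraction frameworks turn infoboxes into KB records.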

Knowledge base (KB)
KBs have a long history, dating back to the expert systems of the 1970s (Ratner & Ré, 2018). However, they have gained prominence in recent times because of their use in many Semantic Web applications and machine-learning activities. A KB is a vast collection of knowledge about the world. Its data is generally gathered from multiple sources and by numerous people, then integrated and stored in a structured form for easy access. A crucial feature of contemporary KBs is that they are machine-interpretable: besides being used by humans, they are machine-friendly and can be employed for automated tasks. It is this computer understandability that makes modern KBs fit the machine-learning setting.

In light of Wikipedia’s fast-growing, collaboratively amassed collection of figures and facts, several KBs have been constructed on top of it. These include DBpedia (Auer et al., 2007), YAGO (Suchanek et al., 2007), and Wikidata (Vrandečić & Krötzsch, 2014). Certain KBs have been built for private use, whereas others are available for public use; some have been built manually, others in semi- or fully automated mode. These KBs not only enable effective use of Wikipedia’s concealed knowledge by humans, but also facilitate better and faster interpretation of that knowledge by machines. In addition, they have spurred the development of many smart applications, including question-answering systems.

While Wikipedia in general, and its infoboxes in particular, serve as a vital source of figures and facts for KB creation, the profusion of facts calls for a mechanism to refine and organize them in a machine-friendly form. Ontologies are one of the popular means of representing knowledge in a structured form suitable for machine interpretation. An ontology is defined as a formal and explicit specification of a shared conceptualization (Gruber, 1995), where conceptualization refers to an abstract representation of the domain we would like to model for a specific purpose. An ontology primarily contains descriptions of concepts, their properties, and the relationships among the concepts.
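The “reasoning over data” that ontologies enable can be sketched with a toy subclass hierarchy (the class names below are illustrative, not drawn from any real ontology): a fact asserted at one level lets a machine infer facts at more general levels.

```python
# A toy ontology: subclass relations plus one instance assertion.
subclass_of = {
    "Scientist": "Person",
    "Writer": "Person",
    "Person": "Thing",
}
instance_of = {"Daniel_Kahneman": "Scientist"}

def types_of(entity):
    """Infer all classes of an entity by walking up the subclass chain."""
    types = []
    cls = instance_of.get(entity)
    while cls is not None:
        types.append(cls)
        cls = subclass_of.get(cls)
    return types

print(types_of("Daniel_Kahneman"))  # ['Scientist', 'Person', 'Thing']
```

Only one fact was asserted (that the entity is a “Scientist”), yet the subclass structure lets the machine conclude it is also a “Person”; this kind of entailment is what OWL reasoners perform at scale.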

Two Popular KBs that Have Leveraged Wikipedia Infoboxes: DBpedia and Wikidata

The primary objective of the DBpedia project was to convert Wikipedia content into structured knowledge that could support Semantic Web techniques: asking sophisticated queries against Wikipedia, linking it to other datasets on the web, or creating new applications. The project developed an information extraction framework to convert Wikipedia content into RDF; the initial dataset consisted of 103 million RDF triples (Auer et al., 2007). Besides a web interface for accessing the knowledge base, the open and linked nature of DBpedia facilitates connecting its content to other open KBs and integrating it with other semantic technologies. Since the initial version in 2007, a significant global community has continuously improved and extended DBpedia. It has been the precursor of many successful KBs in use today; many projects have employed it for prototyping and proofs-of-concept, and it has been instrumental in many semantic technology innovations. Enterprises such as Apple, Google, and IBM have adopted the idea of data extraction from DBpedia for their high-visibility AI projects: Siri, the Google Knowledge Graph, and Watson, respectively.
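The “sophisticated queries against Wikipedia” that DBpedia enables are typically written in SPARQL and sent to DBpedia’s public endpoint. A sketch that only constructs the request URL (the `dbo:`/`dbr:` prefixes follow DBpedia’s published ontology and resource namespaces; actually sending the request requires network access, so it is left as a comment):

```python
from urllib.parse import urlencode

# SPARQL pattern query: books whose author is Daniel Kahneman.
sparql = """
SELECT ?book WHERE {
  ?book dbo:author dbr:Daniel_Kahneman .
  ?book rdf:type dbo:Book .
}
"""

# DBpedia's public SPARQL endpoint accepts the query as a URL parameter.
endpoint = "https://dbpedia.org/sparql"
url = endpoint + "?" + urlencode({"query": sparql, "format": "application/json"})

# To execute: urllib.request.urlopen(url).read() would return JSON bindings.
print(url[:60])
```

Note how the SPARQL pattern mirrors the triple structure described earlier: each line inside `WHERE` is a subject-predicate-object template with `?book` as a variable to be bound.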

Wikimedia Deutschland, the German chapter of the Wikimedia movement, started the Wikidata project to create central storage for the structured data of Wikimedia’s sister projects, including Wikipedia. Given the collaborative and multilingual nature of Wikipedia, the same piece of information appears in articles in many languages, and often in many articles within a single language edition. For example, besides appearing in both the English and Italian articles about Rome, Rome’s population also appears in the English article Cities in Italy, and the numbers on these pages can differ. The goal of Wikidata is to overcome such problems by creating new ways for Wikipedia to manage its data on a global scale. Launched in 2012, Wikidata is a widely used, accessible, and open knowledge base that can be read and edited by both humans and machines.

Wikidata enriches the catalogs and collections of GLAMs
Wikidata unlocks nearly two decades of data collection and curation by volunteers to create a language-independent, linked, open, and structured database that is usable and friendly for both people and computers. Volunteer communities worldwide unleash the multilingual and global collaboration of Wikipedia to index and describe topics as varied as food, art, and medicine. In addition, Wikidata connects other databases and collections of information, allowing computers and software to see connections between hundreds of data sources.

Galleries, Libraries, Archives, and Museums (GLAMs) communities partner with the Wiki community through various projects. Structured data through Wikidata allows collaboration with GLAMs at scale, building upon more than a decade of experience with hundreds of active partnerships as part of the GLAM-Wiki program. In addition, there are diverse strategies and tactics for enriching, connecting, and learning from heritage collections, spreading the data across dozens of Wikipedia language communities and external applications.

These collaborations have enriched GLAM catalogs to go beyond traditional metadata, putting their data and materials within the context of the broader digital landscape and making that context visible. One great example is the Museum of Modern Art (MoMA) in New York, which has integrated Wikidata and associated Wikipedia articles into the “artist” pages of its online catalog and regularly runs a wide range of edit-a-thons, in which participants communally create and update Wikipedia entries on a given topic.

As more and more museums and other memory institutions open up to collaborations with Wikidata, it is a win-win for both: Wikidata’s reliability is further enhanced with curated data from GLAM catalogs, and GLAMs’ collections and catalogs are enriched with context and rich visualizations.

The demonstrated efficiency of knowledge bases in powering AI and other smart applications with semantically enriched data has propelled research on KBs. Several large-scale knowledge bases have been created in recent times, and many of these projects have leveraged the Big Data concealed in Wikipedia infoboxes. These large-scale, cross-domain, open-access knowledge bases have become fertile ground for AI applications such as question-answering systems. For example, DBpedia was one of the knowledge sources in IBM Watson’s Jeopardy!-winning system (Ferrucci et al., 2010), and Wikidata’s structured dataset is used by virtual assistants such as Apple’s Siri and Amazon’s Alexa (Simonite, 2019). Knowledge bases not only enable effective use of Wikipedia’s knowledge by humans but also facilitate better and faster interpretation of knowledge by machines. Further, with the open nature of knowledge bases, the web as a whole is transforming into an extensive collection of interlinked, semantically enriched Linked Data (Bizer et al., 2011).

Auer, S., Bizer, C., Kobilarov, G., Lehmann, J., Cyganiak, R., & Ives, Z. (2007). Dbpedia: A nucleus for a web of open data. In The semantic web (pp. 722–735). Springer.
Berners-Lee, T., Hendler, J., & Lassila, O. (2001). The semantic web. Scientific American, 284(5), 34–43.
Bizer, C., Heath, T., & Berners-Lee, T. (2011). Linked data: The story so far. In Semantic services, interoperability and web applications: Emerging concepts. IGI Global.
Ferrucci, D., Brown, E., Chu-Carroll, J., Fan, J., Gondek, D., Kalyanpur, A. A., Lally, A., Murdock, J. W., Nyberg, E., Prager, J., Schlaefer, N., & Welty, C. (2010). Building Watson: An Overview of the DeepQA Project. AI Magazine, 31(3), 59–79.
Gruber, T. R. (1995). Toward principles for the design of ontologies used for knowledge sharing? International Journal of Human-Computer Studies, 43(5–6), 907–928.
Introducing the Knowledge Graph: Things, not strings. (2012, May 16). Google.
Ratner, A., & Ré, C. (2018). Knowledge Base Construction in the Machine-learning Era: Three critical design points: Joint-learning, weak supervision, and new representations. Queue, 16(3), 79–90.
Simonite, T. (2019). Inside the Alexa-friendly world of Wikidata. WIRED.
Suchanek, F. M., Kasneci, G., & Weikum, G. (2007). Yago: A core of semantic knowledge. Proceedings of the 16th International Conference on World Wide Web, 697–706.
Vrandečić, D., & Krötzsch, M. (2014). Wikidata: A free collaborative knowledgebase. Communications of the ACM, 57(10), 78–85.

Cite this article in APA as: Minhaj, M., & Urs, S. (2021, November 23). Wikipedia Infoboxes: The big data source for knowledge bases behind Alexa and Siri virtual assistants. Information Matters.  Vol.1, Issue 11.

Shalini Urs

Dr. Shalini Urs is an information scientist with a 360-degree view of information and has researched issues ranging from the theoretical foundations of information sciences to informatics. She is an institution builder whose brainchild is the MYRA School of Business, founded in 2012. She also founded the International School of Information Management, the first information school in India, as an autonomous constituent unit of the University of Mysore in 2005, with grants from the Ford Foundation and Informatics India Limited. She is currently involved with the Gooru India Foundation as a board member and is actively involved in implementing Gooru’s Learning Navigator platform across schools. She is professor emerita at the Department of Library and Information Science of the University of Mysore, India. She conceptualized and developed the Vidyanidhi Digital Library and eScholarship portal in 2000 with funding from the Government of India; it became a national initiative with further funding from the Ford Foundation in 2002.