Unifying Telescope and Microscope: A Multi-lens Framework with Open Data for Modeling Emerging Events

Yunhe Feng and Chirag Shah

The famous folk tale of The Blind Men and the Elephant shows how different perspectives can lead to distinct points of view: the elephant could be taken for a wall, snake, spear, tree, fan, or rope, depending on where each blind man touched it. Similarly, when an event is investigated and modeled from only a limited set of perspectives, the conclusion is more likely to be unconvincing or even biased.

The benefits of open data, such as accessibility and transparency, have motivated and enabled many research studies and applications in both academia and industry. However, each open data source offers only a single perspective, and its inherent limitations (e.g., demographic biases) may lead to poor decisions and misjudgments. Both traditional and emerging open data sources have intrinsic and exclusive features that provide unique perspectives but also impose constraints and bottlenecks. For example, census data covers a very large population but fails to reflect monthly and daily changes. Google Trends reports normalized search interest rather than actual search volumes. Social media data, such as tweets, contains heterogeneous information but may introduce demographic biases.
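To see why normalized search interest cannot stand in for actual search volumes, consider the following minimal sketch. It assumes a simplified peak-scaling rule (each value divided by the series maximum and scaled to 0-100); the function name and the numbers are hypothetical, and the real Google Trends procedure involves additional steps such as sampling.

```python
from typing import List


def normalize_to_peak(volumes: List[int]) -> List[int]:
    """Scale a series so that its largest value becomes 100 (simplified assumption)."""
    peak = max(volumes)
    if peak == 0:
        return [0 for _ in volumes]
    return [round(100 * v / peak) for v in volumes]


# Two made-up daily search-volume series that differ by a factor of 100.
small_region = [10, 20, 40, 30]
large_region = [1000, 2000, 4000, 3000]

small_interest = normalize_to_peak(small_region)
large_interest = normalize_to_peak(large_region)

# Both series collapse to the same normalized curve, so the absolute
# volumes cannot be recovered from the normalized values alone.
same_curve = small_interest == large_interest
```

Under this assumption the two regions look identical after normalization, which is the limitation the meso lens carries with it.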

In our recently published IP&M paper titled “Unifying Telescope and Microscope: A Multi-Lens Framework with Open Data for Modeling Emerging Events,” we investigate and summarize the characteristics of three forms of open data, i.e., census data, search logs, and social media data, on eleven aspects, as shown in Figure 1. For instance, because census data covers almost the entire population of a given region, its degree of aggregation is high. Most, if not all, Internet users rely on search engines to filter and access information online, which gives search log data a medium degree of aggregation. Only registered and active users contribute to social media content, implying a low degree of aggregation. The accessibility of census data is high because it can be downloaded directly from government websites. The accessibility of search log data is low because logs are usually protected, and only anonymized samples or aggregated data are available to the public. Social media data can be retrieved through official social media APIs, which require authentication as a developer or researcher, so its accessibility is rated medium. For data diversity, we rate social media data high because of its multimedia content, while the other two lenses offer a very limited number of data types.

Figure 1. Characteristics of census data, search logs data, and social media data.

When investigating a phenomenon or an event, choosing a data source can be seen as choosing a lens for observation in the physical sciences: a telescope and a microscope both allow us to observe, but they reveal two very different worlds. Here, we consider lenses that cover three levels of observation: macro, meso, and micro. A macro lens lets us look at a phenomenon from a distance, covering a large area but with limited precision. A micro lens, on the other hand, provides a more specific picture but may be prone to localized fluctuations. A meso lens falls between the two. While each of these lenses has its relative pros and cons and scientists must choose which to use, a more meaningful picture can emerge through a careful combination of some or all of them. Identifying appropriate data sources to build these lenses can be challenging. Ubiquitous open data offers an excellent opportunity to define and enable multiple lenses for looking at events of interest through different eyes.

However, the idea of integrating multiple open-data lenses in a generalized and effortless way is still under-explored. To bridge this research gap, in our IP&M paper, we propose a universal and easy-to-use framework that incorporates multi-source open data retrieval, feature extraction, and the training and fusion of machine learning models to investigate events of interest (see Figure 2). Specifically, we instantiate census data as the macro lens because it offers an overall picture of a large area and a large population. Social media data serves as the micro lens, examining the detailed and diverse features of individuals in a timely manner. Aggregated search-engine data, such as Google Trends, is selected as the meso lens because it usually summarizes daily search patterns generated by a relatively large group of users. Our framework requires users, who can be researchers and practitioners in industry, academia, and government, only to provide event keywords, timelines, and locations by answering what, when, and where questions. Based on the users’ inputs, our framework retrieves open data from government websites, search engines, and social media, respectively, and then conducts feature engineering and builds models automatically.

Figure 2. Overview of the multi-lens framework with open data.
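The what, when, and where inputs described above can be sketched as a small data structure. This is only an illustration under stated assumptions: `EventQuery`, `LENS_SOURCES`, and `plan_retrieval` are hypothetical names, not the framework's actual interface, and the two state codes are arbitrary examples.

```python
from dataclasses import dataclass
from datetime import date
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class EventQuery:
    """Hypothetical container for the three W's of an event."""
    keywords: Tuple[str, ...]   # what
    start: date                 # when (first day)
    end: date                   # when (last day)
    locations: Tuple[str, ...]  # where


# Lens-to-source assignment as described in the text.
LENS_SOURCES: Dict[str, str] = {
    "macro": "census data",
    "meso": "aggregated search data",
    "micro": "social media data",
}


def plan_retrieval(query: EventQuery) -> List[Dict[str, object]]:
    """List one retrieval task per lens and location (illustrative only)."""
    if query.end < query.start:
        raise ValueError("end date precedes start date")
    return [
        {
            "lens": lens,
            "source": source,
            "location": location,
            "keywords": query.keywords,
            "start": query.start,
            "end": query.end,
        }
        for lens, source in LENS_SOURCES.items()
        for location in query.locations
    ]


# Example: the COVID-19 case-study window, restricted to two states.
covid_query = EventQuery(
    keywords=("COVID-19",),
    start=date(2020, 4, 4),
    end=date(2020, 5, 9),
    locations=("WA", "TN"),
)
tasks = plan_retrieval(covid_query)
```

Each task here only describes what would be fetched; the actual downloading, feature extraction, and model training are outside the scope of this sketch.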

With little manual effort, the framework’s generalization and automation capabilities enable rapid investigation of general events and phenomena, such as disasters, sports events, and political activities. To demonstrate the usability and effectiveness of the proposed framework, we use the COVID-19 pandemic and the solar eclipse of August 21, 2017, as case studies, estimating how COVID-19 progressed across U.S. states and when the total eclipse occurred using individual and collective lenses. Specifically, we use COVID-19-related and eclipse-related words as keywords to collect multi-source open data, including census data, Google Trends, and Twitter data, in 50 U.S. states and D.C. from April 4 to May 9, 2020, for COVID-19 and on August 21, 2017, for the eclipse. The census data, as a macro lens, provides an overall demographic distribution pattern covering the entire population across U.S. states. Google Trends data, as a meso lens, indicates the aggregated search interest in the COVID-19 pandemic and the solar eclipse among millions of U.S. Google users. Twitter data, as a micro lens, enables us to take a closer look at individuals’ attitudes and behaviors regarding COVID-19 and the solar eclipse. We then adopt the framework’s two mechanisms to perform data fusion and train seven types of regression models. For estimating daily confirmed COVID-19 cases and deaths, the best models trained through our framework beat those trained on expert-generated datasets in more than one-fourth of the 50 U.S. states and D.C. More importantly, our approach requires less manual effort to collect, preprocess, and organize data and less domain knowledge to create data features. For the eclipse case study, we find that multi-lens models outperform every single-lens model in 33% of U.S. states. We believe the proposed framework is generalizable enough to study a wide range of real-world events and social phenomena using publicly available data.

The openness and high usability of the proposed framework make it easy to adopt in cross-disciplinary research. First, all data used in the framework is open data that researchers can access with very little effort. In addition, the framework is easy to use for those with little programming experience, because they only need to answer the three W questions (what, when, and where) about the event to be investigated. Our open-source data collection scripts then automatically download the corresponding census data, search logs, and social media data. Once the data is ready, feature extraction and model training in the following steps are also performed automatically.

We think the proposed framework benefits a broad research community across disciplines and domains. For example, researchers in public transportation can use our framework to explore emerging transportation systems, e.g., shared dockless electric scooters, in given cities and time periods. The framework can also be used to monitor and analyze time-sensitive social and political events, for example, the 2020 United States presidential election; recall that Google Trends data is updated daily and Twitter data can be retrieved in real time. Neither example requires intensive data collection effort or a large budget.

In many scenarios, it is impractical to ask companies and organizations to make their raw data open access, even for research purposes. Instead, it is more acceptable and reasonable to ask whether they can release pre-trained models, which minimizes the risk of leaking customers’ personal information. On the one hand, the raw data never leaves company devices, and access to it is restricted to authorized personnel. On the other hand, companies can design and provide Application Programming Interface (API) services that allow external users to submit queries and retrieve pre-trained model outputs for specified time periods and regions.
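One way to picture such a service is a thin wrapper that keeps raw records internal and exposes only model outputs for a requested region and date range. The sketch below is hypothetical throughout: the class name, the `query` signature, the toy records, and the averaging "model" are assumptions made for illustration, not an existing company API.

```python
from datetime import date
from typing import Callable, Dict, List, Tuple

RecordKey = Tuple[str, date]  # (region, day)


class PretrainedModelService:
    """Hypothetical service: raw records stay inside, only predictions leave."""

    def __init__(
        self,
        private_records: Dict[RecordKey, List[float]],
        model: Callable[[List[float]], float],
    ) -> None:
        self._records = private_records  # never returned to callers
        self._model = model

    def query(self, region: str, start: date, end: date) -> Dict[date, float]:
        """Return one model output per day for the given region and period."""
        return {
            day: self._model(features)
            for (rec_region, day), features in sorted(self._records.items())
            if rec_region == region and start <= day <= end
        }


# Toy private data and a stand-in "pre-trained model" (a simple average).
records = {
    ("WA", date(2020, 4, 4)): [1.0, 2.0],
    ("WA", date(2020, 4, 5)): [2.0, 3.0],
    ("TN", date(2020, 4, 4)): [5.0, 5.0],
}
service = PretrainedModelService(records, model=lambda f: sum(f) / len(f))
wa_outputs = service.query("WA", date(2020, 4, 4), date(2020, 4, 5))
```

External users see only the per-day outputs in `wa_outputs`; the feature vectors behind them are never part of the response.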

We can integrate such a pre-trained model into the proposed framework by treating it as an individual model in the model fusion-based training mechanism. Together with the outputs of other models trained independently on open data, its outputs are concatenated into new input features for training a final model. The lens learned from private data is thus merged seamlessly into the existing workflow, making our approach more robust without incurring privacy violations or ethical issues. From another perspective, agencies that hold exclusive data can also use our framework, partially or entirely, to incorporate open lenses into their internal model training.
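The concatenate-then-train step can be sketched as a small stacking example. This is a minimal illustration, assuming plain least squares as the final model and two toy base models; `open_lens_model`, `private_model_stub`, and the synthetic targets are hypothetical stand-ins, not the models or data used in the paper.

```python
from typing import Callable, List, Sequence

Vector = List[float]
Model = Callable[[Sequence[float]], float]


def solve_linear_system(a: List[Vector], b: Vector) -> Vector:
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pivot_value = aug[col][col]
        if abs(pivot_value) < 1e-12:
            raise ValueError("base-model outputs are linearly dependent")
        aug[col] = [v / pivot_value for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [aug[i][n] for i in range(n)]


def fit_least_squares(features: List[Vector], targets: Vector) -> Vector:
    """Return weights (intercept first) via the normal equations."""
    rows = [[1.0] + list(f) for f in features]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(rows, targets)) for i in range(k)]
    return solve_linear_system(xtx, xty)


def stack_outputs(models: List[Model], inputs: List[Sequence[float]]) -> List[Vector]:
    """Concatenate each base model's output into one feature row per input."""
    return [[m(x) for m in models] for x in inputs]


def predict(weights: Vector, base_outputs: Sequence[float]) -> float:
    return weights[0] + sum(w * o for w, o in zip(weights[1:], base_outputs))


def open_lens_model(x: Sequence[float]) -> float:
    return 2.0 * x[0]  # stand-in for a model trained on open data


def private_model_stub(x: Sequence[float]) -> float:
    return x[1] + 1.0  # stand-in for outputs returned by a private model


inputs = [(1.0, 0.0), (2.0, 1.0), (3.0, 5.0), (4.0, 2.0), (0.0, 3.0)]
stacked = stack_outputs([open_lens_model, private_model_stub], inputs)
# Synthetic targets built from known weights so the fit can be checked.
targets = [1.0 + 3.0 * o + 0.5 * p for o, p in stacked]
weights = fit_least_squares(stacked, targets)
```

Because the final model sees only base-model outputs, the private lens contributes to the prediction without its raw data ever entering the training step.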

Paper free URL:

Cite this article in APA as: Feng, Y., & Shah, C. (2022, January 7). Unifying telescope and microscope: A multi-lens framework with open data for modeling emerging events. Information Matters. Vol. 2, Issue 1.

Yunhe Feng

I'm a UW Data Science Postdoctoral Fellow with Dr. Chirag Shah at the University of Washington. I received my Ph.D. degree in Computer Science with Dr. Qing (Charles) Cao in 2020 from the University of Tennessee, Knoxville, USA, and obtained my B.E. and M.E. degrees from Beijing University of Technology, China. My interests lie in Responsible AI, Mobile Security and Privacy, and Big Data Analytics, Mining, and Modeling. Besides research, I love sports and outdoor activities, including half marathons, tennis, soccer, and fishing.