Is Automated Content Moderation Going to Solve Our Misinformation Problems?

Benjamin D. Horne
School of Information Sciences, University of Tennessee Knoxville.

We all use social media. Sometimes, we use it to make decisions about what to buy and who to vote for in an election. Other times, we use it to share updates about our new puppy and to recommend the new brewery in town. Yet, despite all these positive use cases for social media, we know that not all information is created equal. Within this mixture of dog pictures, beer recommendations, and news is an array of low-quality information, ranging from deliberately false information to dangerous conspiracy theories to hate speech.

This bad content has been said to cause, for lack of a better term, offline harms, such as inciting violence or discouraging people from getting vaccinated during a deadly pandemic. While it is very difficult to say whether a bad social media post causes a bad decision, evidence suggests that social media plays some role (Blackburn and Zannettou 2022). These problems have forced academic researchers, policy makers, and the social media platforms themselves to think deeply about ways to stop the spread and consumption of such content.

The promised solution 

One proposed solution to stop bad content’s spread is to use Artificial Intelligence (AI), an approach often called automated content moderation. In fact, in response to public scrutiny and logistical challenges, AI is frequently the solution offered up by social media companies to concerned users, lawmakers, and investors (Gillespie 2020).

The actions taken by automated content moderation tools can range from harsh to soft. A tool may automatically remove a bad post from a social media feed. It may demote a bad post, making it difficult to discover on a platform. It could place a warning label on the post, telling users to doubt the credibility or accuracy of the content. It could temporarily ban users, or quarantine communities, that repeatedly break rules. However, no matter the goal of the moderation tool, the big question is: will automating that goal actually work?

Will automated content moderation work? 

Researchers and software engineers have made great strides in building automated content moderation tools and have written many papers about these tools. To be more concrete about what I mean by many: since 2016 there have been over 58,000 papers indexed by Google Scholar that use the phrase “fake news detection.” The proposed software solutions from these papers range widely. Some use features of the text in a social media post or a news article to predict if the content is bad (Horne et al. 2019). Others use features related to the users who share the information on a social network to predict if the content is bad (Shu et al. 2019). And many of them report surprisingly high prediction accuracy on test sets of data.

Despite all this progress, I (and others) argue that these tools are not good enough yet. There are a few reasons for this doubt, but I want to focus on one major gap in how we evaluate content moderation tools: we do not know what happens when an automated tool makes a mistake. And no matter how high the reported accuracy of a tool is, these tools are prone to making mistakes.

Why are machines prone to making mistakes?

In most settings, content moderation tools are machine learning tools trained to classify content as bad or good. At the most basic level, this means that to train the tool we need to label each item in a set of data as belonging to one of these two categories. For instance, in news article classifiers, we could label the individual claims in the articles as being true or false, we could label each full article as having true or false content, or we could label each news outlet as being reliable or unreliable. In each case, the line between what is considered good and what is considered bad must be drawn. Often this means leaving data that we are uncertain about out of training and testing.

Where should we draw the line between good and bad when training these tools? What data should be left out of training and testing? Should this line change over time? How does this line work when labeling user-generated social media posts or hate speech? 

[Figure: Simplified hypothetical example of labeling news data by outlet reliability]
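
To make the line-drawing problem concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the outlet names, the reliability scores, and the two cutoffs are invented, and real labeling schemes are more involved. It shows only that moving the cutoffs changes which outlets get a label and which are left out of the data entirely.

```python
# Hypothetical reliability scores in [0, 1]; outlet names and numbers are invented.
OUTLET_SCORES = {
    "outlet_a": 0.95,
    "outlet_b": 0.80,
    "outlet_c": 0.55,
    "outlet_d": 0.45,
    "outlet_e": 0.20,
    "outlet_f": 0.05,
}


def label_outlets(scores, unreliable_below, reliable_from):
    """Label each outlet as reliable, unreliable, or None (left out as uncertain)."""
    if unreliable_below > reliable_from:
        raise ValueError("unreliable_below must not exceed reliable_from")
    labels = {}
    for outlet, score in scores.items():
        if score >= reliable_from:
            labels[outlet] = "reliable"
        elif score < unreliable_below:
            labels[outlet] = "unreliable"
        else:
            # Falls between the two cutoffs: excluded from training and testing.
            labels[outlet] = None
    return labels


# A cautious line: outlets in the middle are dropped from the data set.
cautious = label_outlets(OUTLET_SCORES, unreliable_below=0.25, reliable_from=0.75)

# A single hard line: every outlet is forced into one of the two categories.
forced = label_outlets(OUTLET_SCORES, unreliable_below=0.5, reliable_from=0.5)

for outlet in OUTLET_SCORES:
    print(outlet, cautious[outlet], forced[outlet])
```

Under the cautious cutoffs, the two middle outlets never appear in training or testing, so the tool is never evaluated on content like theirs. Under the single hard line, they are labeled, but with much less certainty behind the label.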

Why the line matters

This line choice can be a problem because we typically test our tools on data that comes from the same distribution as the training examples (the samples are assumed to be independently and identically distributed, often called the IID assumption) (Bengio et al. 2021). This idea can be a bit hard to picture, but in our example, it means that if my news article classifier is asked to predict whether an article from an outlet that I did not label during training is good or bad, we can’t be certain the prediction will be correct. If the article looks like other bad articles in our training data set, then it will be predicted to be bad. If the article looks drastically different from all articles in our training data set, it’s a coin toss.

When we compute a standard accuracy score for an automated content moderation tool, we can only expect that accuracy to hold if the tool is deployed in a setting where it receives the same distribution of inputs.
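
A small simulation can illustrate this point. The sketch below is a toy under stated assumptions, not a real moderation tool: each article is reduced to a single made-up numeric feature, the classifier is just a threshold placed halfway between the two class means, and the "shifted" test set is generated by moving both classes along that feature. All distributions and parameter values are invented.

```python
import random
from statistics import mean


def make_articles(rng, n, reliable_mean, unreliable_mean, spread=0.1):
    """Return (feature, label) pairs; label 1 = unreliable, 0 = reliable."""
    reliable = [(rng.gauss(reliable_mean, spread), 0) for _ in range(n)]
    unreliable = [(rng.gauss(unreliable_mean, spread), 1) for _ in range(n)]
    return reliable + unreliable


def fit_threshold(data):
    """Place the decision line halfway between the two class means."""
    mean_reliable = mean(x for x, y in data if y == 0)
    mean_unreliable = mean(x for x, y in data if y == 1)
    return (mean_reliable + mean_unreliable) / 2


def accuracy(threshold, data):
    """Share of articles where 'feature above threshold' matches the unreliable label."""
    correct = sum((x > threshold) == (y == 1) for x, y in data)
    return correct / len(data)


rng = random.Random(0)  # fixed seed so the run is repeatable

# Training data and a test set drawn from the same distribution (the IID setting).
train = make_articles(rng, 1000, reliable_mean=0.3, unreliable_mean=0.7)
test_same = make_articles(rng, 1000, reliable_mean=0.3, unreliable_mean=0.7)

# A test set from a different setting, where both classes sit higher on the feature.
test_shifted = make_articles(rng, 1000, reliable_mean=0.55, unreliable_mean=0.9)

threshold = fit_threshold(train)
acc_same = accuracy(threshold, test_same)
acc_shifted = accuracy(threshold, test_shifted)

print(f"accuracy, same distribution:    {acc_same:.3f}")
print(f"accuracy, shifted distribution: {acc_shifted:.3f}")
```

In this toy, the accuracy measured on the same-distribution test set says little about the shifted one, where many reliable articles land on the wrong side of the line. The reported score describes the tool only under the conditions it was tested in.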

Let me give you some practical examples. If a tool is trained on U.S. news articles, it may make mistakes on U.K. news articles (Horne et al. 2020). If a tool is trained on data from one time period, in which a major event happened, it may make mistakes on content related to a future major event (Bozarth and Budak 2020, Horne et al. 2019). If a tool is trained on data that is biased towards one political leaning, it may make mistakes on articles from the other political leaning (Bozarth and Budak 2020). While there are many variations in configuring these tools, each has difficult performance and bias trade-offs.

“Objectively singular and knowable” (Hutchinson et al. 2022)

Long story short, when building these automated tools, we make strong assumptions that abstract concepts (like what is true or what is good) can be mapped cleanly onto well-defined categories. We assume that our labeled data is “objectively singular and knowable”, even though knowledge may be “socially and culturally dependent” (Hutchinson et al. 2022, Udupa et al. 2022). A machine is prone to making mistakes in this setting.

The cost of making mistakes 

So what? Is making a few mistakes worse than the situation we are in now? Well, that is hard to say because we haven’t thoroughly studied what happens when mistakes are made. If a moderation tool mistakenly places a warning label on credible content, will users trust future warning labels? Is mislabeling a controversial, yet true, news article as false going to drive users to fringe, alt-tech social media platforms, where information may be more extreme? Is mislabeling a true social media post as false as costly as not labeling a false social media post at all? Are mistakes on information related to politics as bad as mistakes on information related to public health? 
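
One way to see why these questions matter for evaluation is to weight the two kinds of mistakes differently. The sketch below is purely illustrative: the two tools, their error counts, and the cost weights are all hypothetical, and the weights in particular are placeholders for exactly the quantities that have not yet been studied.

```python
from dataclasses import dataclass


@dataclass
class ErrorCounts:
    true_pos: int   # bad content correctly flagged
    false_pos: int  # credible content wrongly flagged
    true_neg: int   # credible content correctly left alone
    false_neg: int  # bad content that was missed

    def accuracy(self):
        total = self.true_pos + self.false_pos + self.true_neg + self.false_neg
        return (self.true_pos + self.true_neg) / total

    def weighted_cost(self, cost_false_pos, cost_false_neg):
        """Total cost under assumed per-mistake costs (the weights are placeholders)."""
        return self.false_pos * cost_false_pos + self.false_neg * cost_false_neg


# Two hypothetical tools evaluated on the same 500 bad and 500 credible posts.
tool_a = ErrorCounts(true_pos=450, false_pos=80, true_neg=420, false_neg=50)
tool_b = ErrorCounts(true_pos=420, false_pos=50, true_neg=450, false_neg=80)

print("accuracy:", tool_a.accuracy(), tool_b.accuracy())

# Scenario 1: wrongly flagging credible content is assumed to be five times as costly.
print("flagging errors costly:", tool_a.weighted_cost(5, 1), tool_b.weighted_cost(5, 1))

# Scenario 2: missing bad content is assumed to be five times as costly.
print("missed content costly: ", tool_a.weighted_cost(1, 5), tool_b.weighted_cost(1, 5))
```

Both tools have the same accuracy, yet which one looks better flips depending on the assumed costs. Without studying what mistakes actually do to users, there is no principled way to choose those weights.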

If automated content moderation is going to have any chance at solving our problems with misinformation, these types of questions need to be studied. In turn, the answers to these questions can be used to evaluate our moderation tools. Only then can we have any confidence in automated content moderation.  

Further, depending on the results of studying these mistakes, we must be open to automated content moderation not being the solution to our problem, and instead seek alternatives (Hutchinson et al. 2022).


Bengio, Y., Lecun, Y., & Hinton, G. (2021). Deep learning for AI. Communications of the ACM, 64(7), 58-65.

Blackburn, J., & Zannettou, S. (2022). Effect of Social Networking on Real-World Events. IEEE Internet Computing, 26(2), 5-6.

Bozarth, L., & Budak, C. (2020, May). Toward a better performance evaluation framework for fake news classification. In Proceedings of the international AAAI conference on web and social media (Vol. 14, pp. 60-71).

Gillespie, T. (2020). Content moderation, AI, and the question of scale. Big Data & Society, 7(2), 2053951720943234.

Horne, B. D., Nørregaard, J., & Adalı, S. (2019). Robust fake news detection over time and attack. ACM Transactions on Intelligent Systems and Technology (TIST), 11(1), 1-23.

Horne, B. D., Gruppi, M., & Adalı, S. (2020). Do all good actors look the same? Exploring news veracity detection across the US and the UK. arXiv preprint arXiv:2006.01211.

Hutchinson, B., Rostamzadeh, N., Greer, C., Heller, K., & Prabhakaran, V. (2022). Evaluation Gaps in Machine Learning Practice. arXiv preprint arXiv:2205.05256.

Pennycook, G., Bear, A., Collins, E. T., & Rand, D. G. (2020). The implied truth effect: Attaching warnings to a subset of fake news headlines increases perceived accuracy of headlines without warnings. Management Science, 66(11), 4944-4957.

Shu, K., Wang, S., & Liu, H. (2019, January). Beyond news contents: The role of social context for fake news detection. In Proceedings of the twelfth ACM international conference on web search and data mining (pp. 312-320).

Udupa, S., Maronikolakis, A., Schütze, H., & Wisiorek, A. (2022). Ethical Scaling for Content Moderation: Extreme Speech and the (In)Significance of Artificial Intelligence.

Cite this article in APA as: Horne, B. D. (2023, January 11). Is automated content moderation going to solve our misinformation problems. Information Matters, Vol. 3, Issue 1.

Benjamin D. Horne

Ben Horne is an Assistant Professor in the School of Information Sciences at The University of Tennessee Knoxville. He received his Ph.D. in Computer Science from Rensselaer Polytechnic Institute in Troy, New York, where he received the Robert McNaughton Prize for outstanding graduate in Computer Science. Dr. Horne is a highly interdisciplinary, computational social scientist whose research focuses on safety in media spaces. Broadly, this research includes analyzing disinformation, propaganda, conspiracy theories, and the like in both social media and news media. His work has been published in conference venues such as ICWSM and TheWebConference (WWW), and in journals such as ACM Transactions on Intelligent Systems and Technology, ACM Transactions on Social Computing, and Computers in Human Behavior. Additionally, Dr. Horne’s work has been widely covered in news media, such as Reuters, Business Insider, Mashable, IEEE Spectrum, and YLE.