Does Double-blind Peer Review Reduce Bias?
In June 1936, John Tate, then the editor of Physical Review, received a manuscript with the astonishing claim that gravitational waves do not exist. What was more astonishing, however, was that the claim came from the very scientist who had first predicted gravitational waves: Albert Einstein. Tate sent the manuscript out for peer review. On July 17, a 10-page reviewer report came back. Apparently, the reviewer disagreed with Einstein. Einstein, who had never before been challenged by peer review, angrily withdrew the submission. Months later, the paper appeared in another journal with substantial modifications, and gravitational waves were once again possible. According to Einstein, his colleague Howard P. Robertson had spotted errors in the original manuscript. The paper has since been cited more than 700 times and stands among Einstein's many seminal contributions. It turned out Einstein could have made those corrections much earlier: the anonymous reviewer Tate had invited was none other than Robertson.
As annoying as it was for Einstein, this is peer review at its best. Ideally, when work is evaluated by experts in the field, the process filters out poor research and identifies and polishes discoveries into real gems. However, a long-standing concern is reviewer bias, which refers broadly to evaluating scientific work on factors other than its quality alone. For example, reviewers might discriminate against authors who are not famous, authors who are competitors, and authors from traditionally underrepresented groups. The consequences of reviewer bias can be severe: important discoveries may be delayed, promising research careers stymied, and public trust in science jeopardized. Because many of the mechanisms driving bias depend on reviewers knowing who the authors are, a natural remedy is to mask the authors' identities during the review process, a practice often called double-blind peer review. But does double-blind peer review actually reduce bias? Another long-standing concern with peer review is its reliability: can an author expect to receive the same decision no matter who happens to be the reviewer, or what the reviewer's mood is that day? At the extreme, if decisions are so noisy as to be nearly unpredictable, why do peer review at all? Might double-blind peer review, intended to reduce bias, unintentionally increase or decrease reliability?
We sought to investigate the bias and reliability of double-blind peer review using the review records of 5,027 papers submitted to the International Conference on Learning Representations (ICLR), one of the top conferences in Artificial Intelligence (AI). In 2018, ICLR switched from traditional single-blind peer review, in which reviewers could see authors' identities, to double-blind peer review. This policy change allows us to evaluate the effect of double-blind peer review by comparing the outcomes of papers submitted before 2018 with those of papers submitted in 2018 and after.
So what did we find? First, we found that double-blind peer review moderately reduces prestige bias: the top one-third of research groups in terms of citations, a useful proxy for prestige, receive significantly lower reviewer ratings in the double-blind setting than in the single-blind one. However, the reduction is moderate and seems insufficient to substantially affect the acceptance rate, primarily because in many instances the scores are above the acceptance threshold whether the paper is reviewed single- or double-blind. Intriguingly, the bottom one-third of research groups did not enjoy a boost in their ratings. A possible explanation is that, while reviewers might typically give famous researchers a "premium," they already treat non-famous researchers as more or less anonymous. Surprisingly, double-blind review decreases reliability: the variation among reviewer ratings for the same paper increases significantly. Apparently, when the authors' identities are masked, reviewers disagree more.
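The two comparisons above can be sketched in a few lines of analysis code. The data and field names below are purely illustrative (not the actual ICLR records): bias is probed by comparing average ratings across regimes, and reliability by comparing the average within-paper variance of ratings.

```python
from statistics import mean, variance

# Illustrative review records: (paper_id, regime, rating).
# "single" = single-blind (pre-2018); "double" = double-blind (2018+).
reviews = [
    ("p1", "single", 8), ("p1", "single", 7), ("p1", "single", 8),
    ("p2", "single", 6), ("p2", "single", 7), ("p2", "single", 6),
    ("p3", "double", 7), ("p3", "double", 4), ("p3", "double", 8),
    ("p4", "double", 5), ("p4", "double", 8), ("p4", "double", 4),
]

def mean_rating(regime):
    """Average rating across all reviews in a regime (bias comparison)."""
    return mean(r for _, g, r in reviews if g == regime)

def mean_within_paper_variance(regime):
    """Average per-paper rating variance in a regime (reliability comparison)."""
    papers = {p for p, g, _ in reviews if g == regime}
    return mean(
        variance([r for q, _, r in reviews if q == p]) for p in papers
    )
```

With toy data like this, a lower `mean_rating` under double-blinding would mirror the prestige-bias reduction, while a higher `mean_within_paper_variance` would mirror the drop in reliability; the real study, of course, works with thousands of papers and controls for confounds.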
An arguably disappointing lesson from this work is that double-blind peer review does not reduce bias as strongly as one would hope, and it actually decreases reliability. However, we examined only the impact of double-blind peer review on prestige bias. There are numerous other identity-related biases in peer review, such as gender bias and homophily bias, that double-blinding may still reduce. Even if double-blind peer review has only a small effect on each of these biases, the aggregated benefits may be considerable. To test this possibility, we compared the quality of papers rejected under the single-blind versus double-blind formats using their citation counts. If double-blind review evaluates papers more purely on their quality, it should be less likely to reject high-quality papers. Indeed, we find that papers rejected under double-blind peer review gathered significantly fewer citations within two years. This suggests that double-blinding does improve the effectiveness of peer review.
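As a toy illustration of this evaluation criterion (the numbers below are made up, not the paper's data), one could compare the typical two-year citation counts of rejected papers under the two regimes:

```python
from statistics import median

# Hypothetical 2-year citation counts for rejected papers (illustrative only).
rejected_citations = {
    "single": [12, 30, 5, 44, 18, 9],  # rejected under single-blind review
    "double": [3, 8, 1, 15, 6, 2],     # rejected under double-blind review
}

def median_citations(regime):
    """Median 2-year citation count among rejected papers in a regime."""
    return median(rejected_citations[regime])
```

If double-blind review rejects fewer high-quality papers, the double-blind median should be the lower of the two, which is the pattern the study reports in the real data.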
But our investigation did not stop there. In 2020, ICLR changed the rating scale reviewers use to score submitted papers from a fine-grained 10-point scale to a coarse 4-point scale, probably to reduce cognitive burden. Can this seemingly innocuous change affect reviewer bias? Scales and bias might seem totally unrelated, but they are not. Previous literature on teaching evaluations suggests that changing rating scales can sometimes reduce bias, for two possible reasons. First, a coarse scale prevents evaluators from expressing their subtle preferences. Second, the highest score matters: some scores, such as "10 out of 10," carry a cultural meaning of perfection, and reviewers might be reluctant to give that score to those who are not stereotypically associated with perfection. To see whether these mechanisms might apply in peer review, we compared the peer review of papers in ICLR 2020 with that of previous years. To our surprise, we found a significant effect of the coarser scale in reducing the scores given to the most prestigious research groups. Unlike the effect of blinding, the rating-scale change is significant enough to substantively affect acceptance decisions. In particular, it is about four times as effective as double-blind peer review in reducing prestige bias. Because the data are so recent, we cannot yet assess whether the scale change is also more effective at filtering out bad papers, which is, of course, an interesting question for future work.
More research is undoubtedly needed to identify and test the best ways to evaluate scientific work. Our research supports the usefulness of double-blinding in reducing bias, shows that an unexpected side effect may be increased disagreement among reviewers, and suggests that rating scales might be an even more potent remedy against bias.
For the details of this research, please see:
Sun, M., Barry Danfa, J., & Teplitskiy, M. (2021). Does double-blind peer review reduce bias? Evidence from a top computer science conference. Journal of the Association for Information Science and Technology, 1–9. https://doi.org/10.1002/asi.24582
Kennefick, D. (2005). Einstein versus the Physical Review. Physics Today, 58, 43–48. https://doi.org/10.1063/1.2117822
Cite this article in APA as: Sun, M. (2022, January 20). Does double-blind peer review reduce bias? Information Matters. Vol.2, Issue 1. https://r7q.22f.myftpupload.com/2022/01/does-double-blind-peer-review-reduce-bias/