Researchers have estimated that academics collectively spend over 100 million hours (~15,000 person-years) each year reviewing roughly 4.7 million manuscripts submitted for possible publication in some 30,000 scientific journals. That estimate covers only papers submitted to journals; it excludes the millions of papers posted to preprint repositories. Once those are counted, it becomes clear that many more reviewers are needed to ensure the scholarly record is properly vetted.

The Challenge of Finding Qualified Reviewers

Will academia find enough qualified researchers to fill the gap? If history is any guide, the answer is no. Decades of evidence suggest that many academics are not motivated to conduct thorough peer reviews, and it is easy to understand why. There are no career incentives that reward the activity, and particularly critical reviews can invite retaliation, which is one reason many reviewers remain anonymous. Furthermore, the PhD and tenure processes select for people more interested in the hard work of conducting and publishing new research than in reviewing the work of others. With little chance of those processes changing, the gap remains.

Introducing AI into the Peer Review Process

Deploying AI agents to review preprint papers can close this gap, but such adoption faces resistance. The main objections concern whether agents can truly understand the wider context a paper sits within, their known tendency to hallucinate, and the fact that many weaknesses do not lend themselves to binary true/false answers. Those are valid complaints, but only as applied to traditional holistic reviews. Referee's approach of decomposing research weaknesses to a granular level enables specialized generative AI agents both to identify weaknesses in papers and to evaluate submitted bounty claims of weaknesses. This is the breakthrough that allows a more complete assessment of the scholarly record. Let's address each of these complaints in turn.

Understanding Research Context

Publishing a paper is a contribution to a running conversation within a specific domain. The paper considers the prior arguments that have been presented (often referenced in the background and literature review sections) and builds on or challenges specific conclusions. The strength and importance of a paper's conclusions are typically evaluated as part of traditional peer review, a valuable assessment that requires years of experience and familiarity with the subject matter. However, this is not the focus of Referee reviews. Referee is concerned only with the reliability of the research presented; it considers neither a paper's importance nor its ethical dimensions. We see traditional journals keeping their roles in evaluating those aspects of research. When the focus is on reliability alone, great progress can be made.

Referee's common research weakness enumeration (CRWE) is a granular taxonomy of paper flaws that allows humans and AI agents to develop specialized capabilities in detecting specific weaknesses. An analogy can be made to economies. At one time, individuals and extended family units had to provide for all of their needs, from food and shelter to clothing and medical aid. Once specialization was introduced, however, economies began to grow expansively. The same expansion can occur in research evaluation. Not every academic is an expert in the latest statistical techniques or methodological approaches, yet they are asked to act as such experts when reviewing papers. This introduces numerous challenges that muddy the quality of the scholarly record.
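To make the idea concrete, here is a minimal sketch of what a granular weakness taxonomy might look like in code. The codes, categories, and criteria below are illustrative assumptions, not actual CRWE entries.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WeaknessType:
    """One entry in a hypothetical granular weakness taxonomy."""
    code: str          # stable identifier, e.g. "STAT-003" (illustrative)
    category: str      # broad family of weakness
    description: str   # what a claimant must demonstrate
    evidence: str      # the kind of evidence a claim must cite


# Illustrative entries only; the real CRWE defines its own codes and criteria.
EXAMPLE_TAXONOMY = [
    WeaknessType("STAT-003", "Statistics",
                 "P-value reported without the underlying test statistic",
                 "Quote the passage and name the missing statistic"),
    WeaknessType("DATA-011", "Data integrity",
                 "Figure totals inconsistent with the reported sample size",
                 "Cite the figure and table whose counts disagree"),
    WeaknessType("METH-007", "Methodology",
                 "Train/test contamination in the evaluation setup",
                 "Identify the overlapping records or the leaking step"),
]

if __name__ == "__main__":
    for w in EXAMPLE_TAXONOMY:
        print(f"{w.code} [{w.category}]: {w.description}")
```

Because each entry names both the flaw and the evidence required to demonstrate it, a human or an agent can specialize in detecting one narrow class of weakness rather than judging a whole paper.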

Generative AI has proven its capability to quickly analyze large volumes of data and identify specific inconsistencies or errors. Can these agents be trusted to produce results with the needed accuracy? With the right instructions, yes. The key is focusing each agent on a specific, narrowly defined topic. In the early stages, AI agents will have to be carefully monitored to ensure consistency and correctness; over time, the best-performing agents can be whitelisted to reduce the monitoring burden.
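As a hedged illustration of what "the right instructions" could mean in practice, the sketch below scopes an agent to a single weakness check rather than a holistic review. The prompt wording and the function name are assumptions made for this example.

```python
def build_focused_prompt(code: str, description: str, evidence: str,
                         paper_text: str) -> str:
    """Constrain a generative model to one narrowly defined check.

    The wording is illustrative; the point is that the agent answers a
    single, criteria-bound question rather than giving a holistic opinion.
    """
    return (
        f"You are checking a manuscript for exactly one weakness: "
        f"{code} ({description}).\n"
        f"Required evidence: {evidence}.\n"
        "If the weakness is present, quote the relevant passage and explain "
        "how it meets the criteria. Otherwise answer 'NOT FOUND'.\n\n"
        f"Manuscript:\n{paper_text}"
    )


if __name__ == "__main__":
    print(build_focused_prompt(
        "STAT-003",
        "P-value reported without the underlying test statistic",
        "Quote the passage and name the missing statistic",
        "…manuscript text would go here…"))
```

Narrow scoping also makes monitoring tractable: an agent answering one well-defined question can be audited against known examples far more easily than one producing open-ended reviews.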

Addressing Hallucinations

What about hallucinations? As with human submissions, any claim submitted by an AI agent must be approved by validators. Requiring a small fee per submission makes it costly for developers to let their agents submit poor or inaccurate claims. AI agents are also needed on the bounty validation side, a task they are well suited for. Generative AI models are consensus machines and are good at identifying Schelling points (i.e., the most likely answer to a question when parties cannot communicate). Stated simply, they can reliably judge whether the provided evidence meets the criteria for a specific research weakness. Hallucinations can be mitigated by adding multiple agents to a jury pool: while any one agent may hallucinate, they are highly unlikely to do so on the same claim at the same time. Drawing agents from different underlying models further reduces the risk of shared training-set biases.
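Below is a minimal sketch of how a multi-agent jury might aggregate independent verdicts on one bounty claim. The supermajority threshold and the aggregation function are illustrative assumptions; the point being shown is that uncorrelated hallucinations get voted down.

```python
from collections import Counter


def jury_verdict(verdicts: list[str], quorum: float = 2 / 3) -> str:
    """Aggregate independent agent verdicts ('valid' or 'invalid') on a claim.

    Because the agents evaluate the claim independently (ideally built on
    different base models), a stray hallucination by one agent is unlikely
    to be echoed by the others, so a supermajority requirement filters it out.
    """
    counts = Counter(verdicts)
    top, n = counts.most_common(1)[0]
    return top if n / len(verdicts) >= quorum else "no consensus"


if __name__ == "__main__":
    # Simulated jury: most agents agree, one gives an outlier answer.
    simulated = ["valid", "valid", "invalid", "valid", "valid"]
    print(jury_verdict(simulated))  # 4 of 5 agree -> "valid"
```

A "no consensus" outcome could then be escalated for closer scrutiny rather than silently accepted or rejected.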

Discretionary Judgements in Peer Review

The final objection to the use of AI agents is that many questions of research weakness don't lend themselves to the binary conclusions that vulnerabilities do in cybersecurity. But how are these difficult questions handled today? Reviewers apply variable, unobservable personal thresholds, which promotes neither transparency nor consistency. And while AI agents can also be opaque in their decisions, their temperature can be adjusted and multi-agent juries improve the robustness of decisions. As the CRWE grows more granular and its criteria sharper, the consistency of AI agents will improve with it. Reasonable people can debate the criteria, but such debates are exactly the kind of meta-discussion academia should be having about research reliability, and that discussion is best left to human experts.
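As an illustration of how a discretionary question can be made more transparent, the sketch below records each explicit criterion and its outcome before reaching a decision. The criteria names are hypothetical and the aggregation rule is an assumption for this example.

```python
def evaluate_claim(criteria_results: dict[str, bool]) -> tuple[str, list[str]]:
    """Turn explicit criteria checks into an inspectable decision.

    Rather than one reviewer applying an unobservable personal threshold,
    every criterion and its outcome is recorded, so the basis for the
    decision can be examined and debated.
    """
    failed = [name for name, passed in criteria_results.items() if not passed]
    decision = ("claim meets the weakness criteria" if not failed
                else "claim does not meet the criteria")
    return decision, failed


if __name__ == "__main__":
    # Hypothetical criteria for an 'unsupported causal claim' weakness.
    results = {
        "claim quotes the exact passage": True,
        "passage asserts causation": True,
        "study design supports only correlation": False,
    }
    decision, failed = evaluate_claim(results)
    print(decision, "| unmet criteria:", failed)
```

The debatable part, which criteria belong in the rubric and how strict they should be, is precisely the meta-discussion left to human experts.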

Conclusion: Embracing AI in Academic Reviewing

The integration of AI agents into the Referee ecosystem represents a lengthy yet transformative process that holds great promise for enhancing the quality and transparency of the scholarly record. By leveraging the precision and analytical capabilities of AI, academia can address the persistent challenges of peer review efficiency and bias. This technological shift not only aims to reduce the heavy workload on academics but also fosters a more dynamic and robust scholarly environment. As AI technologies improve and adapt, they could also facilitate a more inclusive and diverse academic dialogue by removing geographical and institutional barriers, ultimately enriching the global exchange of knowledge. Society should not only welcome this development but actively support and participate in its refinement and implementation.