Overview

We are running a new iteration of the competition on the detection of AI-generated scientific content! Find the previous competition page here, a summary of its results in this paper, and the list of this year's changes below.


Introduction

Generative Artificial Intelligence has become a hot topic in the publishing industry. The emergence of generative AI poses a significant problem for publishers, as generated content is now almost indistinguishable from human-written content. This gives rise to numerous research integrity challenges, such as the emergence of "paper mills", the publication of nonsensical papers, and other issues that compromise the overall quality of scientific knowledge disseminated to the wider community through research papers.

In the paper "The 'Problematic Paper Screener automatically selects suspect publications for post-publication (re)assessment" the authors (including the co-organizer of this competition Cyril Labbé) share a dashboard listing of thousands of published papers, some of which are entirely or partly computer-generated. In the paper "Tortured phrases: A dubious writing style emerging in science", the same authors made a call for the investigation of dozens of thousands of papers showing signs of "tortured phrases" and their relation to "paper mills". Specifically, they identified around 400 dubious papers published by "Microprocessors and Microsystems", which exhibited notably high GPT-2 detector scores. This suggests that these papers were likely computer-generated. With recent advances alike GPT-4, in the ongoing arms race between publishers and fraudsters, publishers find themselves at a disadvantage. Possessing a reliable AI-generated text detection system would enable them to more rapidly identify and reject nonsensical papers, including those coming from "paper mills".


Task

Given a long excerpt from the full text of a scientific paper, the task is to classify each token as either human-written or transformed with one of several Deep Learning-based models (e.g., a paraphraser or a text generator).

Example:

A record from the training set has a full text (a long string) and an annotation such as [[0, 3386, 'human'], [3387, 4929, 'summarized'], [4930, 17898, 'human'], [17899, 18923, 'gpt3']]. This means that tokens 0-3386 of the whitespace-separated text were not altered (i.e., are allegedly human-written), tokens 3387-4929 were transformed with a summarization model, tokens 4930-17898 were left intact, and tokens 17899-18923 were generated with GPT-3, with the first sentence used as a prompt.

The task is to produce such annotations for the test set.
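To make the annotation format concrete, here is a minimal Python sketch (not part of the official materials) that expands such span annotations into per-token labels. It assumes the span boundaries are inclusive indices into the whitespace-split text, as in the example above; the exact format of the released dataset may differ.

    # A minimal sketch, assuming inclusive token-index spans over whitespace-split
    # text (as in the example above); the released dataset format may differ.
    def spans_to_token_labels(full_text, annotation, default_label="human"):
        tokens = full_text.split()                 # whitespace tokenization
        labels = [default_label] * len(tokens)     # start with the default label
        for start, end, label in annotation:
            end = min(end, len(tokens) - 1)        # guard against out-of-range spans
            for i in range(start, end + 1):        # spans are treated as inclusive
                labels[i] = label
        return tokens, labels

    # Toy usage with a made-up text and annotation:
    text = "one two three four five"
    ann = [[0, 1, "human"], [2, 4, "gpt3"]]
    tokens, labels = spans_to_token_labels(text, ann)
    print(list(zip(tokens, labels)))
    # [('one', 'human'), ('two', 'human'), ('three', 'gpt3'), ('four', 'gpt3'), ('five', 'gpt3')]

A submission would then perform the inverse operation: predict a label for each token and merge consecutive tokens sharing the same label back into spans of the form above.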


Materials & tutorials

Baselines will be shared prior to or at the competition start (July 3rd). Please check the AIcrowd page (TBA). Meanwhile, you may find the following material useful for learning about the detection of AI-generated scientific content:


FAQ

When does the competition launch?
On July 3rd. Please check the Timeline section.

When the competition is over, am I obliged to submit my paper to NeurIPS?
There is no obligation; we invite the winners to contribute to the competition write-up.

Is the competition about the detection of ChatGPT-generated papers?
GPT-like models are indeed used as one of the means of creating AI-generated content, but not the only one. We make the task more tractable by also introducing some easier-to-spot methods, such as synonym replacement and paraphrasing with Spinbot.


Leaderboard & Evaluations

During the competition, we maintain a public leaderboard for models evaluated on the public test set. At the end of the competition, a private leaderboard will be computed for models evaluated on the private test set; this leaderboard will be used to decide the winners of the competition. The public leaderboard is meant to help participants gauge their model's performance and compare it with that of other participants.


Timeline

  • 3rd July 2023: Competition launch, the detailed baseline is released. Participants are invited to start submitting their solutions;
  • 30th September 2023: Submissions are closed and organizers begin the evaluation process;
  • October 2023: Winners are announced and are invited to contribute to the competition write-up;
  • 10th-16th of December 2023: Presentation at NeurIPS 2023

Prizes

Monetary

The challenge features a Total Cash Prize Pool of $5,000 USD. We will evaluate submissions as described in the Evaluation section. The three teams that score highest on this evaluation will receive prizes as follows:

  • 1st place: $3,000 USD
  • 2nd place: $1,000 USD
  • 3rd place: $1,000 USD

Authorship

In addition to the cash prizes, we will invite the top three teams to co-author the summary manuscript at the end of the competition. At our discretion, we may also award honorable mentions for academically interesting approaches. Honorable mentions will be invited to contribute a shorter section to the paper and will have their names included inline.


Changes from COLING SDP 2022

In 2022, we hosted the competition as part of the shared task "DAGPap22: Detecting automatically generated scientific papers", organized within the third workshop on Scholarly Document Processing (SDP 2022), held in association with the 29th International Conference on Computational Linguistics (COLING 2022). In that challenge, we proposed a binary classification task with human-written and machine-generated excerpts from papers on a wide range of topics. Despite almost perfect F1 scores on the competition leaderboard, we found that detectors trained on the competition data failed to generalize well to other sources of similar data, generated with new types of Deep Learning models and/or belonging to different subject areas.

We, therefore, introduce the following changes:

  • We focus on longer text excerpts of 2,500-4,000 characters (as opposed to 500 characters in the earlier version). This is closer to the real-world task of detecting AI-generated content in the full texts of articles, not only in abstracts;
  • We pose the task as span-level rather than document-level classification, and challenge participants to develop models capable of spotting AI-generated passages within long text excerpts. This is motivated by the concern that future scientific content might consist of both human-written and machine-generated text;
  • To check for robustness to model drift and data drift, we keep some competition data, coming from specific models and subject areas, in the test set only. This way, we encourage the models trained by competitors to generalize beyond the AI generation techniques and scientific subject areas seen in the training set.

Contact

The organizing team:

  • Yury Kashnitsky (Elsevier)
  • Savvas Chamezopoulos (Elsevier)
  • Domenic Rosati (scite.ai, Dalhousie University)
  • Cyril Labbé (Université Grenoble Alpes)
  • Drahomira Herrmannova (Elsevier)
  • Anita de Waard (Elsevier)
  • Georgios Tsatsaronis (Elsevier)

Please use dagpap@googlegroups.com for all communication with the organizing team.