We show example annotated bounding boxes for each model (labeled on the right). Bounding boxes are drawn in red, with their artifact category and user-annotated description shown below the frames; video quality (“Overall”) and video-prompt alignment (“Prompt”) ratings are shown to the left. The radar plots summarize, for each artifact category, the category count, the average video-prompt alignment, and the average video quality rating conditioned on the user-selected artifact categories. Statistics are reported for the three models in our dataset: Sora, VideoCrafter2, and Pika. See additional examples in the Appendix.
Abstract
Videos generated by current state-of-the-art generative models contain undesirable artifacts. We introduce GeneVA, the first large-scale dataset of human-annotated artifact bounding boxes in AI-generated videos. The dataset consists of 16,356 AI-generated videos, each labeled by a human annotator with per-frame artifact bounding boxes, artifact category labels and descriptions, and video quality ratings. To build it, we developed a custom data collection pipeline in Prolific and defined a novel taxonomy of spatio-temporal artifacts in AI-generated videos. The videos are drawn from the VidProM [41] dataset, and the text prompts from that dataset were also used to generate an additional subset of videos with Sora. We further train an artifact detector and caption generator that combines a pre-trained image-based model with a custom temporal fusion module. The dataset can be found at dummylink.com. We hope that datasets like GeneVA will encourage improvements in artifact detection in AI-generated video, towards applications such as deepfake detection.
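To make the described annotation structure concrete, the sketch below shows one plausible per-video record layout, assuming illustrative field and class names (e.g. `VideoAnnotation`, `bbox`, `overall_quality`); it is not the released schema, only a reading aid for the abstract's description of per-frame boxes, labels, descriptions, and quality ratings.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical schema sketch for a GeneVA annotation record.
# Field names are assumptions for illustration, not the dataset's actual format.

@dataclass
class ArtifactBox:
    frame_index: int        # frame on which the box is drawn
    bbox: List[float]       # [x_min, y_min, x_max, y_max] in pixel coordinates
    category: str           # artifact category from the taxonomy
    description: str        # free-form annotator description

@dataclass
class VideoAnnotation:
    video_id: str           # identifier of the generated video
    model: str              # e.g. "Sora", "VideoCrafter2", or "Pika"
    prompt: str             # text prompt used to generate the video
    overall_quality: int    # "Overall" video quality rating
    prompt_alignment: int   # "Prompt" video-prompt alignment rating
    artifacts: List[ArtifactBox] = field(default_factory=list)
```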