Video Quality Assessment Dataset

G&M Lab head: Dr. Dmitriy Vatolin
Measurements, analysis: Aleksandr Gushchin, Maxim Smirnov, Anastasia Antsiferova, Eugene Lyapustin

Key features

  • 40 different video codecs of 10 compression standards
  • 2500+ compressed streams
  • 780,000+ subjective scores
  • 10,000+ viewers
  • Includes user-generated content
  • Published open part of the dataset

Downloads

NeurIPS Paper (2022)

  • PDF
  • OpenReview link
  • Supplementary materials
  • GitHub

Dataset

To download the database, please fill in the request form. You will receive a download link for all data via e-mail.

Methodology

Video Compression

To analyze the relevance of quality metrics to video compression, we collected a special dataset of videos exhibiting various compression artifacts. For video-compression-quality measurement, the original videos should have a high bitrate or, ideally, be uncompressed to avoid recompression artifacts. We chose from a pool of more than 18,000 high-bitrate open-source videos from www.vimeo.com. Our search included a variety of minor keywords to provide maximum coverage of potential results, for example “a,” “the,” “of,” “in,” “be,” and “to.” We downloaded only videos available under CC BY and CC0 licenses with a minimum bitrate of 20 Mbps; the average bitrate of the collection was 130 Mbps. We converted all videos to YUV 4:2:0 chroma subsampling.

Our selection employed space-time-complexity clustering to obtain a representative complexity distribution. For spatial complexity, we calculated the average size of x264-encoded I-frames normalized to the uncompressed frame size; for temporal complexity, the average P-frame size divided by the average I-frame size. We divided the whole collection into 36 clusters using the K-means algorithm and, for each cluster, randomly selected up to 10 candidate videos close to the cluster center. From each cluster’s candidates we manually chose one video, attempting to include different genres in the final dataset (sports, gaming, nature, interviews, UGC, etc.). The result was 36 FullHD videos for further compression.
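The clustering step can be illustrated with a short sketch. The feature computation and candidate selection below follow the description above, but the helper names and the use of scikit-learn's KMeans are assumptions for illustration, not the exact pipeline used.

	# Sketch: space-time-complexity features and K-means clustering of the source pool.
	# Spatial complexity: mean x264 I-frame size normalized by the uncompressed frame size.
	# Temporal complexity: mean P-frame size divided by the mean I-frame size.
	import numpy as np
	from sklearn.cluster import KMeans

	def complexity_features(i_frame_sizes, p_frame_sizes, uncompressed_frame_size):
	    spatial = np.mean(i_frame_sizes) / uncompressed_frame_size
	    temporal = np.mean(p_frame_sizes) / np.mean(i_frame_sizes)
	    return spatial, temporal

	def select_candidates(features, n_clusters=36, n_candidates=10, seed=0):
	    """Cluster videos by complexity and keep up to n_candidates closest to each center."""
	    features = np.asarray(features)            # shape: (n_videos, 2)
	    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit(features)
	    candidates = {}
	    for c in range(n_clusters):
	        idx = np.where(km.labels_ == c)[0]
	        dist = np.linalg.norm(features[idx] - km.cluster_centers_[c], axis=1)
	        candidates[c] = idx[np.argsort(dist)][:n_candidates]
	    return candidates                          # one video per cluster is then picked manually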

PSNR Range/Uniformity comparison (left); SI/TI characteristics and clusters during dataset creation (right)

We obtained numerous coding artifacts by compressing videos with several encoders: 11 H.265/HEVC encoders, 5 AV1 encoders, 2 H.264/AVC encoders, and 4 encoders based on other standards. To increase the diversity of coding artifacts, we also used two different presets for many encoders: one providing a 30 FPS encoding speed and the other providing a 1 FPS speed and higher quality. The list of settings for each encoder is presented in the supplementary materials. Not all videos underwent compression with all encoders. We compressed each video at three target bitrates (1,000 kbps, 2,000 kbps, and 4,000 kbps) using VBR mode for encoders that support it, or with QP/CRF values that produce these bitrates. We avoided higher target bitrates because compression artifacts become almost unnoticeable at them, hindering subjective comparison. The figure below shows the distribution of video bitrates in our dataset. The distribution differs from the target encoding rates because we used VBR mode, but it complies with typical recommendations.
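As a rough illustration of the encoding setup, here is a minimal sketch that drives a single encoder (ffmpeg's libx265 wrapper) at the three target bitrates in VBR mode. The dataset itself was built with 22 different encoders, each with its own options listed in the supplementary materials, so the flags and preset below are illustrative assumptions rather than the actual commands used.

	# Sketch: encode one raw YUV 4:2:0 source at the three target bitrates in VBR mode.
	# Only ffmpeg/libx265 is shown; the real dataset used 22 different encoders.
	import subprocess

	TARGET_BITRATES_KBPS = [1000, 2000, 4000]

	def encode_vbr(src_yuv, width, height, fps, preset="medium"):
	    for kbps in TARGET_BITRATES_KBPS:
	        out = f"encoded_{kbps}kbps.mkv"
	        subprocess.run([
	            "ffmpeg", "-y",
	            "-f", "rawvideo", "-pix_fmt", "yuv420p",
	            "-s", f"{width}x{height}", "-r", str(fps),
	            "-i", src_yuv,
	            "-c:v", "libx265", "-preset", preset,
	            "-b:v", f"{kbps}k",      # target average bitrate (VBR)
	            out,
	        ], check=True)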

Bitrate distribution of distorted videos

The dataset falls into two parts: open and hidden (40% and 60% of the entire dataset, respectively). We employ the hidden part only for testing through our benchmark to ensure a more objective comparison of future methods. This approach helps prevent learning-based methods from training on the entire dataset, thereby avoiding overfitting and misleading results. To divide the dataset, we split the codec list in two; each encoded video resides in the part corresponding to its codec. We also performed x265-lossless encoding of all compressed streams to simplify further evaluations and avoid issues with nonstandard decoders. Links to the source videos and additional details about the collection process are in the supplementary materials. We also compared the PSNR range and uniformity statistics of our dataset using the approach in [1].

Subjective-Score Collection

We collected subjective scores for our video dataset through the Subjectify.us crowdsourcing platform. Subjectify.us is a service for pairwise comparisons; it employs the Bradley-Terry model to transform the results of pairwise voting into a score for each video. A more detailed description of the method is available at www.subjectify.us. Because the number of pairwise comparisons grows quadratically with the number of videos being compared, we divided the dataset into five subsets by source video and performed five separate comparisons. Each subset contained a group of source videos and their compressed versions. Every comparison generated and evaluated all possible pairs of compressed videos for one source video; thus, each pair contained only videos derived from the same source. The comparison sets also included the source videos. Participants viewed the videos of each pair sequentially in full-screen mode. They were asked to choose the video with the better visual quality or to indicate that the two have the same quality, and they could replay the videos. Each participant compared a total of 12 pairs, two of which had an obviously higher-quality option and served as verification questions. All responses from participants who failed to answer the verification questions correctly were discarded. To increase the reliability of the results, we collected at least 10 responses for each pair.

In total, we collected 766,362 valid answers from nearly 11,000 participants. After applying the Bradley-Terry model to the table of pairwise votes, we obtained subjective scores that are consistent within each group of videos compressed from the same reference video.
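For intuition, a minimal sketch of how a Bradley-Terry model turns pairwise win counts into per-video scores follows. It uses the standard iterative maximum-likelihood update; the exact fitting procedure and tie handling used by Subjectify.us may differ, so treat this as an illustration under those assumptions.

	# Sketch: Bradley-Terry score fitting from a matrix of pairwise wins.
	# wins[i, j] = number of times video i was preferred over video j (diagonal is zero).
	# Ties could be split as half a win for each side; the platform's handling may differ.
	import numpy as np

	def bradley_terry(wins, n_iter=1000, tol=1e-8):
	    comparisons = wins + wins.T              # n_ij: total comparisons of pair (i, j)
	    total_wins = wins.sum(axis=1)            # W_i: total wins of video i
	    p = np.ones(wins.shape[0])               # initial strengths
	    for _ in range(n_iter):
	        denom = comparisons / (p[:, None] + p[None, :])
	        np.fill_diagonal(denom, 0.0)
	        p_new = total_wins / denom.sum(axis=1)
	        p_new /= p_new.sum()                 # fix the overall scale
	        if np.max(np.abs(p_new - p)) < tol:
	            return np.log(p_new)             # log-strengths serve as subjective scores
	        p = p_new
	    return np.log(p)

	# Example: three compressed versions of one source, 10+ votes per pair
	votes = np.array([[0, 7, 9],
	                  [3, 0, 8],
	                  [1, 2, 0]], dtype=float)
	scores = bradley_terry(votes)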

Reference

  1. Stefan Winkler. Analysis of public image and video databases for quality assessment. IEEE Journal of Selected Topics in Signal Processing, 6(6):616–625, 2012.

Citation

	@inproceedings{antsiferova2022video,
		title={Video compression dataset and benchmark of learning-based video-quality metrics},
		author={Anastasia Antsiferova and Sergey Lavrushkin and Maksim Smirnov and Aleksandr Gushchin and Dmitriy S. Vatolin and Dmitriy Kulikov},
		booktitle={Thirty-sixth Conference on Neural Information Processing Systems Datasets and Benchmarks Track},
		year={2022},
		url={https://openreview.net/forum?id=My5AI9aM49R}
	}
12 Dec 2022
See Also
MSU Video Upscalers Benchmark 2022
The most extensive comparison of video super-resolution (VSR) algorithms by subjective quality
Real-World Stereo Color and Sharpness Mismatch Dataset
Download new real-world video dataset of stereo color and sharpness mismatches
Forecasting of viewers’ discomfort
How do distortions in a stereo movie affect the discomfort of viewers?
SAVAM - the database
During our work we have created the database of human eye-movements captured while viewing various videos