The Methodology of the Super-Resolution Quality Metrics Benchmark

Problem Definition

Images and videos are often distributed at less than ideal quality, which makes Super-Resolution (SR) an important and relevant task.

Super-Resolution (SR) is the task of obtaining high-resolution (HR) images or videos from low-resolution (LR) ones.

In recent years, hundreds of SR methods have been developed. However, models trained on particular data often distort a video or image instead of improving it, especially when the source is of poor quality. It is therefore important to have an objective metric that makes it easy to evaluate a given method. The main requirement for such a metric is that its scores agree with subjective human assessments.

Is there a universal metric? So far, no. PSNR and SSIM, used for many years, have long been shown to be inadequate[1]. Metrics trained with deep learning on very large datasets are becoming increasingly popular. But no single dataset can cover every case in which SR is applied, and no single approach makes a metric work everywhere. In practice, you can only choose a metric that suits your specific task.

The goal of our benchmark is to evaluate how well existing image and video quality metrics perform on the SR task and to find the best ones.

Our Super-Resolution Quality Metrics Benchmark includes >60 different metrics, both full-reference (FR, requiring reference data) and no-reference (NR, working directly on distorted images and videos). All of them were tested on videos that differ in resolution, FPS, bitrate, colorfulness, and spatial and temporal information. This set of videos covers as many SR use cases as possible. We evaluate each metric by correlating its scores with the subjective scores of the distorted videos. We compute three types of correlation: Pearson’s linear correlation coefficient (PLCC), Spearman’s rank-order correlation coefficient (SROCC), and Kendall’s rank-order correlation coefficient (KROCC).

To make it easier to choose the most appropriate metric, we also measured the runtime of each one. Runtime is reported as FPS (the number of frames the metric can process per second).

All results are presented as tables and “live” graphs so that metrics can be compared at a glance.


Figure 1. The dataset sample

The dataset for this benchmark includes videos from the following datasets:

  1. Our SR Datasets
  2. SR+Codecs Dataset
  3. VSR Benchmark Dataset
  4. VUB Benchmark Dataset

For all GT videos from these datasets, we calculated the following features: bitrate, colorfulness, FPS, resolution, spatial information (SI) and temporal information (TI). For spatial complexity, we calculated the average size of x264-encoded I-frames normalized to the uncompressed frame size. For temporal complexity, we calculated the average P-frame size divided by the average I-frame size. Using a simple opponent color space representation[2], we calculated the colorfulness[3] of every video. Bitrate, FPS and resolution were obtained with ffmpeg. We then divided the whole collection into 30 clusters using the K-means algorithm and chose one video from each cluster.
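The feature definitions above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the benchmark's code: the function names are hypothetical, the uncompressed frame size assumes 8-bit YUV 4:2:0 (1.5 bytes per pixel), and the colorfulness formula assumes the commonly used Hasler and Süsstrunk measure on the opponent channels rg = R - G and yb = (R + G)/2 - B, which the text does not spell out. How the per-frame sizes are extracted from the x264 stream is left out.

```python
from math import sqrt
from statistics import mean, pstdev


def spatial_complexity(i_frame_bytes, width, height, bytes_per_pixel=1.5):
    """Average I-frame size relative to the uncompressed frame size.

    bytes_per_pixel=1.5 assumes 8-bit YUV 4:2:0 (an assumption).
    """
    return mean(i_frame_bytes) / (width * height * bytes_per_pixel)


def temporal_complexity(p_frame_bytes, i_frame_bytes):
    """Average P-frame size divided by average I-frame size."""
    return mean(p_frame_bytes) / mean(i_frame_bytes)


def colorfulness(pixels):
    """Colorfulness of a list of (R, G, B) pixels.

    Assumed formulation: sqrt(std_rg^2 + std_yb^2) + 0.3 * sqrt(mean_rg^2 + mean_yb^2).
    """
    rg = [r - g for r, g, b in pixels]
    yb = [0.5 * (r + g) - b for r, g, b in pixels]
    spread = sqrt(pstdev(rg) ** 2 + pstdev(yb) ** 2)
    offset = sqrt(mean(rg) ** 2 + mean(yb) ** 2)
    return spread + 0.3 * offset


# Toy example: encoded frame sizes in bytes for a 480x270 clip.
i_sizes = [24000, 26000]
p_sizes = [5000, 7000, 6000]
print(spatial_complexity(i_sizes, 480, 270))
print(temporal_complexity(p_sizes, i_sizes))
print(colorfulness([(255, 0, 0), (0, 0, 255), (30, 200, 40)]))
```

A gray image has rg = yb = 0 everywhere, so its colorfulness is zero; a clip with little motion has small P-frames relative to I-frames and therefore low temporal complexity.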

Thus, the final dataset consists of 30 reference (ground-truth, GT) videos and the 1187 distorted videos derived from them.
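The cluster-then-select step can be sketched as follows. This is a minimal plain-Python K-means (Lloyd's algorithm), not the implementation used for the benchmark. The text does not say how the single video is chosen within a cluster, so picking the video closest to the centroid is an assumption, as are all function names and the toy feature vectors.

```python
import random
from math import dist


def _assign(points, centroids):
    """Label each point with the index of its nearest centroid."""
    return [min(range(len(centroids)), key=lambda c: dist(p, centroids[c]))
            for p in points]


def kmeans(points, k, iters=100, seed=0):
    """Plain Lloyd's algorithm on a list of equal-length tuples."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        labels = _assign(points, centroids)
        updated = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                updated.append(tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                updated.append(centroids[c])  # keep an empty cluster's centroid
        if updated == centroids:
            break
        centroids = updated
    return _assign(points, centroids), centroids


def pick_representatives(points, labels, centroids):
    """One point index per cluster: the member closest to the centroid (assumed rule)."""
    chosen = []
    for c, centre in enumerate(centroids):
        members = [i for i, label in enumerate(labels) if label == c]
        if members:
            chosen.append(min(members, key=lambda i: dist(points[i], centre)))
    return chosen


# Toy feature vectors (e.g. scaled SI and TI) for six candidate videos.
features = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3), (9.0, 9.1), (9.3, 8.8), (8.9, 9.4)]
labels, centroids = kmeans(features, k=2)
print(pick_representatives(features, labels, centroids))
```

K-means relies on Euclidean distance, so features on very different scales (bitrate versus FPS, for example) would normally be scaled before clustering; whether and how that was done here is not stated in the text.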

Figure 2. The feature distribution of the selected videos for each dataset

When creating the benchmark, one of our main goals was a dataset that is as complete and non-redundant as possible, so that results on it stay close to results on real-world data.

What is our dataset made of? It includes:

  1. Indoor and outdoor videos

  2. Videos with text, both overlaid on the video and captured by the camera

  3. Scenes with moving objects and moving camera

  4. Architecture, people and complex textures

The dataset consists of real videos (filmed with 2 cameras), video game footage, movies, cartoons and dynamic ads. Its completeness is further supported by the fact that the videos were taken from 4 datasets covering different tasks.

Dataset                  Number of GT videos taken   Number of videos taken   Number of SR methods   Subjective Scores
SR Dataset               8                           239                      4                      34510
SR+Codecs Dataset        3                           450                      9                      65250
VSR Benchmark Dataset    7                           258                      32                     46260
VUB Benchmark Dataset    12                          240                      41                     22800
Overall                  30                          1187                     46                     168820

Figure 3. Distribution of the selected videos by dataset

How did we bring our dataset closer to completeness?

  • The dataset covers a large number of use cases in the field of SR due to the large number of content types

  • The dataset contains videos with a wide range of resolutions and FPS values (8, 24, 25, 30, 60), as well as both high and low spatio-temporal complexity

  • Distorted videos were obtained using 46 SR methods; some of them were preprocessed with 5 codecs (aomenc, vvenc, x264, x265, uavs3es) at different bitrates and QP values

  • The dataset was manually checked for redundancy

Videos from the benchmarks are crops of FullHD videos, since the subjective comparison was made on crops. Therefore, the resolution of all videos in the resulting dataset is low.

The dataset contains videos with the following resolutions:

  1. 480×270

  2. 200×170

  3. 110×80

  4. 320×270

  5. 120×90

  6. 180×150

  7. 130×100

  8. 360×270

This covers as many as possible of the formats in which reference videos may be available.


As mentioned above, the benchmark contains >60 metrics, including both NR and FR, suitable for different use cases. Some of them take video as input (Video Quality Assessment (VQA) metrics), while others take images (Image Quality Assessment (IQA) metrics).

Correlation with Subjective Scores

All metrics were tested on 1187 distorted videos (>100000 frames).

The values of image quality metrics were calculated on each frame separately and then averaged over all frames. The Pearson, Spearman and Kendall correlation coefficients were then calculated separately for each group of videos (a group being all videos with the same GT). The correlation for each of the four datasets mentioned earlier is the mean of the correlations over all groups from that dataset.

The correlation on the full dataset was also calculated as the mean of the correlations on these four datasets. This is only a rough estimate; it is better to look at the correlation on each dataset separately.
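The aggregation described above (frame averaging, per-group correlation, per-dataset mean, overall mean) can be sketched as follows. The row layout, function names and toy values are assumptions for illustration only; Pearson's coefficient is used as the example, and any of the three coefficients can be passed in as `corr`.

```python
from collections import defaultdict
from math import sqrt
from statistics import mean


def pearson(x, y):
    """Pearson's linear correlation coefficient."""
    mean_x, mean_y = mean(x), mean(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def video_score(frame_scores):
    """An IQA metric's score for a video: the mean over its frames."""
    return mean(frame_scores)


def benchmark_correlation(rows, corr=pearson):
    """rows: (dataset, gt_video, metric_score, subjective_score) per distorted video.

    Returns (per-dataset correlations, overall correlation).
    """
    groups = defaultdict(lambda: ([], []))
    for dataset, gt_video, metric_score, subjective_score in rows:
        xs, ys = groups[(dataset, gt_video)]
        xs.append(metric_score)
        ys.append(subjective_score)

    # One correlation per group of videos that share a GT video.
    per_dataset = defaultdict(list)
    for (dataset, _gt_video), (xs, ys) in groups.items():
        per_dataset[dataset].append(corr(xs, ys))

    # Dataset score = mean over its groups; overall = mean over datasets.
    dataset_scores = {d: mean(values) for d, values in per_dataset.items()}
    overall = mean(list(dataset_scores.values()))
    return dataset_scores, overall


# Toy rows: two datasets, three groups, three distorted videos per group.
rows = [
    ("A", "a1", 1.0, 1.0), ("A", "a1", 2.0, 2.0), ("A", "a1", 3.0, 3.0),
    ("A", "a2", 1.0, 3.0), ("A", "a2", 2.0, 2.0), ("A", "a2", 3.0, 1.0),
    ("B", "b1", video_score([0.9, 1.1]), 2.0), ("B", "b1", 2.0, 4.0), ("B", "b1", 3.0, 6.0),
]
dataset_scores, overall = benchmark_correlation(rows)
print(dataset_scores, overall)
```

Correlating within a group compares only videos that share the same content, so differences in scene difficulty between GT videos do not leak into the coefficient.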


Figure 4. Videos on which metrics' runtime was tested

Overall, we chose 5 GT videos and 1 distorted video for each of them.

For IQA metrics, the runtime is the average processing time per frame.

For VQA metrics, the per-frame runtime was calculated as the metric's runtime on the full video divided by the number of frames.

We measured the runtime on each GT video 3 times and kept the minimum time for each video. The final runtime of a metric is the mean of these per-video runtimes.
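The timing protocol above can be sketched as follows: time each video several times, keep the fastest run, divide by the frame count, and average across videos. The function names and the toy metric are hypothetical, and converting the mean seconds-per-frame into FPS by taking its reciprocal is an assumption about how the reported FPS is derived.

```python
import time
from statistics import mean


def measure(metric, video, repeats=3):
    """Wall-clock time of `metric(video)`, measured `repeats` times."""
    run_times = []
    for _ in range(repeats):
        start = time.perf_counter()
        metric(video)
        run_times.append(time.perf_counter() - start)
    return run_times


def seconds_per_frame(run_times, n_frames):
    """Fastest run on one video, divided by the number of frames."""
    return min(run_times) / n_frames


def benchmark_fps(per_video):
    """per_video: list of (run_times, n_frames) pairs, one per GT video.

    Mean seconds-per-frame over videos, reported as FPS (assumed conversion).
    """
    return 1.0 / mean([seconds_per_frame(t, n) for t, n in per_video])


def toy_metric(frames):
    """Stand-in for a real quality metric (hypothetical)."""
    return sum(f * f for f in frames) / len(frames)


# Two toy "videos", each represented as a list of per-frame values.
videos = [list(range(10_000)), list(range(20_000))]
measurements = [(measure(toy_metric, v), len(v)) for v in videos]
print(benchmark_fps(measurements))
```

Taking the minimum of repeated runs reduces the influence of background load on the machine. For an IQA metric applied frame by frame, the total time divided by the frame count equals the average per-frame time, so the same helper covers both cases.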

Calculations were made using the following hardware:

  • CPU: 64-CPU cluster, AMD EPYC 7532, 32×2.4 GHz

Future work

  • Expand the dataset
  • Run more metrics
  • Create our own metric
  • Publish a paper about our work
  • Research metrics’ mistakes and Super-Resolution artifacts
  • Add some new interesting tracks to the benchmark



  2. C. A. Bouman, “Opponent Color Spaces”, in Digital Image Processing, 2023.


05 May 2024