The Methodology of the Super-Resolution Quality Metrics Benchmark

Problem Definition

Images and videos are often distributed at low quality, which makes Super-Resolution (SR) an important and relevant task.

Super-Resolution (SR) is the task of obtaining high-resolution (HR) images or videos from low-resolution (LR) ones.

In recent years, hundreds of SR methods have been developed. However, models trained on particular data often distort a video or image instead of improving it, especially when the input is of poor quality. It is therefore important to have an objective metric that makes it easy to evaluate a given method. The main requirement for such a metric is that its scores agree with subjective human assessments.

Is there a universal metric? So far, definitely not. PSNR and SSIM, used for many years, have long been shown to be unreliable^[1]. Metrics trained with deep learning on very large datasets are becoming increasingly popular. But no single dataset can cover every case to which SR applies, and no single approach can make a metric work everywhere. Therefore, you can only choose a metric suited to your specific task.

The goal of our benchmark is to evaluate how well existing image and video quality metrics perform on the SR task and to find the best of them.

Our Super-Resolution Quality Metrics Benchmark includes >60 different metrics, both full-reference (FR, requiring reference data) and no-reference (NR, working directly on distorted images and videos). All of them were tested on videos that differ in resolution, FPS, bitrate, colorfulness, and spatial and temporal information. This set covers as many SR use cases as possible. We evaluate each metric by correlating its scores with the subjective scores of the distorted videos. We compute three types of correlation: Pearson’s linear correlation coefficient (PLCC), Spearman’s rank-order correlation coefficient (SROCC), and Kendall’s rank-order correlation coefficient (KROCC).
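As an illustration of the three coefficients named above, here is a minimal pure-Python sketch. It is not the benchmark's actual code: the function names are hypothetical, and Kendall's coefficient is computed in its tie-corrected tau-b form, which is an assumption about the variant used.

```python
import math


def pearson(x, y):
    """Pearson's linear correlation coefficient (PLCC)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x, y):
    """SROCC: Pearson correlation computed on the ranks."""
    return pearson(ranks(x), ranks(y))


def kendall(x, y):
    """KROCC in the tau-b form (assumed variant), O(n^2) pair counting."""
    concordant = discordant = only_x_tied = only_y_tied = 0
    for i in range(len(x)):
        for j in range(i + 1, len(x)):
            dx, dy = x[i] - x[j], y[i] - y[j]
            if dx == 0 and dy == 0:
                continue  # tied in both variables: excluded from all counts
            if dx == 0:
                only_x_tied += 1
            elif dy == 0:
                only_y_tied += 1
            elif dx * dy > 0:
                concordant += 1
            else:
                discordant += 1
    untied = concordant + discordant
    denom = math.sqrt((untied + only_x_tied) * (untied + only_y_tied))
    return (concordant - discordant) / denom


# Made-up example values: metric scores and subjective scores of 4 videos.
metric_scores = [0.2, 0.4, 0.5, 0.9]
subjective_scores = [1.0, 2.5, 2.0, 4.0]
print(pearson(metric_scores, subjective_scores))
print(spearman(metric_scores, subjective_scores))
print(kendall(metric_scores, subjective_scores))
```

PLCC measures linear agreement, while SROCC and KROCC depend only on the ordering of the scores, so a metric with a monotonic but nonlinear relation to subjective quality can score high on the rank coefficients and lower on PLCC.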

To simplify choosing the most appropriate metric, we also measured the runtime of each one. Runtime is reported as FPS (the number of frames the metric can process per second).

All results are presented as tables and “live” graphs to make it easier to compare metrics at a glance.

Dataset

Figure 1. The dataset sample

The dataset for this benchmark includes videos from four source datasets: SR Dataset, SR+Codecs Dataset, VSR Benchmark Dataset, and VUB Benchmark Dataset.

For all GT videos from these datasets, we calculated the following features: bitrate, colorfulness, FPS, resolution, and spatial (SI) and temporal (TI) information. For spatial complexity, we calculated the average size of x264-encoded I-frames normalized to the uncompressed frame size. For temporal complexity, we calculated the average P-frame size divided by the average I-frame size. Using a simple opponent color space representation^[2], we calculated the colorfulness^[3] of every video. Bitrate, FPS and resolution were obtained using ffmpeg. We then divided the whole collection into 30 clusters using the K-means algorithm and chose one video from each cluster.
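The complexity and colorfulness features described above can be sketched as follows. This is a hedged illustration, not the benchmark's code: the function names are hypothetical, the uncompressed frame size assumes 8-bit YUV 4:2:0 (1.5 bytes per pixel), and the colorfulness formula is the Hasler and Süsstrunk style measure on opponent channels, with absolute values taken as in the commonly used OpenCV tutorial formulation. How per-frame colorfulness is aggregated into a per-video value is not specified in the text and is left out.

```python
import math
from statistics import mean, pstdev


def spatial_complexity(i_frame_sizes, width, height, bytes_per_pixel=1.5):
    """Average encoded I-frame size divided by the uncompressed frame size.

    bytes_per_pixel=1.5 assumes 8-bit YUV 4:2:0 (an assumption).
    """
    uncompressed_size = width * height * bytes_per_pixel
    return mean(i_frame_sizes) / uncompressed_size


def temporal_complexity(p_frame_sizes, i_frame_sizes):
    """Average P-frame size divided by the average I-frame size."""
    return mean(p_frame_sizes) / mean(i_frame_sizes)


def colorfulness(pixels):
    """Colorfulness of one frame given as a list of (R, G, B) tuples."""
    # Opponent color channels: red-green and yellow-blue.
    rg = [abs(r - g) for r, g, b in pixels]
    yb = [abs(0.5 * (r + g) - b) for r, g, b in pixels]
    std_root = math.sqrt(pstdev(rg) ** 2 + pstdev(yb) ** 2)
    mean_root = math.sqrt(mean(rg) ** 2 + mean(yb) ** 2)
    return std_root + 0.3 * mean_root


# Made-up frame sizes in bytes for a 480x270 video.
si = spatial_complexity([40000, 42000, 38000], width=480, height=270)
ti = temporal_complexity([6000, 8000, 10000], [40000, 42000, 38000])
gray_frame = [(128, 128, 128)] * 4
vivid_frame = [(255, 0, 0), (0, 255, 0), (0, 0, 255), (255, 255, 0)]
print(si, ti, colorfulness(gray_frame), colorfulness(vivid_frame))
```

Both complexity measures are ratios of encoded sizes, so they are comparable across resolutions: a detailed frame compresses poorly and raises the spatial value, while strong motion makes P-frames large relative to I-frames and raises the temporal value.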

Thus, the final dataset consists of 30 reference (ground-truth, GT) videos and the 1187 distorted videos derived from them.
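The cluster-based selection of reference videos can be sketched as below. This is a simplified pure-Python K-means for illustration only. The text does not say how features were scaled or how the one video per cluster was chosen, so picking the video closest to each centroid is an assumption, and all names are hypothetical.

```python
import math
import random


def kmeans(points, k, iterations=100, seed=0):
    """Plain Lloyd's algorithm on a list of equal-length feature tuples."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # random distinct points as initial centroids

    def assign(current):
        # Label each point with the index of its nearest centroid.
        return [min(range(k), key=lambda c: math.dist(p, current[c]))
                for p in points]

    for _ in range(iterations):
        labels = assign(centroids)
        updated = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                updated.append(tuple(sum(dim) / len(members)
                                     for dim in zip(*members)))
            else:
                updated.append(centroids[c])  # keep the centroid of an empty cluster
        if updated == centroids:
            break  # converged
        centroids = updated
    return assign(centroids), centroids


def pick_representatives(points, labels, centroids):
    """Index of the point closest to each centroid (assumed selection rule)."""
    best = {}
    for index, (point, label) in enumerate(zip(points, labels)):
        distance = math.dist(point, centroids[label])
        if label not in best or distance < best[label][0]:
            best[label] = (distance, index)
    return sorted(index for _, index in best.values())


# Made-up 2-D feature vectors forming two obvious groups.
features = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centroids = kmeans(features, k=2)
print(pick_representatives(features, labels, centroids))
```

In practice the features (bitrate, colorfulness, FPS, resolution, SI, TI) have very different ranges, so some normalization before clustering would be needed for the distances to be meaningful; whether and how this was done is not stated in the text.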

Figure 2. The features distribution of the selected videos for each dataset

When creating the benchmark, one of the main goals was to obtain a dataset that is as close to complete and non-redundant as possible, so that results on it would be close to results on real data.

What is our dataset made of? It includes:

Indoor and outdoor videos
Videos with text, both overlaid on the video and captured by the camera
Scenes with moving objects and moving camera
Architecture, people and complex textures

The dataset consists of real videos (filmed with 2 cameras), video game footage, movies, cartoons, and dynamic ads. The completeness of the dataset is also supported by the fact that its videos were taken from 4 datasets covering different tasks.

Dataset	Number of GT videos taken	Number of distorted videos taken	Number of SR methods	Subjective Scores Count
SR Dataset	8	239	4	34510
SR+Codecs Dataset	3	450	9	65250
VSR Benchmark Dataset	7	258	32	46260
VUB Benchmark Dataset	12	240	41	22800
Overall	30	1187	46	168820

Figure 3. Distribution of taken videos by datasets

How did we bring our dataset closer to completeness?

The dataset covers many SR use cases thanks to its wide range of content types
The dataset contains videos with a variety of resolutions and FPS values (8, 24, 25, 30, 60), as well as high and low spatio-temporal complexity
Distorted videos were obtained using 46 SR methods; some of them were preprocessed with 5 codecs (aomenc, vvenc, x264, x265, uavs3es) at different bitrates and QP values
The dataset was manually checked for redundancy

Videos from the benchmarks are crops of FullHD videos, because the subjective comparison was performed on crops. Therefore, the resolution of all videos in the resulting dataset is low.

The dataset contains videos with the following resolutions:

480×270
200×170
110×80
320×270
120×90
180×150
130×100
360×270

This covers as many formats as possible in which reference videos may be available.

Metrics

The benchmark contains >80 metrics, both NR and FR, suitable for different use cases. Some of them take video as input (Video Quality Assessment (VQA) metrics), while others take images (Image Quality Assessment (IQA) metrics).

Correlation with Subjective Scores

All metrics were tested on 1187 distorted videos (>100000 frames).

Image quality metrics were calculated on each frame separately and then averaged over all frames. Pearson, Spearman and Kendall correlation coefficients were then calculated separately for each group of videos (videos with the same GT). The correlation on each of the four datasets mentioned earlier is the mean of the correlations over all groups from it.

The correlation on the full dataset was also calculated as the mean of the correlations on these four datasets (this is only a rough estimate; it is better to look at the correlations on each dataset separately).
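The aggregation hierarchy described above (frames, then videos, then groups with the same GT, then datasets, then the full dataset) can be sketched as follows. This is one reading of the procedure under stated assumptions, not the benchmark's code: the record layout and function names are hypothetical, and Pearson correlation stands in for any of the three coefficients.

```python
import math
from collections import defaultdict
from statistics import mean


def video_score(frame_scores):
    """IQA metrics: average the per-frame values over all frames."""
    return mean(frame_scores)


def pearson(x, y):
    """Pearson correlation, used here as a stand-in for PLCC/SROCC/KROCC."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def benchmark_correlation(records, corr=pearson):
    """records: dicts with keys 'dataset', 'gt', 'metric', 'subjective'.

    Returns (per-dataset correlations, overall correlation).
    """
    # 1. Group distorted videos by their dataset and GT video.
    groups = defaultdict(lambda: ([], []))
    for record in records:
        metric_values, subjective_values = groups[(record["dataset"], record["gt"])]
        metric_values.append(record["metric"])
        subjective_values.append(record["subjective"])

    # 2. One correlation per group, collected per dataset.
    per_dataset = defaultdict(list)
    for (dataset, _gt), (metric_values, subjective_values) in groups.items():
        per_dataset[dataset].append(corr(metric_values, subjective_values))

    # 3. Dataset correlation is the mean over its groups;
    #    the overall value is the mean over datasets.
    dataset_corr = {name: mean(values) for name, values in per_dataset.items()}
    return dataset_corr, mean(dataset_corr.values())


# Made-up records: two datasets, three GT groups.
records = []
for dataset, gt, subjective in [("A", "g1", [1, 2, 3]),
                                ("A", "g2", [3, 2, 1]),
                                ("B", "g3", [2, 4, 6])]:
    for frames, s in zip([[1, 1], [2, 2], [3, 3]], subjective):
        records.append({"dataset": dataset, "gt": gt,
                        "metric": video_score(frames), "subjective": s})
print(benchmark_correlation(records))
```

Computing the correlation within each GT group first means a metric is judged on how well it ranks different distortions of the same content, rather than on differences between contents.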

Runtime

Figure 4. Videos on which metrics' runtime was tested

Overall, we chose 5 GT videos and 1 distorted video for each of them.

For IQA metrics, the runtime is the average processing time per frame.

For VQA metrics, the per-frame runtime was calculated as the metric runtime on the full video divided by the number of frames.

We then measured the runtime for each GT video 3 times and took the minimum time for each video. The final metric runtime was calculated as the mean of the runtimes over all GT videos.
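The timing protocol above can be sketched like this. It is a minimal illustration with hypothetical names: `run_metric` stands for any callable that processes a whole video, and converting the mean per-frame time to FPS by taking its reciprocal is an assumption about how the reported FPS is derived.

```python
import time
from statistics import mean


def best_time_per_frame(run_metric, video, n_frames, repeats=3):
    """Run the metric `repeats` times on one video and keep the fastest run."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_metric(video)  # hypothetical callable that processes the full video
        timings.append(time.perf_counter() - start)
    return min(timings) / n_frames


def aggregate_runtime(runs_per_video, frame_counts):
    """Combine raw timings into the final runtime.

    runs_per_video: one list of full-video run times (seconds) per GT video.
    frame_counts:   number of frames in each of those videos.
    Returns (mean seconds per frame, frames per second).
    """
    per_frame = [min(runs) / frames
                 for runs, frames in zip(runs_per_video, frame_counts)]
    seconds_per_frame = mean(per_frame)
    return seconds_per_frame, 1.0 / seconds_per_frame


# Made-up timings for two videos of 100 frames each, 3 runs per video.
seconds, fps = aggregate_runtime([[2.0, 1.0, 3.0], [4.0, 5.0, 6.0]], [100, 100])
print(seconds, fps)
```

Taking the minimum over repeated runs reduces the influence of one-off delays such as model loading or cache warm-up, and dividing the full-video time by the frame count makes IQA and VQA metrics comparable on the same per-frame scale.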

Calculations were made using the following hardware:

GPU: NVIDIA RTX A6000
CPU: 64-CPU cluster, AMD EPYC 7532, 32×2.4 GHz

Future work

Expand the dataset
Run more metrics
Create our own metric
Publish the paper about our work
Research metrics’ mistakes and Super-Resolution artifacts
Add some new interesting tracks to the benchmark

References

[1] https://videoprocessing.ai/metrics/ways-of-cheating-on-popular-objective-metrics.html
[2] C. A. Bouman, “Opponent Color Spaces”, in Digital Image Processing, 2023.
[3] https://pyimagesearch.com/2017/06/05/computing-image-colorfulness-with-opencv-and-python/

24 Jun 2024

Video processing, compression and quality research group Based in MSU Graphics & Media Laboratory