The Methodology of the Super-Resolution Quality Metrics Benchmark
Images and videos are often distributed at less than ideal quality, which makes the task of Super-Resolution (SR) important and relevant.
Super-Resolution (SR) is the task of reconstructing high-resolution (HR) images or videos from low-resolution (LR) inputs.
In recent years, hundreds of SR methods have been proposed. However, models trained on particular data often distort a video or image instead of improving it, especially when the source is of poor quality. It is therefore important to have an objective metric that lets users easily evaluate the performance of a given method. The main goal of such a metric is agreement with people's subjective assessments.
Is there a universal metric? So far, definitely not. PSNR and SSIM, used for many years, have long been shown to be inadequate. Metrics trained with deep learning on large datasets are now gaining popularity. But it is impossible to collect in one dataset absolutely all the cases to which the SR task is applicable, just as it is impossible to devise a universal method by which a metric would always work. You can therefore only choose a metric suited to your specific tasks.
The goal of our benchmark is to evaluate the performance of existing image and video quality metrics on the SR task and to find the best methods.
Our Super-Resolution Quality Metrics Benchmark includes more than 60 metrics, both FR (requiring reference data) and NR (working directly on distorted images and videos). All of them were tested on videos with varying resolution, FPS, bitrate, colorfulness, and spatial and temporal information; this selection covers as many SR use cases as possible. We evaluate each metric by correlating its scores with the subjective scores of each distorted video, using three correlation coefficients: Pearson's linear correlation coefficient (PLCC), Spearman's rank-order correlation coefficient (SROCC), and Kendall's rank-order correlation coefficient (KROCC).
To simplify choosing the most appropriate metric, we also measured the runtime of each one. Runtime is reported as an FPS value (how many frames the metric can process per second).
All results were presented in the form of tables and “live” graphs to make it easier to compare metrics at a glance.
Figure 1. The dataset sample
The dataset for this benchmark includes videos from the following datasets:
For all GT videos from these datasets, we computed the following features: bitrate, colorfulness, FPS, resolution, and spatial (SI) and temporal (TI) information. For spatial complexity, we used the average size of x264-encoded I-frames normalized by the uncompressed frame size; for temporal complexity, the average P-frame size divided by the average I-frame size. Colorfulness was computed using a simple opponent color-space representation. Bitrate, FPS, and resolution were obtained with ffmpeg. We then divided the whole collection into 30 clusters using the K-means algorithm and chose one video from each cluster.
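The clustering-based selection step can be sketched as follows. This is a minimal illustration, not the benchmark's actual code: the feature matrix here is filled with random stand-ins, while the real features (bitrate, colorfulness, FPS, resolution, SI, TI) come from x264 statistics and ffmpeg as described above.

```python
import numpy as np
from scipy.cluster.vq import kmeans2

def select_representatives(features, n_clusters=30, seed=0):
    """Cluster videos by their feature vectors and return one video index
    per cluster (the member closest to its cluster centroid)."""
    # Standardize: bitrate, FPS, etc. live on very different scales.
    scaled = (features - features.mean(axis=0)) / features.std(axis=0)
    centroids, labels = kmeans2(scaled, n_clusters, minit="++", seed=seed)
    chosen = []
    for c in np.unique(labels):
        members = np.where(labels == c)[0]
        dists = np.linalg.norm(scaled[members] - centroids[c], axis=1)
        chosen.append(int(members[np.argmin(dists)]))
    return sorted(chosen)

# Toy feature matrix: one row per candidate GT video with
# [bitrate, colorfulness, FPS, height, SI, TI] (random stand-ins).
rng = np.random.default_rng(42)
features = rng.random((120, 6))
picked = select_representatives(features, n_clusters=30)
print(len(picked), "videos selected")
```

Picking the member nearest to each centroid (rather than the centroid itself) guarantees that every representative is a real video from the collection.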
Thus, the final dataset consists of 30 reference (ground-truth, GT) videos, which correspond to 1187 distorted videos.
Figure 2. The features distribution of the selected videos for each dataset
One of our main tasks when creating the benchmark was to build a dataset that is as complete and non-redundant as possible, so that results on it would be close to results on real data.
What is our dataset made of? It includes:
Indoor and outdoor videos
Videos with text, both superimposed on the video and filmed on camera
Scenes with moving objects and moving camera
Architecture, people and complex textures
The dataset consists of real videos (filmed with two cameras), video-game footage, movies, cartoons, and dynamic ads. Its completeness is further supported by the fact that its videos were taken from four datasets covering different tasks.
| Dataset | Number of GT videos | … | … | … |
|---|---|---|---|---|
| VSR Benchmark Dataset | 7 | 258 | 32 | 46260 |
| VUB Benchmark Dataset | 12 | 240 | 41 | 22800 |
Figure 3. Distribution of taken videos by datasets
How did we bring our dataset closer to completeness?
The dataset covers a large number of use cases in the field of SR due to the large number of content types
The dataset contains videos with widely varying resolutions; FPS values of 8, 24, 25, 30, and 60; and both high and low spatio-temporal complexity
Distorted videos were obtained using 46 SR methods; some inputs were additionally compressed with 5 codecs (aomenc, vvenc, x264, x265, uavs3es) at different bitrates and QP values
The dataset was manually checked for redundancy
The videos taken from these benchmarks are crops of FullHD videos, since the subjective comparison was performed on crops. Therefore, the resolution of all videos in the resulting dataset is low.
The dataset contains videos with the following resolutions:
This ensures coverage of as many formats as possible in which reference videos may be available.
As mentioned above, the benchmark contains more than 60 metrics, both NR and FR, suitable for different use cases. Some take video as input (Video Quality Assessment (VQA) metrics), while others take images (Image Quality Assessment (IQA) metrics).
Correlation with Subjective Scores
All metrics were tested on 1187 distorted videos (more than 100,000 frames).
Image quality metric values were computed on each frame separately and then averaged over all frames. Pearson, Spearman, and Kendall correlation coefficients were then computed separately for each group of videos (videos sharing the same GT). The correlation for each of the four datasets mentioned earlier is the mean over all its videos.
The correlation on the full dataset was computed as the mean of the correlations on these four datasets (a rough estimate; it is better to examine the correlations on each dataset separately).
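The per-group correlation protocol described above can be sketched as follows. The score arrays are toy values; `scipy.stats` provides the three coefficients (PLCC, SROCC, KROCC).

```python
import numpy as np
from scipy import stats

def group_correlations(groups):
    """groups: list of (metric_scores, subjective_scores) pairs,
    one pair per GT video group. Returns the mean PLCC, SROCC, and
    KROCC over all groups."""
    plcc, srocc, krocc = [], [], []
    for metric, subjective in groups:
        plcc.append(stats.pearsonr(metric, subjective)[0])    # PLCC
        srocc.append(stats.spearmanr(metric, subjective)[0])  # SROCC
        krocc.append(stats.kendalltau(metric, subjective)[0]) # KROCC
    return float(np.mean(plcc)), float(np.mean(srocc)), float(np.mean(krocc))

# Toy example: two GT groups with a few distorted videos each.
groups = [
    (np.array([0.9, 0.7, 0.4, 0.2]), np.array([4.5, 3.8, 2.1, 1.0])),
    (np.array([0.8, 0.6, 0.5]),      np.array([4.0, 3.9, 2.5])),
]
plcc, srocc, krocc = group_correlations(groups)
print(plcc, srocc, krocc)
```

Averaging per-group correlations, rather than pooling all scores, prevents differences in content difficulty between GT videos from masking how well a metric ranks the distortions within each group.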
Figure 4. Videos on which metrics' runtime was tested
Overall, we chose 5 GT videos and 1 distorted video for each of them.
For IQA metrics, the running time is the average processing time per frame.
For VQA metrics, the per-frame time was computed as the metric's runtime on the full video divided by the number of frames.
We then measured the runtime on each GT video 3 times and kept the minimum for each video. The final metric runtime is the mean of the runtimes over all GT videos.
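A minimal sketch of this timing protocol, assuming a per-video metric function; the "metric" and "frames" below are trivial stand-ins, not real quality metrics.

```python
import time

def _timed(fn, arg):
    """Wall-clock time of a single call, in seconds."""
    start = time.perf_counter()
    fn(arg)
    return time.perf_counter() - start

def metric_fps(metric_fn, videos, runs=3):
    """videos: list of (frames, n_frames) pairs, one per GT video.
    Runs the metric `runs` times per video, keeps the minimum time
    (the least noisy estimate), averages the per-frame time over
    videos, and returns frames processed per second."""
    per_frame_times = []
    for frames, n_frames in videos:
        best = min(_timed(metric_fn, frames) for _ in range(runs))
        per_frame_times.append(best / n_frames)
    mean_time = sum(per_frame_times) / len(per_frame_times)
    return 1.0 / mean_time

# Toy "metric": sums pixel values of stand-in frames (lists of numbers).
videos = [([float(i) for i in range(10_000)], 100) for _ in range(5)]
fps = metric_fps(sum, videos)
print(f"{fps:.0f} FPS")
```

Taking the minimum of repeated runs is a common way to discount OS scheduling noise and caching effects, which only ever inflate measured times.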
Calculations were made using the following hardware:
- GPU: NVIDIA RTX A6000
- CPU: 64-CPU cluster, AMD EPYC 7532, 32 × 2.4 GHz
Future Plans
- Expand the dataset
- Run more metrics
- Create our own metric
- Publish the paper about our work
- Research metrics’ mistakes and Super-Resolution artifacts
- Add some new interesting tracks to the benchmark
C. A. Bouman, “Opponent Color Spaces”, in Digital Image Processing, 2023.