Evaluation methodology of MSU Super-Resolution for Video Compression Benchmark 2021

You can read the methodology below or download the presentation in PDF format here.
You can also view it in Google Slides here.

Benchmark Statistics





Valid answers in subjective comparison
















Table 1: Benchmark statistics

Problem definition

Super-Resolution is the process of calculating high-resolution samples from their low-resolution counterparts. Due to the rapid development of Video Super-Resolution technologies, they are used in video codecs.

Different SR models have different bitrate/quality tradeoffs when working with compressed video sequences. If two SRs produce results of the same subjective quality, the one that works with the lower bitrate input is considered to be better. Our benchmark aims to find the best Video Super-Resolution algorithm based on this criterion.

We are currently testing only 2x upscale, but we plan to test 4x upscale as well.


Our dataset is constantly being updated. You can see the current number of sequences in the dataset in Table 1. Each FullHD video in YUV format is compressed at 7 different bitrates using 5 different codecs. Videos were taken from MSU codecs comparison[1] 2019 and 2020 test sets. The dataset contains videos in FullHD resolution with FPS from 24 to 30.

All videos have low SI/TI values and simple textures. We selected them this way to minimize the compression artifacts that may occur, so that restoration of details remains possible.

Figure 1. Segments from dataset







Animation clip: 2D animation advertising clip drawn in bright colors; 30 fps; 104.58 Mbps
A man working with an oven in the workshop; 30 fps; 108.14 Mbps
Animation clip, slow camera zoom; 30 fps; 104.56 Mbps
A scene with professor from international affairs school; 24 fps; 165.41 Mbps
Skiing learning: People are being trained to ski in slow motion; 24 fps; 107.59 Mbps
Street show: Two men sing, dance and perform some acrobatics on a street; 24 fps; 108.40 Mbps
A couple walks slowly at the ceremony; 24 fps; 123.96 Mbps

Table 2: Dataset characteristics



PSNR is a commonly used metric for reconstruction quality for images and video. In our benchmark, we calculate PSNR on the Y component in YUV colorspace.

Since some Super-Resolution models can generate images with a global shift relative to GT, we calculate shifted PSNR. We check each shift in the range [-3, 3] (including subpixel shifts) for both axes and select the highest PSNR value among these shifts. We noticed that SRs’ results on the same video decoded with different bitrates usually have the same global shift. Thus we calculate the best shift only once for each video.

For metric calculation, we use MSU VQMT[2].
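The shift search described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the MSU VQMT implementation: it tries integer shifts only (the benchmark also checks subpixel shifts), it compares only the central region that stays valid for every candidate shift, and the names `psnr` and `shifted_psnr` are hypothetical.

```python
import numpy as np

def psnr(a, b, peak=255.0):
    """PSNR between two equally sized Y planes (8-bit range assumed)."""
    mse = np.mean((a.astype(np.float64) - b.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else float(10.0 * np.log10(peak ** 2 / mse))

def shifted_psnr(gt_y, sr_y, max_shift=3):
    """Return (best PSNR, (dy, dx)) over integer shifts in [-max_shift, max_shift]."""
    h, w = gt_y.shape
    m = max_shift
    # Central GT region that has a counterpart in the SR frame for every shift.
    gt_core = gt_y[m:h - m, m:w - m]
    best_value, best_shift = -np.inf, (0, 0)
    for dy in range(-m, m + 1):
        for dx in range(-m, m + 1):
            candidate = sr_y[m + dy:h - m + dy, m + dx:w - m + dx]
            value = psnr(gt_core, candidate)
            if value > best_value:
                best_value, best_shift = value, (dy, dx)
    return best_value, best_shift
```

Because the best shift is found only once per video, a caller could run `shifted_psnr` on one frame pair and reuse the returned `(dy, dx)` for the remaining bitrates of that video.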


SSIM is a metric based on structural similarity. In our benchmark, we use Multiscale SSIM (MS-SSIM), which is conducted over multiple scales through a process of multiple stages of sub-sampling. We calculate MS-SSIM on all 3 components in the YUV colorspace, and the metric result is calculated as (4Y + U + V) / 6 [13], where Y, U, and V are the MS-SSIM values on Y, U, and V components respectively.

MS-SSIM results also rely on the shift of frames. We take the optimal subpixel shift found for PSNR and apply it to the input frames before calculating MS-SSIM.
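As a small worked example of the weighting formula above, the following sketch combines per-component scores with the 4:1:1 weights. The function name `combine_yuv_scores` and the sample values are hypothetical and only illustrate the arithmetic.

```python
def combine_yuv_scores(y, u, v):
    """Weighted average (4Y + U + V) / 6 of per-component metric values."""
    return (4.0 * y + u + v) / 6.0

# Luma dominates: a high Y score outweighs lower chroma scores.
score = combine_yuv_scores(0.90, 0.60, 0.60)
```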

For metric calculation, we use MSU VQMT[2].


VMAF is a perceptual video quality assessment algorithm developed by Netflix. In our benchmark, we calculate VMAF on the Y component in YUV colorspace. We use both VMAF and VMAF NEG (no enhancement gain) in our benchmark.

For metric calculation, we use MSU VQMT[2]. For VMAF, we use the -set "disable_clip=True" option of MSU VQMT.

Shifted VMAF and VMAF NEG give less than 1% gain relative to the unshifted versions, so we use the unshifted versions in our benchmark. Figures 2a and 2b show the gain that each model gets from using shifted VMAF and VMAF NEG relative to the unshifted versions.

Figure 2a. Shifted VMAF gain of each model

Figure 2b. Shifted VMAF NEG gain of each model


LPIPS (Learned Perceptual Image Patch Similarity) evaluates the distance between image patches. Higher means further/more different. Lower means more similar. In our benchmark, we subtract LPIPS value from 1. Thus, more similar images have higher metric values.

To calculate LPIPS we use Perceptual Similarity Metric implementation[3] proposed in The Unreasonable Effectiveness of Deep Features as a Perceptual Metric[4].

We have also noticed that shifted LPIPS gives less than 1% gain relative to the unshifted version, as you can see in Figure 3. That's why we calculate LPIPS without shift compensation in our benchmark.

Figure 3. Shifted LPIPS gain of each model


ERQAv1.0 (Edge Restoration Quality Assessment, version 1.0) estimates how well a model has restored edges of the high-resolution frame. This metric was developed for MSU Video Super-Resolution Benchmark 2021[5].

First, we find edges in both the output and GT frames. To do this, we use the OpenCV implementation[6] of the Canny algorithm[7]. The threshold for the initial finding of strong edges is set to 200 and the threshold for edge linking is set to 100. Then we compare these edges using an F1-score. To compensate for the one-pixel shift, edges that are no more than one pixel away from the GT's are considered true-positive.

More information about this metric can be found at the Evaluation Methodology of MSU Video Super-Resolution Benchmark[9].

Figure 4. ERQAv1.0 visualization.
White pixels are True Positive, red pixels are False Positive, blue pixels are False Negative
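The edge comparison step can be illustrated with the sketch below. It is a simplified approximation and not the ERQA reference implementation: it takes binary edge maps as input (for example, ones produced by OpenCV's `cv2.Canny` with thresholds 100 and 200), and it assumes "no more than one pixel away" means the 8-neighbourhood. The names `dilate_3x3` and `edge_f1` are hypothetical.

```python
import numpy as np

def dilate_3x3(mask):
    """Mark every pixel within one pixel (8-neighbourhood) of a True pixel."""
    h, w = mask.shape
    padded = np.pad(mask, 1, mode="constant", constant_values=False)
    out = np.zeros((h, w), dtype=bool)
    for dy in range(3):
        for dx in range(3):
            out |= padded[dy:dy + h, dx:dx + w]
    return out

def edge_f1(gt_edges, sr_edges):
    """F1-score between two binary edge maps with a one-pixel tolerance."""
    gt = np.asarray(gt_edges).astype(bool)
    sr = np.asarray(sr_edges).astype(bool)
    if not gt.any() or not sr.any():
        return 0.0
    # SR edge pixels that lie next to some GT edge count towards precision.
    precision = np.count_nonzero(sr & dilate_3x3(gt)) / np.count_nonzero(sr)
    # GT edge pixels that have an SR edge nearby count towards recall.
    recall = np.count_nonzero(gt & dilate_3x3(sr)) / np.count_nonzero(gt)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

With this tolerance, an edge displaced by a single pixel still scores as fully restored, while an edge displaced by three pixels scores zero.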


To compress GT videos, we use the following codecs:






FFmpeg version 4.2.4



FFmpeg version 4.2.4



FFmpeg version 4.2.4



Fraunhofer Versatile Video Encoder[11]




Table 3: Codecs' description

For x264, x265, aomenc, and VVenC, we use the -preset="medium" option.


First, we downscale our FullHD GT video to 960×540 resolution using FFmpeg. We use the flags::gauss option to keep more information in the resulting video. Then, we compress the scaled video at seven different bitrates (approximately 100, 300, 600, 1000, 2000, 4000, and 6000 kbps). The resulting videos are transcoded to .png sequences and given as input to a Super-Resolution model.

Our benchmark tests 2x upscale; however, some Super-Resolution models can only do 4x upscale. In this case, we downscale these models’ results by a factor of two using FFmpeg with the flags::gauss option.

Figure 5. SR results evaluation steps
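The preparation steps above could be scripted roughly as follows. This sketch only builds FFmpeg command lines and does not run them. Several details are assumptions rather than the benchmark's actual settings: the raw input is taken to be 8-bit yuv420p, the gauss option is written in FFmpeg's scale-filter syntax as `flags=gauss`, downscaling and encoding are merged into one command, only x264 (`libx264`) is shown, and the file names and the helpers `build_encode_commands` and `build_png_command` are hypothetical.

```python
BITRATES_KBPS = [100, 300, 600, 1000, 2000, 4000, 6000]

def build_encode_commands(src_yuv, fps, out_dir="compressed"):
    """One FFmpeg command per target bitrate: downscale to 960x540, then encode."""
    commands = []
    for kbps in BITRATES_KBPS:
        commands.append([
            "ffmpeg", "-y",
            # Raw YUV has no header, so the input geometry must be stated.
            "-f", "rawvideo", "-pixel_format", "yuv420p",
            "-video_size", "1920x1080", "-framerate", str(fps),
            "-i", src_yuv,
            # Gaussian scaler for the 2x downscale.
            "-vf", "scale=960:540:flags=gauss",
            "-c:v", "libx264", "-preset", "medium",
            "-b:v", f"{kbps}k",
            f"{out_dir}/x264_{kbps}k.mp4",
        ])
    return commands

def build_png_command(video_path, frames_dir):
    """Decode a compressed video into a numbered .png sequence."""
    return ["ffmpeg", "-y", "-i", video_path, f"{frames_dir}/%05d.png"]
```

Each returned list can be passed to `subprocess.run(cmd, check=True)` on a machine where FFmpeg is installed.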

We also compress FullHD GT video without scaling to make “only compressed” results.

Figure 6. "only compressed" evaluation steps

Next, we calculate each metric for each result (including “only compressed”). We calculate shifted Y-PSNR, shifted YUV-MS-SSIM, Y-VMAF, Y-VMAF NEG, LPIPS, and ERQAv1.0. Then, we build RD curves (see Figure 7) and calculate BSQ-rate[8] (bitrate-for-the-same-quality rate) for each metric (see Figure 8). We take the “only compressed” result as a reference during the calculations.

Figure 7. RD curve

Figure 8. BSQ-rate
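To make the idea of "bitrate for the same quality" concrete, the sketch below compares two RD curves by averaging their bitrate ratio over the quality range they share. It is not the published BSQ-rate formula from [8]. It is a simplified stand-in that assumes quality increases with bitrate, interpolates log-bitrate linearly in quality, and uses the hypothetical name `same_quality_bitrate_ratio`.

```python
import numpy as np

def same_quality_bitrate_ratio(ref_rd, test_rd, samples=100):
    """Average test/reference bitrate ratio at equal quality.

    ref_rd and test_rd are lists of (bitrate_kbps, quality) points.
    A value below 1 means the test curve needs less bitrate than the reference.
    """
    ref = sorted(ref_rd, key=lambda point: point[1])
    test = sorted(test_rd, key=lambda point: point[1])
    ref_q = np.array([q for _, q in ref], dtype=np.float64)
    test_q = np.array([q for _, q in test], dtype=np.float64)
    ref_log_rate = np.log([r for r, _ in ref])
    test_log_rate = np.log([r for r, _ in test])
    # Only the quality interval covered by both curves can be compared.
    low = max(ref_q[0], test_q[0])
    high = min(ref_q[-1], test_q[-1])
    if low >= high:
        raise ValueError("RD curves do not overlap in quality")
    grid = np.linspace(low, high, samples)
    diff = np.interp(grid, test_q, test_log_rate) - np.interp(grid, ref_q, ref_log_rate)
    return float(np.exp(diff.mean()))
```

In this sketch the “only compressed” curve would play the role of `ref_rd`, and the curve of an SR model would be `test_rd`.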

There are 3 ways we calculate BSQ-rate:

  1. Relative to x264 - for each codec we calculate the average BSQ-rate relative to “only compressed” made by x264 codec;
  2. Relative to self - for each codec we calculate the average BSQ-rate relative to “only compressed” made by this codec;
  3. Max. relative to self - for each codec we calculate BSQ-rate relative to “only compressed” made by this codec and take the best result of all sequences.

Subjective comparison

For the subjective comparison, we chose one codec (x264) and three bitrates (1000, 2000, and 4000 kbps). We cut sequences of 24 frames and converted them to 8 fps videos using FFmpeg. We then took 2 crops with a resolution of 320×270 from each video and ran a side-by-side subjective comparison of all these pieces on Subjectify.us[10]. Each of the 1934 participants saw 25 video pairs and had to choose which video in each pair was clearer (an “indistinguishable” option was also available). Three verification questions protected against random answers and bots. You can see the current number of valid answers in Table 1. We used these valid answers to predict the ranking using the Bradley-Terry model.
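For readers unfamiliar with the Bradley-Terry model, the sketch below fits it to a matrix of pairwise wins with the standard iterative (minorization-maximization) update. It is an illustration only: how the benchmark handles “indistinguishable” answers is not stated here, so the sketch ignores them, and it assumes every item has taken part in at least one comparison. The name `bradley_terry` and the sample counts are hypothetical.

```python
import numpy as np

def bradley_terry(wins, iterations=200):
    """Estimate Bradley-Terry strengths from a pairwise win matrix.

    wins[i][j] is the number of times item i was preferred over item j.
    Returns strengths normalized to sum to 1 (higher means preferred more often).
    """
    wins = np.asarray(wins, dtype=np.float64)
    n = wins.shape[0]
    games = wins + wins.T            # comparisons held between each pair
    total_wins = wins.sum(axis=1)
    p = np.full(n, 1.0 / n)
    for _ in range(iterations):
        denom = games / (p[:, None] + p[None, :])
        np.fill_diagonal(denom, 0.0)  # an item is never compared with itself
        p = total_wins / denom.sum(axis=1)
        p /= p.sum()
    return p

# Hypothetical votes for three methods, ten comparisons per pair.
strengths = bradley_terry([[0, 8, 9], [2, 0, 7], [1, 3, 0]])
ranking = np.argsort(-strengths)     # indices from most to least preferred
```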

Figure 9. Crops used for subjective comparison

Subjective BSQ-rate calculation

To calculate the subjective BSQ-rate, we extrapolate the subjective results using the most similar objective metric. To do this, we take the subjective results at the 3 bitrates used for the subjective comparison, find the objective metric that has the highest correlation with the subjective one at the same bitrates, and extrapolate the subjective metric using this objective metric as a reference (see Figures 10a and 10b).

Figure 10a. Subjective metric extrapolation

Figure 10b. The most similar objective metric
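The extrapolation procedure can be sketched as below. The text does not specify the correlation type or the mapping used, so this illustration assumes Pearson correlation and a linear fit; the function name `extrapolate_subjective` and the example numbers are hypothetical.

```python
import numpy as np

def extrapolate_subjective(subjective, objective_by_metric, subjective_idx):
    """Extend subjective scores to all bitrates via the best-correlated metric.

    subjective:          scores at the bitrates used in the subjective study
    objective_by_metric: {metric name: values at all bitrates}
    subjective_idx:      positions of the subjective bitrates in the full list
    """
    subjective = np.asarray(subjective, dtype=np.float64)
    best_name, best_corr = None, -np.inf
    for name, values in objective_by_metric.items():
        at_subjective = np.asarray(values, dtype=np.float64)[subjective_idx]
        corr = np.corrcoef(subjective, at_subjective)[0, 1]
        if corr > best_corr:
            best_name, best_corr = name, corr
    reference = np.asarray(objective_by_metric[best_name], dtype=np.float64)
    # Linear map from the reference metric to the subjective scale.
    slope, intercept = np.polyfit(reference[subjective_idx], subjective, 1)
    return best_name, slope * reference + intercept

# Seven bitrates in total; subjective scores exist for positions 3, 4 and 5.
metrics = {
    "psnr": [24, 26, 28, 30, 32, 34, 36],
    "vmaf": [10, 20, 30, 50, 80, 60, 90],
}
name, scores = extrapolate_subjective([1.0, 2.0, 3.0], metrics, [3, 4, 5])
```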

Computational complexity

We ran each model on an NVIDIA Titan RTX and measured its runtime on the same test sequence:

We calculate seconds per iteration (s/it) as the total model runtime divided by the number of frames in the sequence.
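The s/it figure amounts to the following measurement. This is a generic timing sketch, not the benchmark's harness: `run_model` stands for any hypothetical callable that processes the whole frame list, and for a GPU model it would need to return only after all work has finished for the timing to be meaningful.

```python
import time

def seconds_per_frame(run_model, frames):
    """Total runtime of run_model(frames) divided by the number of frames."""
    start = time.perf_counter()
    run_model(frames)
    elapsed = time.perf_counter() - start
    return elapsed / len(frames)
```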


  1. https://compression.ru/video/codec_comparison/2021/
  2. http://compression.ru/video/quality_measure/video_measurement_tool.html
  3. https://github.com/richzhang/PerceptualSimilarity
  4. R. Zhang, P. Isola, A. A. Efros, E. Shechtman, O. Wang, "The unreasonable effectiveness of deep features as a perceptual metric," In Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp.586-595.
  5. https://videoprocessing.ai/benchmarks/video-super-resolution.html
  6. https://docs.opencv.org/3.4/dd/d1a/group__imgproc__feature.html#ga04723e007ed888ddf11d9ba04e2232de
  7. https://en.wikipedia.org/wiki/Canny_edge_detector
  8. A. V. Zvezdakova, D. L. Kulikov, S. V. Zvezdakov, D. S. Vatolin, "BSQ-rate: a new approach for video-codec performance comparison and drawbacks of current solutions," Programming and computer software, vol. 46, 2020, pp.183-194.
  9. https://videoprocessing.ai/benchmarks/video-super-resolution-methodology.html
  10. http://app.subjectify.us/
  11. https://github.com/fraunhoferhhi/vvenc
  12. https://github.com/uavs3/uavs3e
  13. A. Antsiferova, A. Yakovenko, N. Safonov, D. Kulikov, A. Gushin, D. Vatolin, "Objective video quality metrics application to video codecs comparisons: choosing the best for subjective quality estimation," arXiv preprint arXiv:2107.10220, 2021.
31 Aug 2021