Evaluation methodology of MSU Super-Resolution for Video Compression Benchmark 2022

Benchmark Statistics

| Date | Participants | Sequences | Codecs | Valid answers in subjective comparison |
|---|---|---|---|---|
| 12.04.2022 | 18 | 9 | 5 | 120 316 |
| 25.03.2022 | 18 | 9 | 5 | 57 943 |
| 14.03.2022 | 15 | 9 | 5 | 57 943 |
| 12.10.2021 | 13 | 7 | 5 | 57 943 |
| 21.09.2021 | 13 | 5 | 5 | 57 943 |
| 31.08.2021 | 13 | 3 | 5 | 57 943 |

Table 1: Benchmark statistics

Problem definition

Super-Resolution is the process of calculating high-resolution samples from their low-resolution counterparts. Due to the rapid development of Video Super-Resolution technologies, they are used in video codecs.

Different SR models have different bitrate/quality tradeoffs when working with compressed video sequences. If two SRs produce results of the same subjective quality, the one that works with the lower bitrate input is considered to be better. Our benchmark aims to find the best Video Super-Resolution algorithm based on this criterion.

We are currently testing only 4× upscale.

Dataset

Our dataset is constantly being updated. You can see the current number of sequences in the dataset in Table 1. Each FullHD video in YUV format is compressed at 6 different bitrates using 5 different codecs. Videos were recorded specifically for the benchmark or taken from the MSU codecs comparison[1] 2019 and 2020 test sets. The dataset contains videos in FullHD resolution with FPS from 24 to 30.

All videos have low SI/TI values and simple textures. We chose such content to minimize compression artifacts, so that restoration of details remains possible.

Note that 3 of these 9 videos are private, which means that the results for these videos are available only in the upcoming enterprise report.

Figure 1. Segments from dataset

Metrics

PSNR

PSNR is a commonly used metric for reconstruction quality for images and video. In our benchmark, we calculate PSNR on the Y component in YUV colorspace.

Since some Super-Resolution models can generate images with a global shift relative to GT, we calculate shifted PSNR. We check each shift in the range [-3, 3] (including subpixel shifts) for both axes and select the highest PSNR value among these shifts. We noticed that SR results on the same video compressed at different bitrates usually have the same global shift. Thus we calculate the best shift only once for each video.

For metric calculation, we use MSU VQMT[2].
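
The shift search described above can be sketched as follows. This is a minimal illustration, not the MSU VQMT implementation: it assumes the Y planes are already available as NumPy arrays, tries integer shifts only (the benchmark also checks subpixel shifts), and the function names `psnr` and `best_shift_psnr` are hypothetical.

```python
import numpy as np

def psnr(reference, distorted, peak=255.0):
    """PSNR in dB between two equally sized arrays (infinite if identical)."""
    diff = reference.astype(np.float64) - distorted.astype(np.float64)
    mse = np.mean(diff ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(peak ** 2 / mse)

def best_shift_psnr(gt_y, sr_y, max_shift=3):
    """Try every integer shift in [-max_shift, max_shift] on both axes.

    Returns (best PSNR, (dy, dx)). A border of max_shift pixels is cropped
    so that every candidate shift is compared on the same GT region.
    Assumption: subpixel shifts are omitted in this sketch.
    """
    h, w = gt_y.shape
    m = max_shift
    gt_crop = gt_y[m:h - m, m:w - m]
    best_value, best_shift = -np.inf, (0, 0)
    for dy in range(-m, m + 1):
        for dx in range(-m, m + 1):
            sr_crop = sr_y[m + dy:h - m + dy, m + dx:w - m + dx]
            value = psnr(gt_crop, sr_crop)
            if value > best_value:
                best_value, best_shift = value, (dy, dx)
    return best_value, best_shift

# Usage: a synthetic frame shifted by one row down and two columns left.
rng = np.random.default_rng(0)
gt = rng.integers(0, 256, size=(32, 40)).astype(np.uint8)
sr = np.roll(gt, shift=(1, -2), axis=(0, 1))
value, shift = best_shift_psnr(gt, sr)
```

Because the best shift is computed once per video, the returned `(dy, dx)` pair could be cached and reused for every bitrate of that video.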

MS-SSIM

SSIM is a metric based on structural similarity. In our benchmark, we use Multiscale SSIM (MS-SSIM), which is computed over multiple scales obtained by successive sub-sampling. We calculate MS-SSIM on the Y component in the YUV colorspace, since this method shows the best correlation with subjective scores[13].

MS-SSIM results also depend on the frame shift. We take the optimal subpixel shift found for PSNR and apply it to the input frames before calculating MS-SSIM.

For metric calculation, we use MSU VQMT[2].

VMAF

VMAF is a perceptual video quality assessment algorithm developed by Netflix. In our benchmark, we calculate VMAF on the Y component in YUV colorspace.

For metric calculation, we use MSU VQMT[2]. For VMAF we use -set "disable_clip=True" option of MSU VQMT.

LPIPS

LPIPS (Learned Perceptual Image Patch Similarity) evaluates the distance between image patches. Higher means further/more different. Lower means more similar. In our benchmark, we subtract LPIPS value from 1. Thus, more similar images have higher metric values.

To calculate LPIPS we use Perceptual Similarity Metric implementation[3] proposed in The Unreasonable Effectiveness of Deep Features as a Perceptual Metric[4].

ERQA

ERQAv2.0 (Edge Restoration Quality Assessment, version 2.0) estimates how well a model has restored edges of the high-resolution frame. This metric was developed for MSU Video Super-Resolution Benchmark 2021[5].

First, we find edges in both the output and GT frames. To do this, we use the OpenCV implementation[6] of the Canny algorithm[7]. The threshold for the initial detection of strong edges is set to 200, and the threshold for edge linking is set to 100. Then we compare these edges using an F1-score. To compensate for a one-pixel shift, output edges that are no more than one pixel away from a GT edge are considered true positives.

More information about this metric can be found at the Evaluation Methodology of MSU Video Super-Resolution Benchmark[9].
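
The edge comparison step can be illustrated with a simplified sketch. It is not the ERQAv2.0 implementation: it assumes binary edge maps have already been produced (for example by OpenCV's `cv2.Canny(frame, 100, 200)`), treats "no more than one pixel away" as membership in a 3×3 neighbourhood, and the function names are hypothetical.

```python
import numpy as np

def dilate_one_pixel(edges):
    """Mark every pixel whose 3x3 neighbourhood contains an edge pixel."""
    h, w = edges.shape
    padded = np.pad(edges, 1, mode="constant", constant_values=False)
    out = np.zeros((h, w), dtype=bool)
    for dy in range(3):
        for dx in range(3):
            out |= padded[dy:dy + h, dx:dx + w]
    return out

def edge_f1(gt_edges, sr_edges):
    """F1-score between two binary edge maps with one-pixel tolerance."""
    gt_edges = np.asarray(gt_edges, dtype=bool)
    sr_edges = np.asarray(sr_edges, dtype=bool)
    n_gt = np.count_nonzero(gt_edges)
    n_sr = np.count_nonzero(sr_edges)
    if n_gt == 0 or n_sr == 0:
        return 0.0
    # Output edge pixels that have a GT edge within one pixel.
    true_positive = np.count_nonzero(sr_edges & dilate_one_pixel(gt_edges))
    # GT edge pixels that were recovered within one pixel.
    recovered = np.count_nonzero(gt_edges & dilate_one_pixel(sr_edges))
    precision = true_positive / n_sr
    recall = recovered / n_gt
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Usage: a horizontal edge reproduced one row lower still counts as matched.
gt = np.zeros((8, 8), dtype=bool)
gt[4, :] = True
near = np.zeros((8, 8), dtype=bool)
near[5, :] = True
far = np.zeros((8, 8), dtype=bool)
far[7, :] = True
score_near = edge_f1(gt, near)
score_far = edge_f1(gt, far)
```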

Figure 2. ERQAv2.0 visualization.
White pixels are True Positive, red pixels are False Positive, blue pixels are False Negative

Other metrics

We also calculate VMAF NEG (no enhancements gain), YUV-SSIM and YUV-PSNR. More information about these metrics can be found in our upcoming enterprise report.

Codecs

To compress GT videos, we use the following codecs:

| Codec | Standard | Implementation |
|---|---|---|
| x264 | H.264 | FFmpeg version 2021-04-28-git-1ab74bc193-full_build-www.gyan.dev |
| x265 | H.265 | FFmpeg version 2021-04-28-git-1ab74bc193-full_build-www.gyan.dev |
| aomenc | AV1 | FFmpeg version 2021-04-28-git-1ab74bc193-full_build-www.gyan.dev |
| VVenC | H.266 | Fraunhofer Versatile Video Encoder[11] |
| uavs3e | AVS3 | uavs3e[12] |

Table 2: Codecs' description

For x264, x265, aomenc, and VVenC we use the -preset="medium" option.
To compress videos with uavs3e we use the following command:
uavs3e.exe -w {width} -h {height} -i {in_path} -o {out_path} -r {result_path} --rc_type 1
--rc_crf {crf} --fps_num {FPS} --fps_den 1 -d 8

Evaluation

First, we downscale our FullHD GT video to 480×270 resolution using FFmpeg. We use the flags=bicubic option to keep more information in the resulting video. Then, we compress the scaled video at six different bitrates (approximately 100, 300, 600, 1000, 2000, and 4000 kbps). The resulting videos are transcoded to .png sequences and given as input to a Super-Resolution model.

In our benchmark we test 4× upscale. However, some Super-Resolution models can only do 2× upscale. In this case, we apply them twice.
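
As a rough illustration of this preparation step, the sketch below builds three FFmpeg command lines for the x264 case: downscale, compress, and extract PNG frames. The exact commands used in the benchmark are not given, so the file names, the intermediate .y4m file, and the helper `build_commands` are assumptions. Only the 480×270 target size, bicubic scaling, the medium preset, and the target bitrates come from the text.

```python
def build_commands(gt_path, bitrate_kbps, workdir="work"):
    """Return FFmpeg argument lists for one GT video and one target bitrate.

    Assumptions: x264 via FFmpeg's libx264 encoder, hypothetical file layout,
    and an existing output directory for the PNG frames.
    """
    lowres = f"{workdir}/lowres.y4m"
    encoded = f"{workdir}/x264_{bitrate_kbps}k.mp4"
    frames = f"{workdir}/frames_{bitrate_kbps}k/%04d.png"

    # 1. Downscale the FullHD source to 480x270 with bicubic interpolation.
    downscale = ["ffmpeg", "-i", gt_path,
                 "-vf", "scale=480:270:flags=bicubic", lowres]
    # 2. Compress the low-resolution video at the requested bitrate.
    compress = ["ffmpeg", "-i", lowres, "-c:v", "libx264",
                "-preset", "medium", "-b:v", f"{bitrate_kbps}k", encoded]
    # 3. Decode the compressed video into a PNG sequence for the SR model.
    to_png = ["ffmpeg", "-i", encoded, frames]
    return [downscale, compress, to_png]

# Usage: print the commands for each target bitrate listed in the text.
for kbps in (100, 300, 600, 1000, 2000, 4000):
    for command in build_commands("gt.mp4", kbps):
        print(" ".join(command))
```

Each list could be passed to `subprocess.run(command, check=True)` on a machine where FFmpeg is installed.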

Figure 3. SR results evaluation steps

Next, we calculate each metric for each result (including the results of applying codecs without downscaling). We calculate shifted Y-PSNR, shifted Y-MS-SSIM, Y-VMAF, LPIPS, and ERQAv2.0. Then, we build RD curves (see Figure 4) and calculate BSQ-rate[8] (bitrate-for-the-same-quality rate) for each metric (see Figure 5). We take the results of applying codecs without downscaling as a reference during the calculations.
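
The idea of comparing bitrates at equal quality can be sketched as below. This is a simplified illustration in the spirit of BD-rate, not the BSQ-rate procedure defined in [8]: it assumes each RD curve is a list of (bitrate, quality) points with quality increasing with bitrate, interpolates log-bitrate linearly over the shared quality range, and averages the ratio. The function name is hypothetical.

```python
import numpy as np

def same_quality_bitrate_ratio(reference_curve, test_curve, samples=100):
    """Average ratio of test bitrate to reference bitrate at equal quality.

    Values below 1 mean the tested pipeline reaches the same quality
    at a lower bitrate than the reference.
    """
    def log_rate_at(curve, quality_grid):
        # Sort points by quality so that np.interp gets increasing x values.
        rates, qualities = zip(*sorted(curve, key=lambda point: point[1]))
        return np.interp(quality_grid, qualities, np.log(rates))

    ref_q = [q for _, q in reference_curve]
    test_q = [q for _, q in test_curve]
    low = max(min(ref_q), min(test_q))
    high = min(max(ref_q), max(test_q))
    if low >= high:
        raise ValueError("RD curves do not overlap in quality")

    grid = np.linspace(low, high, samples)
    log_ratio = log_rate_at(test_curve, grid) - log_rate_at(reference_curve, grid)
    return float(np.exp(np.mean(log_ratio)))

# Usage with made-up numbers: the test curve needs half the bitrate everywhere.
codec_only = [(100, 30.0), (300, 33.0), (600, 35.0), (1000, 37.0)]
codec_plus_sr = [(50, 30.0), (150, 33.0), (300, 35.0), (500, 37.0)]
ratio = same_quality_bitrate_ratio(codec_only, codec_plus_sr)
```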

Figure 4. RD curve

Figure 5. BSQ-rate

Subjective comparison

We conducted 5 subjective comparisons, one for each codec. We took videos compressed at 3 different bitrates (approximately 600, 1000, and 2000 kbps) and cut one crop of size 480×270 from each video. The crops were chosen based on saliency maps generated by the method[14] proposed in Contextual encoder-decoder network for visual saliency prediction[15]. We selected 10 visually different SR models for each codec. Each of the 5397 participants saw 25 video pairs and had to choose the video in each pair that looked more realistic (an “indistinguishable” option was also available). There were 3 verification questions to protect against random answers and bots. You can see the current number of valid answers in Table 1. We used these valid answers to predict the ranking using the Bradley-Terry model.
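
A Bradley-Terry ranking can be fitted from pairwise votes as sketched below. This is a minimal illustration using the standard iterative (minorization-maximization) update, not the benchmark's actual code. The handling of "indistinguishable" answers is not described in the text, so they are simply assumed to be excluded here, and the vote counts are made up.

```python
def bradley_terry(wins, iterations=200):
    """Estimate Bradley-Terry strengths from a pairwise win matrix.

    wins[i][j] is the number of times model i was preferred over model j.
    Assumption: every model has at least one win, so no strength
    collapses to zero. Returns strengths normalized to sum to 1.
    """
    n = len(wins)
    scores = [1.0 / n] * n
    for _ in range(iterations):
        updated = []
        for i in range(n):
            total_wins = sum(wins[i][j] for j in range(n) if j != i)
            denominator = sum(
                (wins[i][j] + wins[j][i]) / (scores[i] + scores[j])
                for j in range(n) if j != i
            )
            updated.append(total_wins / denominator)
        norm = sum(updated)
        scores = [value / norm for value in updated]
    return scores

# Usage with hypothetical votes for three SR models A, B, C.
votes = [
    [0, 8, 9],  # A preferred over B 8 times, over C 9 times
    [2, 0, 7],  # B preferred over A 2 times, over C 7 times
    [1, 3, 0],  # C preferred over A once, over B 3 times
]
strengths = bradley_terry(votes)
ranking = sorted(range(len(strengths)), key=lambda i: strengths[i], reverse=True)
```

Under this model, the probability that model i is preferred over model j is estimated as `strengths[i] / (strengths[i] + strengths[j])`.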

Figure 6. Crops used for subjective comparison.

Computational complexity

We ran each model on an NVIDIA Titan RTX and measured runtime on the same test sequence:

  • 100 frames
  • Input resolution — 480×270
  • Test case:
    • video — dancing
    • codec — x264
    • bitrate — 836 kbps

We calculate frames per second (FPS) as the number of frames in the sequence divided by the model's total execution time.
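
The FPS calculation amounts to the following sketch. The callable `run_model` is a hypothetical stand-in for a Super-Resolution model, and wall-clock timing with `time.perf_counter` is an assumption about how runtime might be measured.

```python
import time

def measure_fps(run_model, frames):
    """Return frames per second: number of frames / total execution time."""
    start = time.perf_counter()
    run_model(frames)
    elapsed = time.perf_counter() - start
    return len(frames) / elapsed

# Usage: a dummy "model" that just sleeps, applied to 100 placeholder frames.
dummy_frames = list(range(100))
fps = measure_fps(lambda frames: time.sleep(0.01), dummy_frames)
```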

References

  1. https://compression.ru/video/codec_comparison/2021/
  2. http://compression.ru/video/quality_measure/video_measurement_tool.html
  3. https://github.com/richzhang/PerceptualSimilarity
  4. R. Zhang, P. Isola, A. A. Efros, E. Shechtman, O. Wang, "The unreasonable effectiveness of deep features as a perceptual metric," in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 586-595.
  5. https://videoprocessing.ai/benchmarks/video-super-resolution.html
  6. https://docs.opencv.org/3.4/dd/d1a/group__imgproc__feature.html#ga04723e007ed888ddf11d9ba04e2232de
  7. https://en.wikipedia.org/wiki/Canny_edge_detector
  8. A. V. Zvezdakova, D. L. Kulikov, S. V. Zvezdakov, D. S. Vatolin, "BSQ-rate: a new approach for video-codec performance comparison and drawbacks of current solutions," in Programming and computer software, vol. 46, 2020, pp.183-194.
  9. https://videoprocessing.ai/benchmarks/video-super-resolution-methodology.html
  10. http://app.subjectify.us/
  11. https://github.com/fraunhoferhhi/vvenc
  12. https://github.com/uavs3/uavs3e
  13. A. Antsiferova, A. Yakovenko, N. Safonov, D. Kulikov, A. Gushin, D. Vatolin, "Objective video quality metrics application to video codecs comparisons: choosing the best for subjective quality estimation," in arXiv preprint arXiv:2107.10220, 2021.
  14. https://github.com/alexanderkroner/saliency
  15. A. Kroner, M. Senden, K. Driessens, and R. Goebel, "Contextual encoder-decoder network for visual saliency prediction," in Neural Networks, 129, pp. 261-270, 2020.
14 Mar 2022