Evaluation methodology of MSU Super-Resolution for Video Compression Benchmark 2021

You can read the Methodology below or download the presentation in PDF format here.
You can also view it in Google Slides here.

Benchmark Statistics

Date: 31.08.2021
Participants: 13
Sequences: 3
Codecs: 5
Valid answers in subjective comparison: 57943

Table 1: Benchmark statistics

Problem definition

Super-Resolution is the process of calculating high-resolution samples from their low-resolution counterparts. Thanks to the rapid development of Video Super-Resolution technologies, they are now used together with video codecs.

Different SR models have different bitrate/quality tradeoffs when working with compressed video sequences. Of two SR models that produce results of the same subjective quality, the one that works with the lower-bitrate input is better. Our benchmark aims to find the best Video Super-Resolution algorithm according to this criterion.

We are currently testing only 2x upscale, but we plan to test 4x upscale as well.

Dataset

Our dataset is constantly being updated; you can see the current number of sequences in Table 1. Each FullHD video in YUV format is compressed at 7 different bitrates using 5 different codecs. The videos were taken from the MSU codecs comparison[1] 2019 and 2020 test sets. The dataset contains videos in FullHD resolution with frame rates from 24 to 30 FPS.

All videos have low SI/TI values and simple textures. This choice minimizes the compression artifacts that may occur, so that the restoration of details remains possible.
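
For reference, SI and TI here are the Spatial and Temporal Information measures defined in ITU-T P.910. Below is a minimal sketch of how they can be computed with OpenCV; it is an illustration only, not the exact tool used for dataset selection:

```python
import cv2
import numpy as np

def si_ti(frames):
    # frames: list of grayscale frames as float32 numpy arrays (at least two frames).
    # Spatial Information: std of the Sobel gradient magnitude, maximized over frames.
    si = max(
        float(np.std(np.hypot(cv2.Sobel(f, cv2.CV_32F, 1, 0),
                              cv2.Sobel(f, cv2.CV_32F, 0, 1))))
        for f in frames
    )
    # Temporal Information: std of the difference between consecutive frames,
    # maximized over frame pairs.
    ti = max(float(np.std(curr - prev)) for prev, curr in zip(frames, frames[1:]))
    return si, ti
```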

Figure 1. Segments from the dataset

Animation clip: 2D animation advertising clip drawn in bright colors. FullHD, 100 frames, 30 fps, 104.58 Mbps.

Skiing learning: People are being trained to ski in slow motion. FullHD, 179 frames, 24 fps, 107.59 Mbps.

Street show: Two men sing, dance and perform some acrobatics on a street. FullHD, 200 frames, 24 fps, 108.40 Mbps.

Table 2: Dataset characteristics

Metrics

PSNR

PSNR is a commonly used metric for reconstruction quality for images and video. In our benchmark, we calculate PSNR on the Y component in YUV colorspace.

Since some Super-Resolution models can generate images with a global shift relative to GT, we calculate shifted PSNR. We check each shift in the range [-3, 3] (including subpixel shifts) for both axes and select the highest PSNR value among these shifts. We noticed that SRs’ results on the same video decoded with different bitrates usually have the same global shift. Thus we calculate the best shift only once for each video.

For metric calculation, we use MSU VQMT[2].
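
As an illustration, here is a minimal sketch of the shift search on the Y plane for integer shifts only (the benchmark itself also checks subpixel shifts and relies on MSU VQMT for the actual computation):

```python
import numpy as np

def psnr(a, b):
    # PSNR between two 8-bit planes.
    mse = np.mean((a.astype(np.float64) - b.astype(np.float64)) ** 2)
    return 100.0 if mse == 0 else 10 * np.log10(255.0 ** 2 / mse)

def shifted_psnr(gt_y, out_y, max_shift=3):
    # Try every integer shift in [-max_shift, max_shift] on both axes
    # and keep the best PSNR over the overlapping region.
    best = -np.inf
    h, w = gt_y.shape
    for dy in range(-max_shift, max_shift + 1):
        for dx in range(-max_shift, max_shift + 1):
            gt = gt_y[max(dy, 0):h + min(dy, 0), max(dx, 0):w + min(dx, 0)]
            out = out_y[max(-dy, 0):h + min(-dy, 0), max(-dx, 0):w + min(-dx, 0)]
            best = max(best, psnr(gt, out))
    return best
```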

MS-SSIM

SSIM is a metric based on structural similarity. In our benchmark, we use Multiscale SSIM (MS-SSIM), which is computed over multiple scales through successive stages of sub-sampling. We calculate MS-SSIM on all 3 components of the YUV colorspace, and the metric result is calculated as (4Y + U + V) / 6[13], where Y, U, and V are the MS-SSIM values on the Y, U, and V components respectively.

MS-SSIM results also depend on the shift of the frames. We take the optimal subpixel shift found for PSNR and apply it to the input frames before calculating MS-SSIM.

For metric calculation, we use MSU VQMT[2].
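
The per-plane combination itself is straightforward; here is a minimal sketch assuming the per-plane MS-SSIM values have already been computed (for example, by MSU VQMT):

```python
def yuv_ms_ssim(ms_ssim_y, ms_ssim_u, ms_ssim_v):
    # Weighted combination used in the benchmark: (4*Y + U + V) / 6 [13]
    return (4 * ms_ssim_y + ms_ssim_u + ms_ssim_v) / 6
```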

VMAF

VMAF is a perceptual video quality assessment algorithm developed by Netflix. We use both VMAF and VMAF NEG (no enhancement gain) in our benchmark.

For metric calculation, we use MSU VQMT[2]. For VMAF, we use the -set "disable_clip=True" option of MSU VQMT.

Shifted VMAF and VMAF NEG give less than 1% gain relative to the unshifted versions, which is why we use the unshifted versions in our benchmark. Figures 2a and 2b show the gain each model gets from shifted VMAF and VMAF NEG relative to the unshifted versions.

Figure 2a. Shifted VMAF gain of each model

Figure 2b. Shifted VMAF NEG gain of each model

LPIPS

LPIPS (Learned Perceptual Image Patch Similarity) evaluates the distance between image patches: higher values mean more different, lower values mean more similar. In our benchmark, we subtract the LPIPS value from 1, so more similar images get higher metric values.

To calculate LPIPS, we use the Perceptual Similarity Metric implementation[3] proposed in The Unreasonable Effectiveness of Deep Features as a Perceptual Metric[4].
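
For example, with the lpips Python package from the reference implementation[3], the value reported in the benchmark could be obtained roughly as follows (a sketch; the exact backbone and preprocessing used in the benchmark may differ):

```python
import lpips
import torch

loss_fn = lpips.LPIPS(net='alex')  # AlexNet backbone from the reference implementation

def similarity_score(img0, img1):
    # img0, img1: RGB tensors of shape (N, 3, H, W), scaled to [-1, 1].
    with torch.no_grad():
        distance = loss_fn(img0, img1)   # LPIPS distance: lower = more similar
    return 1.0 - distance.item()         # benchmark reports 1 - LPIPS: higher = more similar
```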

We have also noticed that shifted LPIPS gives less than 1% gain relative to the unshifted version, as you can see in Figure 3.

Figure 3. Shifted LPIPS gain of each model

ERQA

ERQAv1.0 (Edge Restoration Quality Assessment, version 1.0) estimates how well a model has restored edges of the high-resolution frame. This metric was developed for MSU Video Super-Resolution Benchmark 2021[5].

First, we find edges in both the output and GT frames using the OpenCV implementation[6] of the Canny algorithm[7]: the threshold for the initial detection of strong edges is set to 200, and the threshold for edge linking is set to 100. Then we compare these edge maps using an F1-score. To compensate for small shifts, output edge pixels that are no more than one pixel away from a GT edge pixel are counted as true positives.
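
Below is a simplified sketch of this procedure; it is not the reference ERQAv1.0 implementation, which may differ in details such as how the one-pixel tolerance is applied:

```python
import cv2
import numpy as np

def erqa_like_f1(output_frame, gt_frame):
    # output_frame, gt_frame: grayscale uint8 frames of equal size.
    # Canny thresholds: 100 for edge linking, 200 for strong edges.
    out_edges = cv2.Canny(output_frame, 100, 200) > 0
    gt_edges = cv2.Canny(gt_frame, 100, 200) > 0

    # One-pixel tolerance: dilate each edge map so a pixel matches
    # anything within one pixel of it.
    kernel = np.ones((3, 3), np.uint8)
    gt_dilated = cv2.dilate(gt_edges.astype(np.uint8), kernel) > 0
    out_dilated = cv2.dilate(out_edges.astype(np.uint8), kernel) > 0

    true_positive = np.logical_and(out_edges, gt_dilated).sum()
    false_positive = np.logical_and(out_edges, ~gt_dilated).sum()
    false_negative = np.logical_and(gt_edges, ~out_dilated).sum()

    precision = true_positive / max(true_positive + false_positive, 1)
    recall = true_positive / max(true_positive + false_negative, 1)
    return 2 * precision * recall / max(precision + recall, 1e-9)
```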

More information about this metric can be found in the Evaluation Methodology of the MSU Video Super-Resolution Benchmark[9].

Figure 4. ERQAv1.0 visualization.
White pixels are True Positive, red pixels are False Positive, blue pixels are False Negative

Codecs

To compress GT videos, we use the following codecs:

Codec    Standard   Implementation
x264     H.264      FFmpeg version 4.2.4
x265     H.265      FFmpeg version 4.2.4
aomenc   AV1        FFmpeg version 4.2.4
VVenC    H.266      Fraunhofer Versatile Video Encoder[11]
uavs3e   AVS3       uavs3e[12]

Table 3: Codecs' description

For x264, x265, aomenc, and VVenC we use the -preset medium option.

Evaluation

First, we downscale our FullHD GT video to 960×540 resolution using FFmpeg. We use the flags=gauss option to keep more information in the resulting video. Then, we compress the scaled video at seven different bitrates (approximately 100, 300, 600, 1000, 2000, 4000, and 6000 kbps). The resulting videos are transcoded to .png sequences and given as input to a Super-Resolution model.
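
Here is a hedged sketch of this preparation step, driving FFmpeg from Python. The file names and the exact FFmpeg invocation are illustrative; the benchmark's actual commands may differ:

```python
import subprocess

def prepare_lr_input(gt_path, out_path, bitrate_kbps):
    # Downscale FullHD GT to 960x540 with a Gaussian kernel and compress with x264
    # (medium preset) at the target bitrate. For raw .yuv input, FFmpeg would also
    # need -s, -pix_fmt and -r flags before -i.
    subprocess.run([
        "ffmpeg", "-y", "-i", gt_path,
        "-vf", "scale=960:540:flags=gauss",
        "-c:v", "libx264", "-preset", "medium",
        "-b:v", f"{bitrate_kbps}k",
        out_path,
    ], check=True)

for bitrate in (100, 300, 600, 1000, 2000, 4000, 6000):
    prepare_lr_input("gt_fullhd.y4m", f"lr_{bitrate}kbps.mp4", bitrate)
```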

Our benchmark tests 2x upscaling; however, some Super-Resolution models can only do 4x upscaling. In this case, we downscale these models' results by a factor of two using FFmpeg with the flags=gauss option.

Figure 5. SR results evaluation steps

We also compress the FullHD GT video without scaling to produce the “only compressed” results.

Figure 6. "only compressed" evaluation steps

Next, we calculate each metric for each result (including “only compressed”). We calculate shifted Y-PSNR, shifted YUV-MS-SSIM, VMAF, VMAF NEG, LPIPS, and ERQAv1.0. Then, we build RD curves (see Figure 7) and calculate BSQ-rate[8] (bitrate-for-the-same-quality rate) for each metric (see Figure 8). We take the “only compressed” result as a reference during the calculations.

Figure 7. RD curve

Figure 8. BSQ-rate
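
Conceptually, the BSQ-rate compares how much bitrate the tested SR+codec pipeline needs to reach the same quality as the reference. Below is a simplified sketch of this computation from two RD curves; the actual procedure follows [8] and may differ in details such as the interpolation used:

```python
import numpy as np

def bsq_rate(ref_curve, test_curve, n_points=100):
    # ref_curve, test_curve: lists of (bitrate_kbps, quality) points forming RD curves.
    # Average ratio of the bitrate the tested pipeline needs to the bitrate the
    # reference needs, over the quality range covered by both curves.
    ref_b, ref_q = map(np.array, zip(*sorted(ref_curve, key=lambda p: p[1])))
    test_b, test_q = map(np.array, zip(*sorted(test_curve, key=lambda p: p[1])))

    q_low = max(ref_q.min(), test_q.min())
    q_high = min(ref_q.max(), test_q.max())
    qualities = np.linspace(q_low, q_high, n_points)

    # Interpolate bitrate as a function of quality on both curves.
    ref_bitrates = np.interp(qualities, ref_q, ref_b)
    test_bitrates = np.interp(qualities, test_q, test_b)
    return float(np.mean(test_bitrates / ref_bitrates))  # < 1 means less bitrate needed
```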

There are 3 ways we calculate BSQ-rate:

  1. Relative to x264 - for each codec, we calculate the average BSQ-rate relative to the “only compressed” result produced by the x264 codec;
  2. Relative to self - for each codec, we calculate the average BSQ-rate relative to the “only compressed” result produced by this codec;
  3. Max. relative to self - for each codec, we calculate the BSQ-rate relative to the “only compressed” result produced by this codec and take the best result over all sequences.

Subjective comparison

For subjective comparison, we have chosen 1 codec (x264) and 3 different bitrates (1000, 2000, and 4000 kbps). We cut sequences of 24 frames and converted them to 8 fps videos with FFmpeg. Then we took 2 crops with resolution 320×270 from each video and conducted a side-by-side subjective comparison of all these pieces on Subjectify.us[10]. Each of the 1934 participants saw 25 video pairs and had to choose the clearer one in each pair (an “indistinguishable” option was also available). There were 3 verification questions to protect against random answers and bots. You can see the current number of valid answers in Table 1. We used these valid answers to predict the ranking using the Bradley-Terry model.

Figure 9. Crops used for subjective comparison
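
For illustration, here is a minimal sketch of fitting Bradley-Terry scores from a pairwise win matrix with the classic iterative (Zermelo) algorithm. The actual ranking is computed from the Subjectify.us answers, which may handle ties and the “indistinguishable” option differently:

```python
import numpy as np

def bradley_terry(wins, n_iter=100):
    # wins[i, j] = number of times item i was preferred over item j.
    # Returns normalized Bradley-Terry scores for all items.
    n = wins.shape[0]
    p = np.ones(n)
    for _ in range(n_iter):
        for i in range(n):
            numerator = wins[i].sum()
            denominator = sum((wins[i, j] + wins[j, i]) / (p[i] + p[j])
                              for j in range(n) if j != i)
            p[i] = numerator / max(denominator, 1e-12)
        p /= p.sum()  # fix the scale, since scores are defined up to a constant factor
    return p
```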

Subjective BSQ-rate calculation

To calculate the subjective BSQ-rate, we extrapolate the subjective results using the most similar objective metric. To do this, we take the subjective results on the 3 bitrates used for the subjective comparison, find the objective metric that has the highest correlation with the subjective one on the same bitrates, and extrapolate the subjective metric using this objective metric as a reference (see Figures 10a and 10b).

Figure 10a. Subjective metric extrapolation

Figure 10b. The most similar objective metric
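
Here is a hedged sketch of this extrapolation step; the function and variable names are illustrative, and the benchmark's exact fitting procedure may differ:

```python
import numpy as np

def extrapolate_subjective(subjective, objective_at_subj_bitrates, objective_all_bitrates):
    # subjective: subjective scores at the bitrates used in the comparison.
    # objective_at_subj_bitrates: {metric_name: scores at those same bitrates}.
    # objective_all_bitrates: {metric_name: scores at all tested bitrates}.
    # Pick the objective metric most correlated with the subjective scores and
    # map it to the subjective scale with a least-squares linear fit.
    best = max(objective_at_subj_bitrates,
               key=lambda m: abs(np.corrcoef(subjective, objective_at_subj_bitrates[m])[0, 1]))
    slope, intercept = np.polyfit(objective_at_subj_bitrates[best], subjective, deg=1)
    return best, slope * np.array(objective_all_bitrates[best]) + intercept
```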

Computational complexity

We ran each model on an NVIDIA Titan RTX and measured its runtime on the same test sequence.

We calculate seconds per iteration (s/it) as the total model runtime divided by the number of frames in the sequence.

Future work

In the future, we will maintain our benchmark in the following ways:

References

  1. http://compression.ru/video/codec_comparison/codec_comparison_en.html
  2. http://compression.ru/video/quality_measure/video_measurement_tool.html
  3. https://github.com/richzhang/PerceptualSimilarity
  4. R. Zhang, P. Isola, A. A. Efros, E. Shechtman, O. Wang, "The unreasonable effectiveness of deep features as a perceptual metric," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 586-595.
  5. https://videoprocessing.ai/benchmarks/video-super-resolution.html
  6. https://docs.opencv.org/3.4/dd/d1a/group__imgproc__feature.html#ga04723e007ed888ddf11d9ba04e2232de
  7. https://en.wikipedia.org/wiki/Canny_edge_detector
  8. A. V. Zvezdakova, D. L. Kulikov, S. V. Zvezdakov, D. S. Vatolin, "BSQ-rate: a new approach for video-codec performance comparison and drawbacks of current solutions," Programming and computer software, vol. 46, 2020, pp.183-194.
  9. https://videoprocessing.ai/benchmarks/video-super-resolution-methodology.html
  10. http://app.subjectify.us/
  11. https://github.com/fraunhoferhhi/vvenc
  12. https://github.com/uavs3/uavs3e
  13. A. Antsiferova, A. Yakovenko, N. Safonov, D. Kulikov, A. Gushin, D. Vatolin, "Objective video quality metrics application to video codecs comparisons: choosing the best for subjective quality estimation," arXiv preprint arXiv:2107.10220, 2021.