Evaluation methodology of MSU Video Super Resolution Benchmark 2021

You can read the Methodology below or download the presentation in pdf format here.
You can also view it in Google Slides here.

Problem definition

Super Resolution is the process of computing high-resolution samples from their low-resolution counterparts. When working with single images, a model can rely on natural-image priors and produce a high-resolution image that is only plausibly similar to the real one. With video, a model can use additional information from neighboring frames and restore details of the original scene. Our benchmark aims to find the algorithms that best restore real details during Video Super Resolution processing.


Content types

To analyze models’ ability to restore real details, we have built a test-stand (see Figure 1), which includes different hard patterns for video restoration tasks:

Figure 1. The test-stand of the benchmark.

To calculate metrics on particular types of content and verify how models work with different input, we divide each output frame into parts by detection of crosses:
Part 1. “Board” includes a few small objects and photos of people's faces*. We want to see models’ results on textures with small details. The striped fabric and balls of yarn may produce a Moire pattern (see Figure 2). Restoration of people's faces is important for video surveillance.
Part 2. “QR-codes” consists of several QR-codes of different sizes to find the size of the smallest one which can be detected in the models' output frame. In a low-resolution frame, QR-code patterns may be blended. Thus models have difficulties restoring patterns.
Part 3. “Text” includes two types of text: handwriting and a set of symbols. It is hard to cover all of these difficult patterns in a training dataset, so all of them are new to the model, which still has to restore them.
Part 4. “Metal paper” contains foil that was intensively crumpled. It’s interesting because of light reflection, which strongly but periodically changes between frames.
Part 5. “Color lines” is a printed image with a large number of thin color lines. This is difficult for models because thin lines of similar colors are mixed in a low-resolution frame.
Part 6. “Car numbers” consists of a set of car number plates from different countries and of different sizes**. This content type is important for video surveillance and dashcam development.
Part 7. “Noise” includes difficult noise patterns. Models cannot restore real ground-truth noise and each one produces a unique pattern.
Part 8. “Mira” includes difficult patterns for video restoration: a set of straight and curved lines of different thicknesses and in different directions.
*Photos were generated by [1].
**Car numbers are generated randomly and printed on paper.

Figure 2. Example of a Moire pattern on the “Board”.

Motion types

The dataset includes three videos with different types of motion:

Technical characteristics of the camera

The dataset was captured with a Canon EOS 7D. We shoot rapid series of photos and treat each series as a sequence of video frames. We store each video as a sequence of frames in PNG format, converted from JPG with FFmpeg. The output of a model is also a sequence of frames, which we compare with the ground-truth (GT) sequence of frames to evaluate the model's performance. Camera settings:
ISO – 4000
aperture – 400
resolution – 5184x3456

Dataset preparation

Noise input

To verify how a model works with noisy data, we prepared noisy counterparts for each input video. To generate realistic noise, we use a Python implementation [2] of the noise model proposed in CBDNet by Guo et al. [3]. The model has two parameters: one for the Poisson component of the noise and another for the Gaussian component.
To estimate the level of real noise in our camera, we mounted the camera on a tripod and captured a sequence of 100 frames from a fixed point. We then averaged the sequence to estimate a clean image, which gives us a hundred examples of real noise. Finally, we chose the parameters of the generated noise so that the distributions of generated and real noise are similar (see Figure 3). Our choice of parameters: sigma_s = 0.001, sigma_c = 0.035.

Figure 3. The distribution of real and generated noise.
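The two steps above can be sketched in a few lines of NumPy. This is a simplified illustration, not the implementation from [2]: it assumes intensities scaled to [0,1], uses the heteroscedastic Gaussian form of the CBDNet noise model (variance = L * sigma_s^2 + sigma_c^2), and skips the camera-pipeline simulation. The function names are hypothetical.

```python
import numpy as np

def estimate_real_noise(frames):
    """Average a static-scene stack (N, H, W[, C]) to estimate the clean
    image, and return it together with the per-frame noise residuals."""
    stack = np.asarray(frames, dtype=np.float64)
    clean = stack.mean(axis=0)
    return clean, stack - clean

def add_signal_dependent_noise(image, sigma_s=0.001, sigma_c=0.035, rng=None):
    """Add noise with variance L * sigma_s**2 + sigma_c**2 to an image in [0, 1].

    Assumption: simplified form of the CBDNet noise model, no ISP simulation.
    """
    rng = np.random.default_rng() if rng is None else rng
    image = np.asarray(image, dtype=np.float64)
    variance = image * sigma_s ** 2 + sigma_c ** 2
    noisy = image + rng.normal(size=image.shape) * np.sqrt(variance)
    return np.clip(noisy, 0.0, 1.0)
```

Comparing a histogram of the residuals from `estimate_real_noise` with a histogram of `noisy - image` is one way to check whether the chosen parameters match the camera.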

Finally, we have 12 tests:



PSNR – a commonly used metric based on pixel similarity. We noticed that a model trained on one degradation type and tested on another can generate frames with a global shift relative to GT (see Figure 4). We therefore check integer shifts in [-3,3] along both axes and choose the shift with the maximal PSNR value. This maximal value is taken as the metric result in our benchmark.

Figure 4. On the left: The same crop from the model’s output and GT frame.
On the right: PSNR visualization for this crop.

We chose PSNR-Y because it is cheaper to compute than PSNR-RGB, while the correlation between the two metrics is high. For metric calculation, we use the implementation from skimage.metrics [4]. A higher metric value indicates better quality. The metric value for GT is infinite.
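The shift search described above can be sketched as follows. This is an illustration under stated assumptions, not the benchmark's code: the inputs are taken to be single-channel (Y) arrays of equal size, PSNR is computed directly in NumPy rather than through skimage.metrics, and each candidate shift is scored on the overlapping region only. The function names are hypothetical.

```python
import numpy as np

def psnr(a, b, data_range=255.0):
    """PSNR between two equally sized arrays; infinite for identical inputs."""
    mse = np.mean((a.astype(np.float64) - b.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(data_range ** 2 / mse)

def overlap(gt, out, dy, dx):
    """Crop both frames to the region where out[y, x] pairs with gt[y + dy, x + dx]."""
    h, w = gt.shape[:2]
    gt_crop = gt[max(dy, 0):h + min(dy, 0), max(dx, 0):w + min(dx, 0)]
    out_crop = out[max(-dy, 0):h + min(-dy, 0), max(-dx, 0):w + min(-dx, 0)]
    return gt_crop, out_crop

def best_shift_psnr(gt, out, max_shift=3, data_range=255.0):
    """Try every integer shift in [-max_shift, max_shift]^2 and keep the best PSNR."""
    best_value, best_shift = -np.inf, (0, 0)
    for dy in range(-max_shift, max_shift + 1):
        for dx in range(-max_shift, max_shift + 1):
            gt_crop, out_crop = overlap(gt, out, dy, dx)
            value = psnr(gt_crop, out_crop, data_range)
            if value > best_value:
                best_value, best_shift = value, (dy, dx)
    return best_value, best_shift
```

Cropping to the overlap avoids comparing wrapped-around or padded pixels, at the cost of scoring slightly different regions for different shifts.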


SSIM – another commonly used metric, based on structural similarity. A shift between frames can influence this metric too. We therefore searched for the optimal shift in the same way as for PSNR and noticed that the optimal shifts for the two metrics can differ, but by no more than 1 pixel along any axis. Because SSIM is computationally expensive, we search for its optimal shift not among all shifts but only near the optimal PSNR shift (within 1 pixel along each axis). We calculate SSIM on the Y channel of the YUV color space. For metric calculation, we use the implementation from skimage.metrics [5]. A higher metric value indicates better quality. The metric value for GT is 1.
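The restricted search around the PSNR-optimal shift might look like the sketch below. To keep it self-contained, the metric is passed in as a callable; in practice this is where a Y-channel SSIM function such as skimage.metrics.structural_similarity would be plugged in. The function names and the `center` argument (the shift found for PSNR) are illustrative assumptions.

```python
import numpy as np

def overlap(gt, out, dy, dx):
    """Crop both frames to the region where out[y, x] pairs with gt[y + dy, x + dx]."""
    h, w = gt.shape[:2]
    gt_crop = gt[max(dy, 0):h + min(dy, 0), max(dx, 0):w + min(dx, 0)]
    out_crop = out[max(-dy, 0):h + min(-dy, 0), max(-dx, 0):w + min(-dx, 0)]
    return gt_crop, out_crop

def refine_shift(gt, out, center, metric, radius=1):
    """Evaluate `metric` only on shifts within `radius` pixels of `center`.

    `metric(gt_crop, out_crop)` must return a score where higher is better.
    """
    cy, cx = center
    best_score, best_shift = -np.inf, center
    for dy in range(cy - radius, cy + radius + 1):
        for dx in range(cx - radius, cx + radius + 1):
            score = metric(*overlap(gt, out, dy, dx))
            if score > best_score:
                best_score, best_shift = score, (dy, dx)
    return best_score, best_shift
```

With a radius of 1 this evaluates 9 shifts instead of the 49 in the full [-3,3] grid.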


ERQAv1.0 (Edge Restoration Quality Assessment, version 1.0) estimates how well a model has restored the edges of the high-resolution frame. First, we find edges in both the output and GT frames using the OpenCV implementation [6] of the Canny algorithm [7]. The threshold for the initial finding of strong edges is set to 200, and the threshold for edge linking is set to 100. These values highlight the edges of all objects, even small ones, while skipping lines that are not important (see Figure 5).

Figure 5. An example of edges, highlighted by the chosen algorithm

Then we compare these edges using the F1-score. To compensate for a one-pixel shift of an edge, which is not essential for human perception of objects, we also count as true positives those pixels of the output's edges that are not on GT edges but lie within one pixel of a GT edge (see Figure 6). A higher metric value indicates better quality. The metric value for GT is 1.

Figure 6. Visualization of F1-score, used for edges comparison
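A minimal sketch of the tolerant F1 comparison is shown below. It assumes the edge maps are already available as binary arrays (for example from OpenCV's Canny detector with the thresholds given above). Two details are assumptions rather than statements from the methodology: "within one pixel" is interpreted as the 8-neighborhood, and a GT edge pixel counts as a false negative only when no output edge pixel lies within one pixel of it.

```python
import numpy as np

def dilate3x3(mask):
    """Binary dilation with a 3x3 square, i.e. a one-pixel tolerance zone."""
    h, w = mask.shape
    padded = np.pad(mask, 1, mode="constant", constant_values=False)
    result = np.zeros_like(mask)
    for dy in range(3):
        for dx in range(3):
            result |= padded[dy:dy + h, dx:dx + w]
    return result

def edge_f1(out_edges, gt_edges):
    """F1-score between two binary edge maps with one-pixel tolerance."""
    out_edges = np.asarray(out_edges).astype(bool)
    gt_edges = np.asarray(gt_edges).astype(bool)
    # Output edge pixels on or next to a GT edge are true positives.
    tp = np.count_nonzero(out_edges & dilate3x3(gt_edges))
    fp = np.count_nonzero(out_edges) - tp
    # Assumption: a GT edge pixel is missed if no output edge is within one pixel.
    fn = np.count_nonzero(gt_edges & ~dilate3x3(out_edges))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

With this definition, an edge displaced by one pixel still scores 1, while an edge displaced by two pixels scores 0.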


QRCRv1.0 (QR-Codes Restoration, version 1.0) finds the smallest size (in pixels) of a QR-code that can be detected in the output frames of a model. To map metric values onto [0,1], we take the ratio of the smallest detected QR-code sizes in the GT and output frames (see Figure 7). If no QR-code can be detected in the model's result, the metric value is set to 0. A higher metric value indicates better quality. The metric value for GT is 1.

Figure 7. Example of detected crosses in output and GT frame.
The metric value for the output frame is 0.65
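The ratio can be written down directly once the detected QR-code sizes are known. The sketch below assumes the sizes (in pixels) have already been produced by some QR detector, and that the ratio is taken as smallest GT size over smallest output size; the function name and the sizes in the usage example are hypothetical.

```python
def qrcr(gt_sizes, out_sizes):
    """Ratio of the smallest QR-code size detected in GT to the smallest
    size detected in the output frame; 0 if nothing was detected in the output."""
    if not out_sizes:
        return 0.0
    return min(gt_sizes) / min(out_sizes)

# Hypothetical sizes: the smallest code detected in GT is 13 px wide,
# while the smallest one detected in the model's output is 20 px wide.
score = qrcr([13, 20, 40], [20, 40])
```

The value stays within [0,1] as long as the output never yields a detection smaller than the smallest one found in GT.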


CRRMv1.0 (Colorfulness Reduced-Reference Metric, version 1.0) calculates colorfulness* in both frames and compares them. To calculate colorfulness, we use the metric proposed by Hasler et al. [8]. The colorfulness levels are compared as the ratio between the colorfulness of the GT frame and that of the output frame. Then, to map the metric onto [0,1] and penalize both an increase and a decrease in colorfulness, we take the absolute difference between 1 and this ratio and subtract it from 1. A higher metric value indicates better quality. The metric value for GT is 1.
*Colorfulness measures how colorful an image is: whether it is bright and has many different colors.
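As a rough sketch, the computation can be expressed with the Hasler and Suesstrunk colorfulness measure on RGB arrays. The direction of the ratio (GT over output) and the guard that returns 0 for an output with zero colorfulness are assumptions made for this illustration; the function names are hypothetical.

```python
import numpy as np

def colorfulness(rgb):
    """Hasler-Suesstrunk colorfulness of an (H, W, 3) RGB image."""
    rgb = np.asarray(rgb, dtype=np.float64)
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    rg = r - g                    # red-green opponent channel
    yb = 0.5 * (r + g) - b        # yellow-blue opponent channel
    std = np.hypot(rg.std(), yb.std())
    mean = np.hypot(rg.mean(), yb.mean())
    return std + 0.3 * mean

def crrm(gt_rgb, out_rgb):
    """1 - |1 - C_gt / C_out|, where C is the colorfulness of a frame."""
    c_gt, c_out = colorfulness(gt_rgb), colorfulness(out_rgb)
    if c_out == 0:
        # Assumption: a completely colorless output gets the worst score.
        return 0.0
    return 1.0 - abs(1.0 - c_gt / c_out)
```

Note that under this reading the expression drops below 0 when the output is less than half as colorful as GT, so an implementation that must stay within [0,1] would need additional handling for that case.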

Metrics accumulation

Because each model can work differently on different content types, we consider metric values not only on full-frame but also on parts with different content. To do this we detect crosses in frames and calculate coordinates of all parts from them.
Crosses in some frames are distorted and cannot be detected. We therefore choose keyframes in which all crosses can be detected and calculate metrics only on these keyframes. We noticed that metric values on these keyframes are highly correlated, so we take the mean over keyframes as the final metric value for each test case.

Subjective comparison

We cut a sequence of 30 frames and convert it to video at 8 fps with FFmpeg. Then we crop 10 small pieces from the video and conduct a side-by-side subjective comparison of all these pieces using subjectify.us. Each participant sees twenty-five pairs of videos and has to choose the video with better-restored details in each pair (the option “indistinguishable” is also available). Three of these pairs are verification questions*. The remaining answers of participants who passed verification are used to predict the ranking using the Bradley-Terry model.
*Answers to verification questions are not included in the final result.
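For readers unfamiliar with the Bradley-Terry model, the sketch below fits it to a matrix of pairwise wins using the standard minorization-maximization iteration. It illustrates the model only and is not the procedure used by subjectify.us. It assumes that `wins[i][j]` holds the number of times model i was preferred over model j, that "indistinguishable" answers were handled beforehand (for example, split as half a win each), and that every model took part in at least one comparison.

```python
import numpy as np

def bradley_terry(wins, iters=500, tol=1e-10):
    """Estimate Bradley-Terry strengths from a square matrix of pairwise wins.

    Returns strengths normalized to sum to 1; a larger value means the model
    is preferred more often.
    """
    wins = np.asarray(wins, dtype=np.float64)
    games = wins + wins.T                 # comparisons played by each pair
    total_wins = wins.sum(axis=1)
    p = np.full(len(wins), 1.0 / len(wins))
    for _ in range(iters):
        denom = games / (p[:, None] + p[None, :])
        np.fill_diagonal(denom, 0.0)      # a model is never compared with itself
        new_p = total_wins / denom.sum(axis=1)
        new_p /= new_p.sum()
        converged = np.max(np.abs(new_p - p)) < tol
        p = new_p
        if converged:
            break
    return p
```

Sorting the models by the returned strengths gives the predicted subjective ranking.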

The computational complexity of models

We tested each model using NVIDIA Titan RTX and measured runtime on the same test sequence:

FPS is calculated as the number of frames in the sequence divided by the total execution time of the model.


  1. https://thispersondoesnotexist.com
  2. https://github.com/yzhouas/CBDNet_ISP
  3. Shi Guo, Zifei Yan, Kai Zhang, Wangmeng Zuo and Lei Zhang, "Toward Convolutional Blind Denoising of Real Photographs," 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 1712-1722, doi: 10.1109/CVPR.2019.00181.
  4. https://scikit-image.org/docs/stable/api/skimage.metrics.html#skimage.metrics.peak_signal_noise_ratio
  5. https://scikit-image.org/docs/stable/api/skimage.metrics.html#skimage.metrics.structural_similarity
  6. https://docs.opencv.org/3.4/dd/d1a/group__imgproc__feature.html#ga04723e007ed888ddf11d9ba04e2232de
  7. https://en.wikipedia.org/wiki/Canny_edge_detector
  8. David Hasler and Sabine Suesstrunk, "Measuring Colourfulness in Natural Images," Proceedings of SPIE - The International Society for Optical Engineering, 2003, volume 5007, pp. 87-95, doi: 10.1117/12.477378.
26 Apr 2021