Evaluation methodology of MSU Video Super Resolution Benchmark 2021
Super Resolution is the process of calculating high-resolution samples from their low-resolution counterparts. Working with images, we can rely on natural-image priors to produce a high-resolution image that only resembles the real one. With video, we can use additional information from neighboring frames to restore details of the original scene. Our benchmark aims to find the algorithms that best restore real details during Video Super Resolution processing.
To analyze models’ ability to restore real details, we have built a test-stand (see Figure 1), which includes different hard patterns for video restoration tasks:
Figure 1. The test-stand of the benchmark.
To calculate metrics on particular types of content and verify how models work with different input, we divide each output frame into parts by detecting crosses:
Part 1. “Board” includes a few small objects and photos of people's faces*. We want to see models’ results on textures with small details. The striped fabric and balls of yarn may produce a Moire pattern (see Figure 2). Restoration of people's faces is important for video surveillance.
Part 2. “QR-codes” consists of several QR-codes of different sizes to find the size of the smallest one which can be detected in the models' output frame. In a low-resolution frame, QR-code patterns may be blended. Thus models have difficulties restoring patterns.
Part 3. “Text” includes two types of text: handwriting and a set of symbols. It is hard to cover all of these patterns in a training dataset, so they are new to the model, which still has to restore them.
Part 4. “Metal paper” contains foil that was intensively crumpled. It’s interesting because of light reflection, which strongly but periodically changes between frames.
Part 5. “Color lines” is a printed image with a large number of thin color lines. This is difficult for models because thin lines of similar colors are mixed in a low-resolution frame.
Part 6. “Car numbers” consists of a set of car number plates from different countries and of different sizes**. This content type is important for video surveillance and dashcam development.
Part 7. “Noise” includes difficult noise patterns. Models cannot restore real ground-truth noise and each one produces a unique pattern.
Part 8. “Mira” includes difficult patterns for video restoration: a set of straight and curved lines of different thicknesses and in different directions.
*Photos were generated by .
**Car numbers are generated randomly and printed on paper.
Figure 2. Example of a Moire pattern on the “Board”.
The dataset includes three videos with different types of motion:
- Hand tremor — video shooting from the fixed point without a tripod (the photographer holds the camera in his hands). Because of the natural tremor of hands, there is a random small motion in frames
- Parallel motion — the camera is moving from side to side in parallel with test-stand
- Rotation — the camera is moving from side to side in a half-circle
Technical characteristics of the camera
The dataset was captured with a Canon EOS 7D. We take a fast series of photos and treat it as a sequence of video frames. We store each video as a sequence of frames in PNG format, converted from JPG by FFmpeg. The output of a model is also a sequence of frames, which we compare with the GT sequence of frames to verify the model's performance. Camera settings:
ISO – 4000
aperture – 400
resolution – 5184x3456
- Source video has resolution 5184x3456 and was stored in sRGB color space. Each video’s length is 100 frames.
- Ground-truth. Each source video was downscaled by bicubic interpolation to generate GT with resolution 1920x1280. This is essential because many open-source models don’t have code available for processing a large frame. Processing large frames is also time-consuming.
- Input. The input video was then degraded from GT in two ways: bicubic interpolation (BI), and Gaussian blurring followed by downsampling (BD).
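The BD degradation can be sketched as follows. This is a minimal NumPy illustration, not the benchmark's code: the source does not state the blur parameters, so the Gaussian sigma of 1.6 and the kernel radius are assumptions, and the function names are hypothetical. The scale factor of 4 follows from the 1920x1280 GT and the 480×320 input resolution. BI would use a bicubic resize from an image library instead of the blur.

```python
import numpy as np

def gaussian_kernel(sigma, radius):
    """1-D Gaussian kernel normalized to sum to 1."""
    x = np.arange(-radius, radius + 1, dtype=np.float64)
    k = np.exp(-(x ** 2) / (2.0 * sigma ** 2))
    return k / k.sum()

def blur_axis(img, kernel, axis):
    """Convolve img with a 1-D kernel along one axis (reflect padding)."""
    radius = len(kernel) // 2
    pad = [(0, 0)] * img.ndim
    pad[axis] = (radius, radius)
    padded = np.pad(img, pad, mode="reflect")
    n = img.shape[axis]
    out = np.zeros_like(img, dtype=np.float64)
    for i, w in enumerate(kernel):
        out += w * np.take(padded, np.arange(i, i + n), axis=axis)
    return out

def bd_degrade(gt, scale=4, sigma=1.6):
    """Blur-downsample: separable Gaussian blur, then keep every scale-th pixel.

    sigma=1.6 is an assumed value, not one given by the benchmark.
    """
    radius = int(3 * sigma + 0.5)
    k = gaussian_kernel(sigma, radius)
    blurred = blur_axis(blur_axis(gt.astype(np.float64), k, 0), k, 1)
    return blurred[::scale, ::scale]
```

Applied to a 1280-row by 1920-column frame, `bd_degrade` returns a 320-row by 480-column frame.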
To verify how a model works with noisy data, we prepared noisy counterparts for each input video. To generate realistic noise, we use a Python implementation of the noise model proposed in CBDNet by Guo et al. We need to set two parameters: one for the Poisson part of the noise and another for the Gaussian part of the noise.
To estimate the level of real noise in our camera, we set the camera on a tripod and capture a sequence of 100 frames from a fixed point. Then we average the sequence to estimate a clean image. Subtracting it from each frame gives us a hundred real noise examples. Then we choose parameters for the generated noise so that the distributions of generated and real noise are similar (see Figure 3). Our choice of parameters: sigma_s = 0.001, sigma_c = 0.035.
Figure 3. The distribution of real and generated noise.
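The real-noise estimation step can be sketched as follows. This is an illustration under assumptions: frames are taken to be already loaded as a floating-point array of a static scene, and the function name `estimate_noise` is hypothetical. It covers only the averaging step, not the CBDNet noise generator or the parameter fitting.

```python
import numpy as np

def estimate_noise(frames):
    """Estimate a clean image and per-frame noise from a static sequence.

    frames: array of shape (N, H, W) or (N, H, W, C), all frames showing the
    same scene from a fixed point.
    Returns (clean, noise), where clean is the temporal mean and
    noise[i] = frames[i] - clean.
    """
    frames = np.asarray(frames, dtype=np.float64)
    clean = frames.mean(axis=0)
    noise = frames - clean
    return clean, noise
```

The flattened `noise` array can then be passed to `np.histogram` and compared with a histogram of generated noise over the same bins.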
Finally, we have 12 tests: each of the 3 videos is degraded in 2 ways (BI and BD), with and without noise.
PSNR – a commonly used metric based on pixel similarity. We noticed that a model trained on one degradation type and tested on another can generate frames with a global shift relative to GT (see Figure 4). Thus we check integer shifts in [-3,3] along both axes and choose the shift with the maximal PSNR value. This maximal value is taken as the metric result in our benchmark.
Figure 4. On the left: the same crop from the model’s output and GT frame. On the right: PSNR visualization for this crop.
We chose PSNR-Y because it’s more efficient than PSNR-RGB. Meanwhile, a correlation between these metrics is high. For metric calculation, we use the implementation from skimage.metrics. A higher metric value indicates better quality. The metric value for GT is infinite.
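The shift search for PSNR can be sketched as follows. This is a minimal NumPy illustration, not the benchmark's code: the benchmark uses the implementation from skimage.metrics, whereas here PSNR is written out by hand so that the block depends only on NumPy. The helper names, the BT.601 luma weights and the choice to compare only the region where the shifted frames overlap are assumptions.

```python
import numpy as np

def rgb_to_y(rgb):
    """Luma of an RGB frame in [0, 255]; BT.601 weights are an assumption."""
    rgb = rgb.astype(np.float64)
    return 0.299 * rgb[..., 0] + 0.587 * rgb[..., 1] + 0.114 * rgb[..., 2]

def psnr(a, b, data_range=255.0):
    """Peak signal-to-noise ratio; infinite for identical inputs."""
    mse = np.mean((a - b) ** 2)
    return float("inf") if mse == 0 else float(10.0 * np.log10(data_range ** 2 / mse))

def overlap(out, gt, dy, dx):
    """Crop both frames so that out[y + dy, x + dx] is aligned with gt[y, x]."""
    h, w = gt.shape[:2]
    out_c = out[max(0, dy):h + min(0, dy), max(0, dx):w + min(0, dx)]
    gt_c = gt[max(0, -dy):h + min(0, -dy), max(0, -dx):w + min(0, -dx)]
    return out_c, gt_c

def best_shift_psnr(out_y, gt_y, max_shift=3):
    """Try every integer shift in [-max_shift, max_shift]^2, keep the best PSNR.

    Returns (best_psnr, (dy, dx)).
    """
    best = (-np.inf, (0, 0))
    for dy in range(-max_shift, max_shift + 1):
        for dx in range(-max_shift, max_shift + 1):
            o, g = overlap(out_y, gt_y, dy, dx)
            best = max(best, (psnr(o, g), (dy, dx)))
    return best
```

With the default `max_shift=3` this evaluates 49 candidate shifts per frame.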
SSIM – another commonly used metric, based on structural similarity. A shift of frames can influence this metric too. Thus we tried to find the optimal shift in the same way as for PSNR and noticed that the optimal shifts for these metrics can differ, but by no more than 1 pixel along any axis. Because SSIM is computationally expensive, we search for the optimal shift not among all shifts but only near the optimal shift for PSNR (within 1 pixel along each axis). We calculate SSIM on the Y channel of the YUV color space. For metric calculation, we use the implementation from skimage.metrics. A higher metric value indicates better quality. The metric value for GT is 1.
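The restricted search around the PSNR-optimal shift can be sketched as follows. This is an illustration under assumptions: `refine_shift` and `shifted_pair` are hypothetical names, and the metric is passed in as a callable so that the block depends only on NumPy. In the benchmark's setting the callable would be SSIM on the Y channel, for example `skimage.metrics.structural_similarity` with a suitable `data_range`.

```python
import numpy as np

def shifted_pair(out, gt, dy, dx):
    """Crop both frames so that out[y + dy, x + dx] is aligned with gt[y, x]."""
    h, w = gt.shape[:2]
    out_c = out[max(0, dy):h + min(0, dy), max(0, dx):w + min(0, dx)]
    gt_c = gt[max(0, -dy):h + min(0, -dy), max(0, -dx):w + min(0, -dx)]
    return out_c, gt_c

def refine_shift(metric, out, gt, center, radius=1):
    """Evaluate metric only on shifts within radius of center.

    metric: callable(out_crop, gt_crop) -> float, higher is better.
    center: (dy, dx), e.g. the shift that maximized PSNR.
    Returns (best_value, (dy, dx)).
    """
    cy, cx = center
    scored = []
    for dy in range(cy - radius, cy + radius + 1):
        for dx in range(cx - radius, cx + radius + 1):
            o, g = shifted_pair(out, gt, dy, dx)
            scored.append((metric(o, g), (dy, dx)))
    return max(scored)
```

With `radius=1` only 9 shifts are evaluated instead of the 49 in the full [-3,3] search.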
ERQAv1.0 (Edge Restoration Quality Assessment, version 1.0) estimates how well a model has restored the edges of the high-resolution frame. First, we find edges in both the output and GT frames using the OpenCV implementation of the Canny algorithm. The threshold for the initial detection of strong edges is set to 200, and the threshold for edge linking is set to 100. These thresholds highlight the edges of all objects, even small ones, while skipping unimportant lines (see Figure 5).
Figure 5. An example of edges, highlighted by the chosen algorithm
Then we compare these edges using the F1-score. A one-pixel shift of an edge is not essential for human perception of objects. To compensate for it, we also count as true positives those pixels of the output’s edges that do not lie on a GT edge but are within one pixel of one (see Figure 6). A higher metric value indicates better quality. The metric value for GT is 1.
Figure 6. Visualization of F1-score, used for edges comparison
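The edge comparison with one-pixel tolerance can be sketched as follows. This is a simplified illustration, not the ERQA implementation: it takes binary edge maps as input (in the benchmark these come from OpenCV's Canny detector, e.g. `cv2.Canny(gray, 100, 200)`), and the function names are hypothetical. The source only describes the tolerance for true positives, so applying the same tolerance when counting missed GT edges, and the handling of empty edge maps, are assumptions.

```python
import numpy as np

def dilate3x3(mask):
    """Mark every pixel that is a True pixel or one of its 8 neighbours."""
    mask = np.asarray(mask).astype(bool)
    h, w = mask.shape
    padded = np.pad(mask, 1, mode="constant", constant_values=False)
    out = np.zeros((h, w), dtype=bool)
    for dy in range(3):
        for dx in range(3):
            out |= padded[dy:dy + h, dx:dx + w]
    return out

def edge_f1(out_edges, gt_edges):
    """F1-score between two binary edge maps with one-pixel tolerance."""
    out_e = np.asarray(out_edges).astype(bool)
    gt_e = np.asarray(gt_edges).astype(bool)
    n_out, n_gt = out_e.sum(), gt_e.sum()
    if n_out == 0 and n_gt == 0:
        return 1.0  # assumption: two empty edge maps count as a perfect match
    if n_out == 0 or n_gt == 0:
        return 0.0
    # Output edge pixels lying on or next to a GT edge count as true positives.
    precision = (out_e & dilate3x3(gt_e)).sum() / n_out
    # Assumption: GT edge pixels with an output edge within one pixel are found.
    recall = (gt_e & dilate3x3(out_e)).sum() / n_gt
    if precision + recall == 0:
        return 0.0
    return float(2 * precision * recall / (precision + recall))
```

In this sketch, an output edge displaced by one pixel from the GT edge scores the same as a perfectly aligned one.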
QRCRv1.0 (QR-Codes Restoration, version 1.0) finds the smallest size (in pixels) of a QR-code that can be detected in the output frames of a model. To project metric values onto [0,1], we take the ratio of the smallest detected QR-code sizes in the GT and output frames (see Figure 7). If no QR-code can be detected in the model’s result, the metric value is set to 0. A higher metric value indicates better quality. The metric value for GT is 1.
Figure 7. Example of detected crosses in output and GT frame. The metric value for the output frame is 0.65.
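The ratio behind QRCR can be sketched as follows. This is an illustration only: the function name and the use of `None` for "nothing detected" are assumptions, and QR-code detection itself is left out.

```python
def qrcr(smallest_gt_px, smallest_out_px):
    """Ratio of the smallest detectable QR-code size in GT to that in the output.

    smallest_out_px is None when no QR-code is detected in the output frame.
    """
    if smallest_out_px is None:
        return 0.0
    return smallest_gt_px / smallest_out_px
```

For example, if the smallest QR-code detected in GT is 30 pixels wide and the smallest one detected in the output is 60 pixels wide (hypothetical sizes), the value is 0.5.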
CRRMv1.0 (Colorfulness Reduced-Reference Metric, version 1.0) calculates colorfulness* in both frames and compares them. To calculate colorfulness we use the metric proposed by Hasler et al. The colorfulness levels are compared as the ratio between colorfulness in the GT frame and in the output frame. Then, to project the metric onto [0,1] and penalize both an increase and a decrease in colorfulness, we take the absolute difference between 1 and the ratio and subtract it from 1. A higher metric value indicates better quality. The metric value for GT is 1.
*Colorfulness measures how colorful an image is: if it’s bright and has a lot of different colors.
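The CRRM computation can be sketched as follows. This is a minimal NumPy illustration, not the benchmark's code. The colorfulness formula follows Hasler and Suesstrunk; the direction of the ratio (GT over output) is our reading of the description, and the handling of a zero-colorfulness output is an assumption. The source does not say whether the result is clipped, so the sketch leaves it unclipped.

```python
import numpy as np

def colorfulness(rgb):
    """Hasler-Suesstrunk colorfulness of an RGB frame (channels last)."""
    rgb = rgb.astype(np.float64)
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    rg = r - g                # red-green opponent channel
    yb = 0.5 * (r + g) - b    # yellow-blue opponent channel
    sigma = np.hypot(rg.std(), yb.std())
    mu = np.hypot(rg.mean(), yb.mean())
    return float(sigma + 0.3 * mu)

def crrm(out_rgb, gt_rgb):
    """1 - |1 - C_gt / C_out|; equals 1 when colorfulness is unchanged."""
    c_out, c_gt = colorfulness(out_rgb), colorfulness(gt_rgb)
    if c_out == 0:
        # Assumption: a gray output matches only a gray GT.
        return 1.0 if c_gt == 0 else 0.0
    return 1.0 - abs(1.0 - c_gt / c_out)
```

In this sketch, an output with twice the colorfulness of GT gives a ratio of 0.5 and a value of 0.5, while an output with half the colorfulness gives a ratio of 2 and a value of 0.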
Because each model can work differently on different content types, we consider metric values not only on the full frame but also on parts with different content. To do this we detect crosses in frames and calculate the coordinates of all parts from them.
Crosses in some frames are distorted and cannot be detected. Thus we choose keyframes in which all crosses can be detected, and calculate metrics only on these keyframes. We noticed that metric values on these keyframes are highly correlated, so we take the mean over the keyframes as the final metric value for each test case.
We cut a sequence of 30 frames and convert it to a video at 8 fps with FFmpeg. Then we crop 10 small pieces from the video and conduct a side-by-side subjective comparison of all these pieces on subjectify.us. Each participant sees twenty-five pairs of videos and has to choose the video with better-restored details in each pair (the option “indistinguishable” is also available). Three of these pairs are verification questions*. The remaining answers of participants who passed verification are used to predict the ranking with the Bradley-Terry model.
*Answers to verification questions are not included in the final result.
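Fitting a Bradley-Terry model to pairwise answers can be sketched as follows. This is an illustration, not the procedure used by subjectify.us: it uses the standard iterative (minorization-maximization) update, the function name is hypothetical, and it assumes every model wins at least one comparison. The source does not say how “indistinguishable” answers are counted, so they are not handled here.

```python
import numpy as np

def bradley_terry(wins, iters=200, tol=1e-10):
    """Estimate Bradley-Terry strengths from a pairwise win matrix.

    wins[i, j] = number of times item i was preferred over item j.
    Returns strengths p that sum to 1; P(i beats j) = p[i] / (p[i] + p[j]).
    """
    wins = np.asarray(wins, dtype=np.float64)
    n = wins.shape[0]
    games = wins + wins.T           # comparisons played between each pair
    total_wins = wins.sum(axis=1)
    p = np.full(n, 1.0 / n)
    for _ in range(iters):
        denom = games / (p[:, None] + p[None, :])
        np.fill_diagonal(denom, 0.0)
        new_p = total_wins / denom.sum(axis=1)
        new_p /= new_p.sum()
        converged = np.max(np.abs(new_p - p)) < tol
        p = new_p
        if converged:
            break
    return p
```

Sorting the models by the returned strengths gives the predicted ranking.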
The computational complexity of models
We tested each model using NVIDIA Titan RTX and measured runtime on the same test sequence:
- Test case — parallel motion + BD degradation + with noise
- 100 frames
- Input resolution — 480×320
- Shi Guo, Zifei Yan, Kai Zhang, Wangmeng Zuo and Lei Zhang, "Toward Convolutional Blind Denoising of Real Photographs," 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 1712-1722, doi: 10.1109/CVPR.2019.00181.
- David Hasler and Sabine Suesstrunk, "Measuring Colourfulness in Natural Images," Proceedings of SPIE - The International Society for Optical Engineering, 2003, volume 5007, pp. 87-95, doi: 10.1117/12.477378.