The Methodology of the Video Colorization Benchmark

Problem definition

The task of video colorization is to predict the colors of a black-and-white input video. Three groups of algorithms can be distinguished: fully automatic colorization, color propagation, and scribble-based propagation.

We focused mainly on color propagation algorithms, so we provided anchor frames to every architecture that could use them. The number of anchor frames used for each algorithm is listed in the “Participants” section.


Video sources

The videos in our dataset were collected from two sources:

  • Some of the videos were taken from the Vimeo platform
  • Some of the videos were shot specifically for the dataset with an iPhone 13 camera

The video clips were cut manually so that the color propagation methods could work in the domain they were designed for. We minimized the appearance of new objects whose color information was not present in the first anchor frame.

The resolution of our videos is 1080p or 720p.
The clips are 20 to 200 frames long.

For the final dataset, we calculated the SI (spatial information) and TI (temporal information) metrics and compared them with those of the DAVIS test set, which is often used for comparison in video colorization papers.

From the graph we can conclude that our dataset covers the SI range more completely than DAVIS does. Since DAVIS was not created specifically to evaluate video colorization, its TI values are too high for the task. Our visualizations show that the methods already find it difficult to work at the TI values we have chosen. For comparison, we used the VQMT[1] implementation of the SI/TI metrics.
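For readers unfamiliar with SI and TI, the sketch below follows the usual ITU-T P.910 definitions: SI is the maximum over frames of the spatial standard deviation of the Sobel-filtered luma, and TI is the maximum over frames of the standard deviation of the difference between consecutive frames. This is only an illustration under those assumptions; the VQMT implementation used in the benchmark may differ in details such as border handling or bit-depth scaling. The function names are hypothetical.

```python
import numpy as np

def sobel_magnitude(frame):
    """Gradient magnitude from 3x3 Sobel kernels (valid region, no padding)."""
    f = np.asarray(frame, dtype=np.float64)
    # Horizontal gradient: right column minus left column of each 3x3 window.
    gx = (f[:-2, 2:] + 2 * f[1:-1, 2:] + f[2:, 2:]) \
       - (f[:-2, :-2] + 2 * f[1:-1, :-2] + f[2:, :-2])
    # Vertical gradient: bottom row minus top row of each 3x3 window.
    gy = (f[2:, :-2] + 2 * f[2:, 1:-1] + f[2:, 2:]) \
       - (f[:-2, :-2] + 2 * f[:-2, 1:-1] + f[:-2, 2:])
    return np.hypot(gx, gy)

def si_ti(frames):
    """frames: non-empty sequence of 2-D luma arrays. Returns (SI, TI)."""
    si = max(sobel_magnitude(f).std() for f in frames)
    diffs = [
        np.asarray(frames[i], dtype=np.float64)
        - np.asarray(frames[i - 1], dtype=np.float64)
        for i in range(1, len(frames))
    ]
    # A single-frame clip has no temporal information.
    ti = max(d.std() for d in diffs) if diffs else 0.0
    return float(si), float(ti)
```

A static clip gives TI = 0 regardless of its content, while a clip with fast motion or scene changes gives a high TI, which is why high-TI content is hard for methods that propagate color from an anchor frame.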

Full resolution and crops

We performed the evaluation, including the subjective one, on both the original videos at source resolution and on crops. The goal was to find out how resolution affects the perception of colorized videos and whether the methods' performance can be evaluated on the areas with the largest artifacts. To select the crops, we averaged the error maps of the methods and cropped around the highest values. From 36 selected source videos, we obtained a total of 42 crops. The PLCC between the overall subjective ranks on full-resolution videos and on crops is 0.99. This shows that evaluating only on crops is sufficient, which is convenient because of the significantly lower computational cost.
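The crop selection and the correlation check can be sketched roughly as follows. The text does not say how the error maps were computed or how the crop window was placed, so this sketch assumes one 2-D error map per method and centers a fixed-size window on the single highest averaged value, shifted to stay inside the frame. The names select_crop and plcc are hypothetical.

```python
import numpy as np

def select_crop(error_maps, crop_h, crop_w):
    """Return (top, left) of a crop centered on the peak of the mean error map.

    error_maps: list of 2-D arrays of identical shape, one per method.
    Assumption: the crop is centered on the maximum pixel and clamped
    so that it stays inside the frame.
    """
    mean_err = np.mean(np.stack(error_maps), axis=0)
    h, w = mean_err.shape
    y, x = np.unravel_index(np.argmax(mean_err), mean_err.shape)
    top = int(np.clip(y - crop_h // 2, 0, h - crop_h))
    left = int(np.clip(x - crop_w // 2, 0, w - crop_w))
    return top, left

def plcc(a, b):
    """Pearson linear correlation coefficient between two score vectors."""
    return float(np.corrcoef(a, b)[0, 1])
```

With per-method subjective scores from the full-resolution and crop comparisons in two arrays, plcc(full_scores, crop_scores) would give the kind of agreement value reported above.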

The Subjective Comparison

A crowd-sourced subjective comparison with over 2000 valid participants was conducted using Subjectify[2]. Frames from the clips shown to the participants are available in the “Visualizations” sections. Participants were asked to choose the most visually appealing clip in each pair and were told to pay attention to color saturation, unrealistic color bleeding, color washout, and flickering colors.

We used the Bradley–Terry model to calculate subjective quality ranks. In the crops comparison both videos were on the screen at once, while in the full-resolution comparison the videos were displayed sequentially. Both setups ranked the methods almost identically.
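As an illustration of how pairwise votes turn into ranks, here is a minimal Bradley–Terry fit using the standard minorization-maximization update. It is a sketch, not the benchmark's actual pipeline: the input format (a win-count matrix), the function name, and the stopping rule are assumptions, and it expects every method to have won at least one comparison.

```python
import numpy as np

def bradley_terry(wins, iters=1000, tol=1e-10):
    """Estimate Bradley–Terry strengths from pairwise win counts.

    wins[i, j] = number of times method i was preferred over method j.
    Returns strengths normalized to sum to 1 (higher means preferred more).
    """
    wins = np.asarray(wins, dtype=np.float64)
    n = wins.shape[0]
    games = wins + wins.T              # comparisons played by each pair
    total_wins = wins.sum(axis=1)      # total wins of each method
    p = np.full(n, 1.0 / n)            # uniform starting point
    for _ in range(iters):
        denom = games / (p[:, None] + p[None, :])
        np.fill_diagonal(denom, 0.0)   # a method is never compared to itself
        new_p = total_wins / denom.sum(axis=1)
        new_p /= new_p.sum()
        converged = np.max(np.abs(new_p - p)) < tol
        p = new_p
        if converged:
            break
    return p
```

Sorting the methods by the returned strengths gives the subjective ranking; under the model, the probability that method i is preferred over method j is p[i] / (p[i] + p[j]).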

Metrics and runtime


The following metrics were used for evaluation: PSNR, SSIM, LPIPS, ID, Colorfulness, CDC, and WarpError. A more detailed description of each can be found in the section about metrics.
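To give a feel for two of the simpler metrics in this list, the sketch below computes PSNR against a ground-truth frame and a colorfulness score in the form proposed by Hasler and Süsstrunk. Whether the benchmark uses exactly this colorfulness formula, and in which color space and with what per-frame averaging the metrics are computed, is not stated here, so treat both functions as assumptions.

```python
import numpy as np

def psnr(reference, prediction, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two frames of equal shape."""
    ref = np.asarray(reference, dtype=np.float64)
    pred = np.asarray(prediction, dtype=np.float64)
    mse = np.mean((ref - pred) ** 2)
    if mse == 0:
        return float("inf")            # identical frames
    return float(10.0 * np.log10(max_value ** 2 / mse))

def colorfulness(rgb):
    """Hasler-Süsstrunk colorfulness of an H x W x 3 RGB frame."""
    img = np.asarray(rgb, dtype=np.float64)
    r, g, b = img[..., 0], img[..., 1], img[..., 2]
    rg = r - g                         # red-green opponent channel
    yb = 0.5 * (r + g) - b             # yellow-blue opponent channel
    std_root = np.hypot(rg.std(), yb.std())
    mean_root = np.hypot(rg.mean(), yb.mean())
    return float(std_root + 0.3 * mean_root)
```

PSNR needs the original color video as a reference, while colorfulness is computed on the output alone, so a washed-out result can score reasonably on PSNR and still get a low colorfulness value.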


We measured the runtime of each method on a 720p video. Only the inference time was taken into account; the time spent reading and writing frames was not counted. The measurements were performed on an NVIDIA GeForce RTX 3090 graphics card.
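One way to time inference only, with frame reading and writing kept outside the timed region, is sketched below. The helper time_inference and its parameters are hypothetical; the warm-up pass and the optional synchronization callback are assumptions about good practice for GPU timing, not a description of the exact procedure used in the benchmark.

```python
import time

def time_inference(run_model, frames, sync=None, warmup=1):
    """Return (seconds per frame, model output), timing only run_model.

    frames: frames already decoded and loaded in memory.
    sync:   optional callable that blocks until device work is finished
            (for a PyTorch model on a GPU this could be torch.cuda.synchronize).
    warmup: number of untimed passes to exclude one-time setup costs.
    """
    for _ in range(warmup):
        run_model(frames)
    if sync is not None:
        sync()                         # make sure warm-up work has finished
    start = time.perf_counter()
    output = run_model(frames)
    if sync is not None:
        sync()                         # wait for asynchronous GPU kernels
    elapsed = time.perf_counter() - start
    return elapsed / len(frames), output
```

Decoding the video happens before the call and saving the colorized frames happens after it, so neither contributes to the reported time.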


  1. MSU Video Quality Measurement Tool
07 Nov 2023