The Methodology of the Subjective QRS Benchmark

Video Sources

To build a representative dataset, we first collect a large pool of videos that meet our technical requirements (length, license, etc.), then draw a representative subsample using clustering and feature-selection methods. Finally, we apply distortions to the selected videos, and the resulting representative set is ready for subjective comparison.

Requirements for Video Sources

  • Only videos with CC-BY or CC0 licenses (which grant redistribution rights).
  • Duration: at least 10 seconds.
  • Frame rate: at least 10 FPS.

Selected Video Sources

  • Vimeo
  • FineVideo
  • YouTube

Video Pre-processing

  • At most two slices are selected from the list of scenes detected with AutoShot.
  • Each slice contains the minimal number of scenes such that its total length is at least 15 seconds.
  • Slices are chosen to be equidistant from each other and from the beginning and end of the video; each final fragment is then trimmed to its central 15 seconds.
  • Encoding (FFmpeg): -vcodec libx264 -pix_fmt yuv420p -crf 0 -preset medium -sn -dn
  • Audio (if present): -acodec aac -b:a 256k
  • Frames are resized so that the smaller side is 1080 pixels while keeping the aspect ratio, yielding final resolutions of 1920×1080 or 1080×1920.
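The encoding and resizing steps above can be sketched as follows. This is a minimal illustration, not the benchmark's actual pipeline: the function names are ours, and the audio option is written as FFmpeg's `-b:a` bitrate flag.

```python
def scaled_size(width: int, height: int, target_small_side: int = 1080) -> tuple[int, int]:
    """Resize so the smaller side equals target_small_side, keeping aspect ratio."""
    scale = target_small_side / min(width, height)
    # round to even dimensions, as yuv420p requires
    return (int(round(width * scale / 2)) * 2, int(round(height * scale / 2)) * 2)

def ffmpeg_args(src: str, dst: str, width: int, height: int, has_audio: bool) -> list[str]:
    """Build the FFmpeg argument list for one fragment (illustrative helper)."""
    w, h = scaled_size(width, height)
    args = ["ffmpeg", "-i", src,
            "-vf", f"scale={w}:{h}",
            "-vcodec", "libx264", "-pix_fmt", "yuv420p",
            "-crf", "0", "-preset", "medium", "-sn", "-dn"]
    if has_audio:
        args += ["-acodec", "aac", "-b:a", "256k"]
    return args + [dst]
```

For example, a 3840×2160 source maps to 1920×1080, while a portrait 1080×1920 source is left unchanged.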

Feature Extraction

To select videos for the representative subsample, we extracted features with a range of methods.

General IQA/VQA Metrics

  • Luminance histogram features: luminance quantiles, dark and bright pixel ratios, noise ratio, and entropy
  • Hasler–Suesstrunk
  • Std-Luminance
  • CLIP-IQA+
  • TOPIQ
  • LAION Aesthetic
  • PaQ-2-PiQ
  • StableVQA
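The luminance-histogram features in the first bullet can be computed with plain NumPy. The sketch below is a minimal illustration: the quantile levels and the dark/bright thresholds are our assumptions, not the benchmark's exact settings, and the noise ratio is omitted since its definition is not specified here.

```python
import numpy as np

def luminance_features(luma: np.ndarray) -> dict:
    """Histogram-based luminance features for one frame (luma values in [0, 255])."""
    flat = luma.astype(np.float64).ravel()
    # luminance quantiles (levels are illustrative)
    q = np.quantile(flat, [0.1, 0.5, 0.9])
    # ratios of dark and bright pixels (thresholds are illustrative)
    dark_ratio = float(np.mean(flat < 16))
    bright_ratio = float(np.mean(flat > 235))
    # entropy of the 256-bin luminance histogram
    hist, _ = np.histogram(flat, bins=256, range=(0, 256))
    p = hist / hist.sum()
    p = p[p > 0]
    entropy = float(-np.sum(p * np.log2(p)))
    return {"q10": q[0], "q50": q[1], "q90": q[2],
            "dark_ratio": dark_ratio, "bright_ratio": bright_ratio,
            "entropy": entropy}
```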

VQMT Metrics

  • SI and TI
  • Blurring
  • Blocking
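SI and TI here follow the standard ITU-T P.910 definitions: SI is the maximum over frames of the standard deviation of the Sobel-filtered luma, and TI is the maximum standard deviation of successive frame differences. A pure-NumPy sketch (assuming that standard definition; VQMT's implementation details may differ):

```python
import numpy as np

def _sobel_mag(frame: np.ndarray) -> np.ndarray:
    """Sobel gradient magnitude on the interior of a luma frame."""
    p = frame.astype(np.float64)
    gx = (p[:-2, 2:] + 2 * p[1:-1, 2:] + p[2:, 2:]
          - p[:-2, :-2] - 2 * p[1:-1, :-2] - p[2:, :-2])
    gy = (p[2:, :-2] + 2 * p[2:, 1:-1] + p[2:, 2:]
          - p[:-2, :-2] - 2 * p[:-2, 1:-1] - p[:-2, 2:])
    return np.hypot(gx, gy)

def si_ti(frames: list) -> tuple:
    """SI/TI per ITU-T P.910: maxima over time of per-frame std values."""
    si = max(_sobel_mag(f).std() for f in frames)
    diffs = [frames[i].astype(np.float64) - frames[i - 1].astype(np.float64)
             for i in range(1, len(frames))]
    ti = max(d.std() for d in diffs) if diffs else 0.0
    return si, ti
```

A flat, static clip yields SI = TI = 0, while any spatial edge or motion drives the respective value up.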

Semantic Features and Labeling

  • YOLOv12-L
  • Places365
  • SigLIP-2
  • InternVL3

Feature-Based Video Clustering

Clustering Techniques

  • Each video is represented by a unified feature vector combining numerical quality metrics, categorical semantic descriptors, and InternVideo embeddings.
  • Numerical features are summarized by the mean and standard deviation of frame-level quality metrics.
  • Categorical features are constructed from annotations using occurrence or confidence-weighted counts, then log-scaled and ℓ2-normalized: \(x_c \leftarrow \dfrac{\log(1+x_c)}{\left\lVert \log(1+x_c)\right\rVert_2 + \varepsilon}\)
  • Embeddings are normalized to unit length: \(x_e \leftarrow \dfrac{x_e}{\left\lVert x_e\right\rVert_2 + \varepsilon}\)
  • All components are concatenated and globally ℓ2-normalized to obtain the final representation x.
  • Clustering: k-means in FAISS with k = 1000 clusters and 50 training iterations, minimizing squared Euclidean distance in the normalized feature space.
  • After clustering, the video closest to each centroid is selected as that cluster's representative, yielding a diverse and balanced subset of videos.
  • The result is a selection of 1000 videos for inference methods and comparison using a subjective quality rating system (QRS).
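The feature-assembly steps above can be sketched in NumPy as follows. Array names are illustrative; the clustering itself uses FAISS's `faiss.Kmeans`, indicated only in a comment here since the normalization is the part the formulas describe.

```python
import numpy as np

EPS = 1e-8  # the ε in the normalization formulas

def build_feature_vector(numeric: np.ndarray,
                         categorical_counts: np.ndarray,
                         embedding: np.ndarray) -> np.ndarray:
    """Combine numeric stats, categorical counts, and an embedding into one vector."""
    # categorical: log-scale, then l2-normalize
    xc = np.log1p(categorical_counts)
    xc = xc / (np.linalg.norm(xc) + EPS)
    # embedding: l2-normalize to unit length
    xe = embedding / (np.linalg.norm(embedding) + EPS)
    # concatenate all components and globally l2-normalize
    x = np.concatenate([numeric, xc, xe])
    return x / (np.linalg.norm(x) + EPS)

# Clustering over the stacked vectors would then be, e.g.:
#   kmeans = faiss.Kmeans(d=features.shape[1], k=1000, niter=50)
#   kmeans.train(features.astype(np.float32))
#   _, assignments = kmeans.index.search(features.astype(np.float32), 1)
```

The final vector has (approximately) unit ℓ2 norm, so the squared Euclidean distances minimized by k-means are monotonically related to cosine similarity.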

Figure: distribution of cluster sizes after k-means clustering.

Technical Specifications

Technical specifications of the system used to run the evaluated methods:

  • CPU AMD EPYC 7532 32-Core Processor
  • RAM 500 GB
  • GPU NVIDIA A100 80 GB


02 Mar 2026
See Also
PSNR and SSIM: application areas and criticism
Learn about limits and applicability of the most popular metrics
Super-Resolution Quality Metrics Benchmark
Discover 50 Super-Resolution Quality Metrics and choose the most appropriate for your videos
Video Colorization Benchmark
Explore the best video colorization algorithms
Video Saliency Prediction Benchmark
Explore the best video saliency prediction (VSP) algorithms
LEHA-CVQAD Video Quality Metrics Benchmark
Explore newest Full- and No-Reference Video Quality Metrics and find the most appropriate for you.