The Methodology of the Subjective QRS Benchmark

Video Sources

To build a representative dataset, we first collect a large pool of videos that meet our technical requirements (length, license, etc.), then draw a representative subsample using clustering and feature-selection methods. Finally, we apply distortions to the selected videos, and the resulting representative set is ready for subjective comparison.

Requirements for Video Sources

  • Only videos with CC-BY or CC0 licenses (which grant redistribution rights).
  • Duration: at least 10 seconds.
  • Frame rate: at least 10 FPS.

Selected Video Sources

  • Vimeo
  • FineVideo
  • YouTube

Video Pre-processing

  • At most two slices are selected from the list of scenes detected with AutoShot.
  • Each slice contains the minimal number of scenes such that its total length is at least 15 seconds.
  • Slices are chosen to be equidistant from each other and from the beginning and end of the video; each final fragment is then trimmed to its central 15 seconds.
  • Encoding (FFmpeg): -vcodec libx264 -pix_fmt yuv420p -crf 0 -preset medium -sn -dn
  • Audio (if present): -acodec aac -b:a 256k
  • Frames are resized so that the smaller side is 1080 pixels while keeping the aspect ratio, yielding final resolutions of 1920×1080 or 1080×1920.
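The encoding and resizing steps above can be sketched as follows. This is a minimal illustration, not the benchmark's actual pipeline: the function names are ours, and the audio option is written as FFmpeg's `-b:a` bitrate flag.

```python
def scaled_size(width: int, height: int, target_small_side: int = 1080) -> tuple[int, int]:
    """Resize so the smaller side equals target_small_side, keeping aspect ratio."""
    scale = target_small_side / min(width, height)
    # round to even dimensions, as yuv420p requires
    return (int(round(width * scale / 2)) * 2, int(round(height * scale / 2)) * 2)

def ffmpeg_args(src: str, dst: str, width: int, height: int, has_audio: bool) -> list[str]:
    """Build the FFmpeg argument list for one fragment (illustrative helper)."""
    w, h = scaled_size(width, height)
    args = ["ffmpeg", "-i", src,
            "-vf", f"scale={w}:{h}",
            "-vcodec", "libx264", "-pix_fmt", "yuv420p",
            "-crf", "0", "-preset", "medium", "-sn", "-dn"]
    if has_audio:
        args += ["-acodec", "aac", "-b:a", "256k"]
    return args + [dst]
```

For example, a 3840×2160 source maps to 1920×1080, while a portrait 1080×1920 source is left unchanged.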

Feature Extraction

To select videos for the representative subsample, we extracted features with a range of methods.

General IQA/VQA Metrics

  • Luminance histogram features: luminance quantiles, dark and bright pixel ratios, noise ratio, and entropy
  • Hasler–Suesstrunk
  • Std-Luminance
  • CLIP-IQA+
  • TOPIQ
  • LAION Aesthetic
  • PaQ-2-PiQ
  • StableVQA
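The luminance-histogram features in the first bullet can be computed with plain NumPy. The sketch below is a minimal illustration: the quantile levels and the dark/bright thresholds are our assumptions, not the benchmark's exact settings, and the noise ratio is omitted since its definition is not specified here.

```python
import numpy as np

def luminance_features(luma: np.ndarray) -> dict:
    """Histogram-based luminance features for one frame (luma values in [0, 255])."""
    flat = luma.astype(np.float64).ravel()
    # luminance quantiles (levels are illustrative)
    q = np.quantile(flat, [0.1, 0.5, 0.9])
    # ratios of dark and bright pixels (thresholds are illustrative)
    dark_ratio = float(np.mean(flat < 16))
    bright_ratio = float(np.mean(flat > 235))
    # entropy of the 256-bin luminance histogram
    hist, _ = np.histogram(flat, bins=256, range=(0, 256))
    p = hist / hist.sum()
    p = p[p > 0]
    entropy = float(-np.sum(p * np.log2(p)))
    return {"q10": q[0], "q50": q[1], "q90": q[2],
            "dark_ratio": dark_ratio, "bright_ratio": bright_ratio,
            "entropy": entropy}
```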

VQMT Metrics

  • SI and TI
  • Blurring
  • Blocking
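SI and TI here follow the standard ITU-T P.910 definitions: SI is the maximum over frames of the standard deviation of the Sobel-filtered luma, and TI is the maximum standard deviation of successive frame differences. A pure-NumPy sketch (assuming that standard definition; VQMT's implementation details may differ):

```python
import numpy as np

def _sobel_mag(frame: np.ndarray) -> np.ndarray:
    """Sobel gradient magnitude on the interior of a luma frame."""
    p = frame.astype(np.float64)
    gx = (p[:-2, 2:] + 2 * p[1:-1, 2:] + p[2:, 2:]
          - p[:-2, :-2] - 2 * p[1:-1, :-2] - p[2:, :-2])
    gy = (p[2:, :-2] + 2 * p[2:, 1:-1] + p[2:, 2:]
          - p[:-2, :-2] - 2 * p[:-2, 1:-1] - p[:-2, 2:])
    return np.hypot(gx, gy)

def si_ti(frames: list) -> tuple:
    """SI/TI per ITU-T P.910: maxima over time of per-frame std values."""
    si = max(_sobel_mag(f).std() for f in frames)
    diffs = [frames[i].astype(np.float64) - frames[i - 1].astype(np.float64)
             for i in range(1, len(frames))]
    ti = max(d.std() for d in diffs) if diffs else 0.0
    return si, ti
```

A flat, static clip yields SI = TI = 0, while any spatial edge or motion drives the respective value up.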

Semantic Features and Labeling

  • YOLOv12-L
  • Places365
  • SigLIP-2
  • InternVL3

Feature-Based Video Clustering

Clustering Techniques

  • Each video is represented by a unified feature vector combining numerical quality metrics, categorical semantic descriptors, and InternVideo embeddings.
  • Numerical features are summarized by the mean and standard deviation of frame-level quality metrics.
  • Categorical features are constructed from annotations using occurrence or confidence-weighted counts, then log-scaled and ℓ2-normalized: \(x_c \leftarrow \dfrac{\log(1+x_c)}{\left\lVert \log(1+x_c)\right\rVert_2 + \varepsilon}\)
  • Embeddings are normalized to unit length: \(x_e \leftarrow \dfrac{x_e}{\left\lVert x_e\right\rVert_2 + \varepsilon}\)
  • All components are concatenated and globally ℓ2-normalized to obtain the final representation x.
  • Clustering: k-means in FAISS with k = 1000 clusters and 50 training iterations, minimizing squared Euclidean distance in the normalized feature space.
  • After clustering, the video closest to each centroid is selected as that cluster's representative, yielding a diverse and balanced subset of videos.
  • The result is a selection of 1000 videos for inference methods and comparison using a subjective quality rating system (QRS).
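The feature-assembly steps above can be sketched in NumPy as follows. Array names are illustrative; the clustering itself uses FAISS's `faiss.Kmeans`, indicated only in a comment here since the normalization is the part the formulas describe.

```python
import numpy as np

EPS = 1e-8  # the ε in the normalization formulas

def build_feature_vector(numeric: np.ndarray,
                         categorical_counts: np.ndarray,
                         embedding: np.ndarray) -> np.ndarray:
    """Combine numeric stats, categorical counts, and an embedding into one vector."""
    # categorical: log-scale, then l2-normalize
    xc = np.log1p(categorical_counts)
    xc = xc / (np.linalg.norm(xc) + EPS)
    # embedding: l2-normalize to unit length
    xe = embedding / (np.linalg.norm(embedding) + EPS)
    # concatenate all components and globally l2-normalize
    x = np.concatenate([numeric, xc, xe])
    return x / (np.linalg.norm(x) + EPS)

# Clustering over the stacked vectors would then be, e.g.:
#   kmeans = faiss.Kmeans(d=features.shape[1], k=1000, niter=50)
#   kmeans.train(features.astype(np.float32))
#   _, assignments = kmeans.index.search(features.astype(np.float32), 1)
```

The final vector has (approximately) unit ℓ2 norm, so the squared Euclidean distances minimized by k-means are monotonically related to cosine similarity.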

Figure: distribution of cluster sizes after k-means clustering.

Technical Specifications

Technical specifications of the system used to run the evaluated methods:

  • CPU AMD EPYC 7532 32-Core Processor
  • RAM 500 GB
  • GPU NVIDIA A100 80 GB


02 Mar 2026
See Also
PSNR and SSIM: application areas and criticism
Learn about limits and applicability of the most popular metrics
Super-Resolution Quality Metrics Benchmark
Discover 50 Super-Resolution Quality Metrics and choose the most appropriate for your videos
Video Colorization Benchmark
Explore the best video colorization algorithms
Video Saliency Prediction Benchmark
Explore the best video saliency prediction (VSP) algorithms
LEHA-CVQAD Video Quality Metrics Benchmark
Explore newest Full- and No-Reference Video Quality Metrics and find the most appropriate for you.