The Methodology of the Subjective QRS Benchmark
Video Sources
Building a representative dataset proceeds in three stages: we first collect a large pool of videos with suitable technical parameters (duration, license, etc.), then draw a representative subsample using clustering and feature-selection methods, and finally apply distortions to the selected videos; the resulting set is then ready for subjective comparison.
Requirements for Video Sources
- Only videos under CC BY or CC0 licenses (permitting redistribution).
- Duration: at least 10 seconds.
- Frame rate: at least 10 FPS.
Selected Video Sources
- Vimeo
- FineVideo
- YouTube
Video Pre-processing
- At most 2 slices are selected from the list of scenes detected with AutoShot[1].
- Each slice contains the minimal number of scenes such that its total length is at least 15 seconds.
- Slices are chosen to be equidistant from each other and from the beginning and end of the video; each final fragment is trimmed to the central 15 seconds.
- Encoding (FFmpeg): -vcodec libx264 -pix_fmt yuv420p -crf 0 -preset medium -sn -dn
- Audio (if present): -acodec aac -b:a 256k
- Frames are resized to 1080 pixels on the smaller side while preserving the aspect ratio, so the final resolution is 1920×1080 (landscape) or 1080×1920 (portrait).
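The trim-and-encode step above can be sketched as a small helper that assembles the FFmpeg command line. This is an illustrative sketch, not the benchmark's actual script: the function name, the scale-filter expression, and the use of `-ss`/`-t` for trimming are assumptions; only the codec, pixel format, CRF, preset, and stream-dropping flags come from the settings listed above.

```python
import shlex

def build_encode_cmd(src, dst, start_s, dur_s=15, has_audio=True):
    """Build an FFmpeg command that trims a 15-second fragment and
    re-encodes it with the benchmark's listed settings (hypothetical helper)."""
    cmd = [
        "ffmpeg", "-y",
        "-ss", str(start_s),   # fragment start, in seconds
        "-t", str(dur_s),      # keep 15 seconds
        "-i", src,
        "-vcodec", "libx264",
        "-pix_fmt", "yuv420p",
        "-crf", "0",           # CRF 0 = lossless H.264
        "-preset", "medium",
        "-sn", "-dn",          # drop subtitle and data streams
        # resize so the smaller side is 1080 px, preserving aspect ratio
        "-vf", "scale='if(gt(iw,ih),-2,1080)':'if(gt(iw,ih),1080,-2)'",
    ]
    if has_audio:
        cmd += ["-acodec", "aac", "-b:a", "256k"]
    cmd.append(dst)
    return cmd

print(shlex.join(build_encode_cmd("clip.mp4", "fragment.mp4", 42.0)))
```

The command is returned as an argument list (suitable for `subprocess.run`) rather than a shell string, which avoids quoting issues with arbitrary file names.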
Feature Extraction
To select videos for the representative subsample, we extracted features using the following methods.
General IQA/VQA Metrics
- Luminance histogram features: luminance quantiles, dark and bright pixel ratios, noise ratio, and entropy
- Hasler–Suesstrunk[2]
- Std-Luminance
- CLIP-IQA+[3]
- TOPIQ[4]
- LAION Aesthetic[5]
- PaQ-2-PiQ[6]
- StableVQA[7]
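The luminance-histogram features listed above (quantiles, dark/bright pixel ratios, entropy) can be computed from an 8-bit luma plane as sketched below. This is a minimal illustration: the dark/bright thresholds and the exact quantile levels are assumptions, not the benchmark's exact values.

```python
import numpy as np

def luminance_features(frame_y, dark_thr=32, bright_thr=224):
    """Histogram-based luminance features for one 8-bit luma frame.
    Thresholds and quantile levels are illustrative."""
    y = np.asarray(frame_y, dtype=np.float64).ravel()
    quantiles = np.quantile(y, [0.05, 0.25, 0.5, 0.75, 0.95])
    dark_ratio = float(np.mean(y <= dark_thr))      # share of dark pixels
    bright_ratio = float(np.mean(y >= bright_thr))  # share of bright pixels
    hist, _ = np.histogram(y, bins=256, range=(0, 256))
    p = hist / max(hist.sum(), 1)
    p = p[p > 0]
    entropy = float(-np.sum(p * np.log2(p)))        # Shannon entropy, bits
    return {"quantiles": quantiles, "dark_ratio": dark_ratio,
            "bright_ratio": bright_ratio, "entropy": entropy}
```

Per-video features are then obtained by aggregating these frame-level values, e.g. by mean and standard deviation as described in the clustering section.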
VQMT Metrics[8]
- SI and TI
- Blurring
- Blocking
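SI and TI follow the standard ITU-T P.910 definitions: SI is the maximum over frames of the standard deviation of the Sobel-filtered luma, and TI is the maximum over frame pairs of the standard deviation of the frame difference. A minimal NumPy sketch, under the assumption that VQMT implements the P.910 formulation:

```python
import numpy as np

def _sobel_mag(y):
    # 3x3 Sobel gradient magnitude on the interior of the frame
    gx = (y[:-2, 2:] + 2 * y[1:-1, 2:] + y[2:, 2:]
          - y[:-2, :-2] - 2 * y[1:-1, :-2] - y[2:, :-2])
    gy = (y[2:, :-2] + 2 * y[2:, 1:-1] + y[2:, 2:]
          - y[:-2, :-2] - 2 * y[:-2, 1:-1] - y[:-2, 2:])
    return np.hypot(gx, gy)

def si_ti(frames):
    """Spatial and temporal information per ITU-T P.910 (sketch).
    frames: sequence of 2-D float luma arrays of equal shape."""
    frames = [np.asarray(f, dtype=np.float64) for f in frames]
    si = max(float(_sobel_mag(f).std()) for f in frames)
    ti = max((float((b - a).std()) for a, b in zip(frames, frames[1:])),
             default=0.0)
    return si, ti
```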
Semantic Features and Labeling
Feature-Based Video Clustering
Clustering Techniques
- Each video is represented by a unified feature vector combining numerical quality metrics, categorical semantic descriptors, and InternVideo embeddings.
- Numerical features are summarized by the mean and standard deviation of frame-level quality metrics.
- Categorical features are constructed from annotations using occurrence or confidence-weighted counts, then log-scaled and ℓ2-normalized: \(x_c \leftarrow \dfrac{\log(1+x_c)}{\left\lVert \log(1+x_c)\right\rVert_2 + \varepsilon}\)
- Embeddings are normalized to unit length: \(x_e \leftarrow \dfrac{x_e}{\left\lVert x_e\right\rVert_2 + \varepsilon}\)
- All components are concatenated and globally ℓ2-normalized to obtain the final representation x.
- Clustering: k-means in FAISS[13] with k = 1000 clusters and 50 training iterations, minimizing squared Euclidean distance in the normalized feature space.
- After clustering, each video is assigned to its nearest centroid; the video with the smallest distance to the centroid is selected as the representative of each cluster, yielding a diverse and balanced subset of videos.
- The result is a selection of 1000 videos used for running the evaluated methods and comparing them via a subjective quality rating system (QRS).
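The feature-assembly and representative-selection steps above can be sketched in NumPy. The function names are hypothetical, and plain NumPy stands in for the FAISS k-means (k = 1000, 50 iterations) used in the benchmark; the normalization pipeline follows the bullets above.

```python
import numpy as np

EPS = 1e-12  # guards against division by zero

def build_feature_vector(num_stats, cat_counts, embedding):
    """Combine the three feature groups into one vector (sketch).
    num_stats: means/stds of frame-level quality metrics;
    cat_counts: occurrence counts of semantic labels;
    embedding: video-level embedding (e.g. InternVideo)."""
    x_n = np.asarray(num_stats, dtype=np.float64)
    x_c = np.log1p(np.asarray(cat_counts, dtype=np.float64))
    x_c = x_c / (np.linalg.norm(x_c) + EPS)   # log-scale + l2-normalize
    x_e = np.asarray(embedding, dtype=np.float64)
    x_e = x_e / (np.linalg.norm(x_e) + EPS)   # unit-length embedding
    x = np.concatenate([x_n, x_c, x_e])
    return x / (np.linalg.norm(x) + EPS)      # global l2 normalization

def pick_representatives(X, centroids, labels):
    """Return, per cluster, the index of the video closest to its centroid
    (squared Euclidean distance, as in the k-means objective)."""
    reps = {}
    for i, (x, c) in enumerate(zip(X, labels)):
        d = float(np.sum((x - centroids[c]) ** 2))
        if c not in reps or d < reps[c][1]:
            reps[c] = (i, d)
    return {c: i for c, (i, d) in reps.items()}
```

With FAISS installed, the NumPy stand-in would be replaced by `faiss.Kmeans(d, 1000, niter=50)` trained on the stacked feature matrix, followed by the same nearest-to-centroid selection.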
Figure: Distribution of cluster sizes after k-means clustering.
Technical Specifications
Technical specifications of the system used to run the evaluated methods:
- CPU: AMD EPYC 7532 32-Core Processor
- RAM: 500 GB
- GPU: NVIDIA A100 80 GB
Subscribe to this benchmark's updates using the form to get notified when the paper becomes available.
- [1] Zhu W., Huang Y., Xie X., Liu W., Deng J., Zhang D., Wang Z., Liu J., “AutoShot: A Short Video Dataset and State-of-the-Art Shot Boundary Detection,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023.
- [2] Hasler D., Süsstrunk S. E., “Measuring Colorfulness in Natural Images,” in Human Vision and Electronic Imaging VIII, 2003.
- [3] Wang J., Chan K. C., Loy C. C., “Exploring CLIP for Assessing the Look and Feel of Images,” in Proceedings of the AAAI Conference on Artificial Intelligence, 2023.
- [4] Chen C., Mo J., Hou J., Wu H., Liao L., Sun W., Yan Q., Lin W., “TOPIQ: A Top-Down Approach From Semantics to Distortions for Image Quality Assessment,” in IEEE Transactions on Image Processing, 2024.
- [5] LAION AI, “LAION Aesthetic Predictor,” github.com/LAION-AI/aesthetic-predictor, 2024.
- [6] Ying Z., Niu H., Gupta P., Mahajan D., Ghadiyaram D., Bovik A., “From Patches to Pictures (PaQ-2-PiQ): Mapping the Perceptual Space of Picture Quality,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020.
- [7] Kou T., Liu X., Sun W., Jia J., Min X., Zhai G., Liu N., “StableVQA: A Deep No-Reference Quality Assessment Model for Video Stability,” in Proceedings of the 31st ACM International Conference on Multimedia, 2023.
- [8] MSU, “MSU Video Quality Measurement Tool,” www.compression.ru/video/quality_measure/video_measurement_tool.html.
- [9] Ultralytics, “Ultralytics YOLO Documentation,” docs.ultralytics.com, 2025.
- [10] Zhou B., Lapedriza A., Khosla A., Oliva A., Torralba A., “Places: A 10 Million Image Database for Scene Recognition,” in IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017.
- [11] Tschannen M., Gritsenko A., Wang X., Naeem M. F., Alabdulmohsin I., Parthasarathy N., Evans T., Beyer L., Xia Y., Mustafa B., Hénaff O., Harmsen J., Steiner A., Zhai X., “SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features,” arXiv preprint arXiv:2502.14786, 2025.
- [12] Zhu J., Wang W., Chen Z., Liu Z., Ye S., Gu L., Tian H., Duan Y., Su W., Shao J., et al., “InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models,” arXiv preprint arXiv:2504.10479, 2025.
- [13] Douze M., Guzhva A., Deng C., Johnson J., Szilvasy G., Mazaré P.-E., Lomeli M., Hosseini L., Jégou H., “The Faiss Library,” arXiv preprint arXiv:2401.08281, 2024.