List of Participants of the Subjective QRS Benchmark
Multimodal Distortion
Several multi-task models were used that are capable of solving several restoration tasks at once, such as Super-Resolution, Deblurring, and Denoising. The desired operating mode was selected by passing the appropriate parameters at launch.
Nonlinear Activation Free Network (NAFNet)
NAFNet[1] is a modern CNN restoration baseline that deliberately simplifies the building block by removing conventional nonlinear activations, relying instead on lightweight gating/attention-style operations inside a U-Net-like framework. NAFNet has ~29M parameters for its main model variant used in the evaluations. It is notable for achieving strong restoration quality with a comparatively simple and efficient architecture competitive with transformer backbones.
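The gating idea can be illustrated with a minimal sketch of NAFNet's SimpleGate, which replaces conventional activations such as ReLU/GELU by splitting the channel dimension in half and multiplying the halves elementwise (a simplified, framework-free rendering, not the reference implementation):

```python
def simple_gate(features):
    """NAFNet-style SimpleGate on a per-pixel channel vector:
    split the channels into two halves and multiply them
    elementwise. This nonlinearity replaces ReLU/GELU."""
    assert len(features) % 2 == 0, "channel count must be even"
    half = len(features) // 2
    return [a * b for a, b in zip(features[:half], features[half:])]

# A 4-channel vector becomes a 2-channel gated output
print(simple_gate([1.0, 2.0, 3.0, 4.0]))  # [3.0, 8.0]
```

Note that the output has half as many channels as the input; in the full network the preceding convolution doubles the channel count so the block preserves width.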
Video Restoration Transformer (VRT)
VRT[2] is a multi-scale video Transformer built from Temporal Mutual Self-Attention (TMSA) plus parallel warping. TMSA operates on short clips (mutual attention for alignment/fusion, self-attention for feature extraction) and uses sequence shifting to enable cross-clip interaction. VRT has 18.3M parameters (reported for a 1280×720 LQ setting), and the authors note that it is comparatively heavy in memory and runtime versus recurrent designs.
Recurrent Video Restoration Transformer (RVRT)
RVRT[3] integrates parallel and recurrent strategies: it processes local clips in parallel inside a globally recurrent framework, and introduces Guided Deformable Attention (GDA) for clip-to-clip alignment/aggregation under flow guidance. RVRT has 13.6M parameters, and the authors note that it achieves a strong trade-off between effectiveness, memory, and speed.
This model can be considered state-of-the-art in the field of Synthetic Video Super-Resolution.
Super-Resolution
Several models for Super-Resolution were used to generate the corresponding distortion class. For models where possible, both 2x and 4x scaling options were used.
Hybrid Attention Transformer (HAT)
HAT[4] has 20.8M parameters while keeping a runtime similar to SwinIR. Architecturally, it is a Transformer SR model that combines window-based attention with additional “hybrid” attention components designed to activate more pixels for reconstruction. It can be viewed as an evolution of the Swin-style restoration Transformer line established by SwinIR.
This model can be considered state-of-the-art in the field of Image Super-Resolution.
Swin2SR
Swin2SR[5] has ~12M parameters and is positioned as a Swin-Transformer-based SR model aimed at strong quality with moderate size. Architecturally it adapts a hierarchical Swin-style backbone (windowed attention with cross-window interaction) for super-resolution, and the authors emphasize performance gains over earlier Transformer baselines in several settings.
BasicVSR++
BasicVSR++[6] is a recurrent VSR framework that redesigns BasicVSR with second-order grid propagation and flow-guided deformable alignment to better propagate and align information over time. BasicVSR++ has 7.3M parameters and is positioned as a top-performing VSR method while keeping a parameter count similar to BasicVSR.
RealBasicVSR
RealBasicVSR[7] has 6.3M parameters and focuses on real-world VSR by adding an input pre-cleaning module in front of a VSR backbone to handle complex degradations. It introduces a training strategy with dynamic refinement to better fit real data distributions and improve robustness.
This model can be considered state-of-the-art in the field of Real-World Video Super-Resolution.
Temporal Modulation Network (TMNet)
TMNet[8] has 12.26M parameters. It integrates a Temporal Modulation Module (TMM) into a video SR network to explicitly modulate temporal information for controllable space-time video super-resolution.
Real-ESRGAN
Real-ESRGAN[9] has 16.7M parameters. Method-wise, Real-ESRGAN extends ESRGAN-style real-world SR training with a high-order degradation modeling pipeline and uses a U-Net discriminator to improve perceptual realism and robustness.
BSRGAN
BSRGAN[10] has 16.7M parameters. BSRGAN is best known for its practical degradation model for real-world SR and serves as a representative real-world SR approach in later comparisons (SwinIR, for example, retrains with the same degradation model for fair comparison). It is commonly used as a real-world SR baseline due to strong qualitative performance under unknown degradations.
SwinIR-Real
SwinIR[11] is a Swin-Transformer-based restoration backbone with shallow feature extraction, deep extraction via Residual Swin Transformer Blocks (RSTB), and an HQ reconstruction head. It explicitly covers real-world SR as one of its evaluated settings. SwinIR has ~897K parameters for its x4 lightweight SR configuration.
Bicubic
Bicubic scaling was used as the Super-Resolution baseline; the FFmpeg implementation of bicubic scaling was used.
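For reference, bicubic scaling interpolates with a cubic convolution kernel; a minimal sketch of the Keys kernel family (FFmpeg's bicubic scaler uses a kernel of this family; a = -0.5, the Catmull-Rom variant, is shown here as a common choice) might look like:

```python
def bicubic_kernel(x, a=-0.5):
    """Keys cubic convolution kernel used in bicubic interpolation.
    a = -0.5 gives the Catmull-Rom variant; the kernel has support
    [-2, 2], so each output pixel blends 4x4 input pixels in 2D."""
    x = abs(x)
    if x < 1:
        return (a + 2) * x**3 - (a + 3) * x**2 + 1
    if x < 2:
        return a * x**3 - 5 * a * x**2 + 8 * a * x - 4 * a
    return 0.0
```

The kernel equals 1 at x = 0, vanishes at all other integers, and its four taps sum to 1 at any fractional offset, so flat regions are reproduced exactly.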
Deblurring & Denoising
Several models for Deblurring & Denoising were used to generate the corresponding distortion classes. For these models, video samples were selected that best cover the domains of each task. For the Deblurring task, videos with lower VQMT Blurring values were selected. For the Denoising task, the TOPIQ NR, PaQ-2-PiQ, and CLIPIQA+ metric values were scaled with MeanStd scaling and averaged; videos with lower resulting values were selected.
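Interpreting MeanStd scaling as per-metric z-score normalization, the selection score could be computed as follows (metric names and values are illustrative; this is a sketch of the described procedure, not the benchmark's actual code):

```python
from statistics import mean, stdev

def meanstd_scale(values):
    """Z-score a list of one metric's values: (v - mean) / std."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

def combined_score(metric_columns):
    """Scale each metric (e.g. TOPIQ NR, PaQ-2-PiQ, CLIPIQA+)
    independently, then average the scaled values per video.
    Videos with lower combined scores go to the Denoising task."""
    scaled = [meanstd_scale(col) for col in metric_columns]
    return [mean(per_video) for per_video in zip(*scaled)]
```

Scaling each metric before averaging keeps one metric's larger numeric range from dominating the combined score.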
CLIPDenoising
CLIPDenoising[12] is a CLIP-based image denoising approach that uses a frozen encoder and a learnable decoder in an encoder–decoder design. CLIPDenoising has ~19.5M parameters total, broken down as ~8.5M (frozen encoder) + ~11.0M (trainable decoder). It is positioned as a strong denoiser leveraging pretrained semantics (via the frozen encoder) while keeping the trainable part relatively lightweight.
This model can be considered state-of-the-art in the field of Generalizable Image Denoising.
Restormer
Restormer[13] is a Transformer backbone for image restoration that emphasizes channel-dimension interactions rather than patch-token spatial mixing. The configuration used has ~26.11M parameters. It is widely used as a strong denoising baseline; later models such as Xformer contrast themselves against it by adding stronger spatial–channel dual-branch modeling.
This model can be considered state-of-the-art in the field of Image Denoising & Deblurring.
PVDNet
PVDNet[14] is a recurrent video deblurring method built around blur-invariant motion estimation plus a pixel-volume representation to aggregate information across frames. The parameter count breaks down into BIMNet (~5.1M) and PVDNet (~5.4M), totaling ~10.5M for the full system. This design targets robust motion handling under blur while keeping the overall model relatively compact.
MIMO-UNet
MIMO-UNet[15] is a single U-shaped deblurring network designed to mimic coarse-to-fine cascades efficiently using multi-scale inputs (multi-input single encoder), multi-scale outputs (multi-output single decoder), and asymmetric feature fusion for merging cross-scale features. The model offers speed/efficiency gains versus stacked coarse-to-fine pipelines while improving accuracy on GoPro/RealBlur. MIMO-UNet has 6.81M parameters.
MPRNet
MPRNet[16] is a multi-stage progressive restoration network: multiple encoder–decoder stages refine the output sequentially, and the final stage uses an ORSNet refinement module built from ORB/CAB blocks. The model varies channel width by task (denoising/deblurring); the main configuration is in the ~20M-parameter range. Architecturally, the key distinguishing features are stage-wise refinement plus task-scaled width and dedicated final refinement.
HINet
HINet[17] introduces a Half Instance Normalization (HIN) block, applying instance normalization to only part of the feature channels to stabilize low-level restoration without over-normalizing. HINet itself is a two-subnetwork multi-stage design, and the paper reports winning the NTIRE 2021 Image Deblurring Challenge – Track 2 (JPEG Artifacts). HINet has 88.7M parameters.
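The HIN idea, instance-normalizing only half of the channels, can be sketched as follows (learnable affine parameters and the surrounding convolutions are omitted; a conceptual illustration, not the reference code):

```python
import numpy as np

def half_instance_norm(x, eps=1e-6):
    """Core of a HIN block on a (C, H, W) feature map:
    instance-normalize the first C/2 channels (per-channel
    mean/var over spatial dims), leave the remaining channels
    untouched, and concatenate the two halves back together."""
    c = x.shape[0] // 2
    first, second = x[:c], x[c:]
    mu = first.mean(axis=(1, 2), keepdims=True)
    var = first.var(axis=(1, 2), keepdims=True)
    normed = (first - mu) / np.sqrt(var + eps)
    return np.concatenate([normed, second], axis=0)
```

Keeping half of the channels un-normalized is the point: it preserves absolute intensity statistics that full instance normalization would wash out.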
DeepRFT
DeepRFT[18] is a plug-and-play FFT–ReLU frequency-selection block that can be inserted into standard residual blocks to inject kernel-level cues into deblurring backbones. In the NAFNet-32 + DeepRFT setup, the model has 17.8M parameters. The paper additionally describes adding a 1×1 convolution to modulate flexible frequency-selection thresholds and learning a joint frequency–spatial (dual-domain) representation.
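The frequency-selection idea can be illustrated as a transform–threshold–inverse branch (the learned 1×1 convolutions of the actual block are omitted; this is a conceptual sketch, not the paper's implementation):

```python
import numpy as np

def fft_relu_branch(x):
    """Toy frequency-selection branch: go to the frequency domain,
    apply ReLU to the real and imaginary parts (which zeroes some
    frequency components), and transform back to the pixel domain."""
    freq = np.fft.rfft2(x)
    freq = np.maximum(freq.real, 0.0) + 1j * np.maximum(freq.imag, 0.0)
    return np.fft.irfft2(freq, s=x.shape)
```

Because blur acts as a (frequency-domain) filter, selecting frequency components gives the network a direct handle on kernel-level information that spatial convolutions capture only indirectly.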
Uformer
Uformer[19] is a U-shaped (encoder–decoder) Transformer built around Locally-enhanced Window self-attention plus a multi-scale restoration modulator that adapts features across decoder scales. Uformer has ~50.9M parameters. Its LeWin block performs non-overlapping window self-attention to reduce complexity on high-resolution feature maps, while the restoration modulator is implemented as a learnable multi-scale spatial bias with marginal overhead.
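The non-overlapping window partitioning that keeps LeWin attention cost linear in image size can be sketched as follows (window size and tensor shapes are illustrative):

```python
import numpy as np

def window_partition(x, w):
    """Split an (H, W, C) feature map into non-overlapping w x w
    windows. Self-attention is then computed inside each window
    independently, so the cost grows linearly with H * W instead
    of quadratically as in global attention."""
    H, W, C = x.shape
    x = x.reshape(H // w, w, W // w, w, C)
    return x.transpose(0, 2, 1, 3, 4).reshape(-1, w, w, C)
```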
MIRNetv2
MIRNetv2[20] is a lightweight multi-scale restoration network designed to keep high-resolution spatial detail while injecting contextual multi-scale features. MIRNetv2 has ~5.9M parameters. The model is often used as a strong and efficient baseline. The core block uses parallel multi-resolution convolution streams with cross-scale information exchange and attention-based multi-scale feature aggregation, and also includes a non-local attention mechanism to capture broader context.
BM3D
BM3D[21] is a classical (non-neural) denoiser with 0 learned parameters: it groups similar patches into 3D stacks and applies collaborative Wiener filtering in a transform domain. It is still a gold-standard baseline for AWGN-style denoising. The algorithm is typically run in two stages, first producing a basic estimate via collaborative hard-thresholding and aggregation, then refining it by regrouping and applying collaborative Wiener filtering guided by the basic estimate.
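The first-stage collaborative filtering can be illustrated with a toy transform-domain hard-thresholding step on a stack of similar patches (real BM3D uses DCT/wavelet transforms, weighted aggregation, and the second Wiener stage described above; the FFT here is purely illustrative):

```python
import numpy as np

def collaborative_hard_threshold(patch_stack, lam):
    """Toy version of BM3D stage 1: transform a 3D group of
    similar patches jointly, zero the small coefficients
    (hard thresholding), and invert. Structure shared across the
    group survives; incoherent noise is suppressed."""
    coeffs = np.fft.fftn(patch_stack)
    coeffs[np.abs(coeffs) < lam] = 0.0
    return np.real(np.fft.ifftn(coeffs))
```

Grouping similar patches is what makes the thresholding effective: the 3D transform concentrates their shared content into few large coefficients, while noise spreads thinly across many small ones.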
References
1. Chen L., “Simple Baselines for Image Restoration,” in ECCV, 2022. GitHub: megvii-research/NAFNet
2. Liang J., “VRT: A Video Restoration Transformer,” arXiv, 2022. GitHub: JingyunLiang/VRT
3. Liang J., “Recurrent Video Restoration Transformer with Guided Deformable Attention,” in NeurIPS, 2022. GitHub: JingyunLiang/RVRT
4. Chen X., “Activating More Pixels in Image Super-Resolution Transformer,” in CVPR, 2023. GitHub: XPixelGroup/HAT
5. Conde M.V., “Swin2SR: SwinV2 Transformer for Compressed Image Super-Resolution and Restoration,” in ECCV Workshops, 2022. GitHub: mv-lab/swin2sr
6. Chan K.C.K., “BasicVSR++: Improving Video Super-Resolution with Enhanced Propagation and Alignment,” in CVPR, 2022. GitHub: ckkelvinchan/BasicVSR_PlusPlus
7. Chan K.C.K., “Investigating Tradeoffs in Real-World Video Super-Resolution,” in CVPR, 2022. GitHub: ckkelvinchan/RealBasicVSR
8. Xu G., “Temporal Modulation Network for Controllable Space-Time Video Super-Resolution,” in CVPR, 2021. GitHub: CS-GangXu/TMNet
9. Wang X., “Real-ESRGAN: Training Real-World Blind Super-Resolution with Pure Synthetic Data,” in ICCVW, 2021. GitHub: xinntao/Real-ESRGAN
10. Zhang K., “Designing a Practical Degradation Model for Deep Blind Image Super-Resolution,” in ICCV, 2021. GitHub: cszn/BSRGAN
11. Liang J., “SwinIR: Image Restoration Using Swin Transformer,” in ICCVW, 2021. GitHub: JingyunLiang/SwinIR
12. Cheng J., “Transfer CLIP for Generalizable Image Denoising,” in CVPR, 2024. GitHub: alwaysuu/CLIPDenoising
13. Zamir S.W., “Restormer: Efficient Transformer for High-Resolution Image Restoration,” in CVPR, 2022. GitHub: swz30/Restormer
14. Son H., “Recurrent Video Deblurring with Blur-Invariant Motion Estimation and Pixel Volumes,” in ACM TOG, 2021. GitHub: codeslake/PVDNet
15. Cho S.-J., “Rethinking Coarse-to-Fine Approach in Single Image Deblurring,” in ICCV, 2021. GitHub: chosj95/MIMO-UNet
16. Zamir S.W., “Multi-Stage Progressive Image Restoration,” in CVPR, 2021. GitHub: swz30/MPRNet
17. Chen L., “HINet: Half Instance Normalization Network for Image Restoration,” in CVPRW, 2021. GitHub: megvii-model/HINet
18. Mao X., “Intriguing Findings of Frequency Selection for Image Deblurring,” in AAAI, 2023. GitHub: DeepMed-Lab-ECNU/DeepRFT-AAAI2023
19. Wang Z., “Uformer: A General U-Shaped Transformer for Image Restoration,” in CVPR, 2022. GitHub: ZhendongWang6/Uformer
20. Zamir S.W., “Learning Enriched Features for Fast Image Restoration and Enhancement,” in TPAMI, 2022. GitHub: swz30/MIRNetv2
21. Dabov K., “Image Denoising by Sparse 3-D Transform-Domain Collaborative Filtering,” in IEEE TIP, 2007. GitHub: gfacciol/bm3d