List of Participants of the Subjective QRS Benchmark
Multimodal Distortion
Several multi-task models were used that are capable of solving several restoration tasks at once, such as Super-Resolution, Deblurring, and Denoising. The desired operating mode was selected by passing the appropriate parameters at launch.
Nonlinear Activation Free Network (NAFNet)
NAFNet[1] is a modern CNN restoration baseline that deliberately simplifies the building block by removing conventional nonlinear activations, relying instead on lightweight gating/attention-style operations inside a U-Net-like framework. NAFNet has ~29M parameters for its main model variant used in the evaluations. It is notable for achieving strong restoration quality with a comparatively simple and efficient architecture competitive with transformer backbones.
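The gating idea can be illustrated with a minimal sketch of NAFNet's SimpleGate, which replaces conventional activations such as ReLU/GELU by splitting the channel dimension in half and multiplying the halves elementwise (a simplified, framework-free rendering, not the reference implementation):

```python
def simple_gate(features):
    """NAFNet-style SimpleGate on a per-pixel channel vector:
    split the channels into two halves and multiply them
    elementwise. This nonlinearity replaces ReLU/GELU."""
    assert len(features) % 2 == 0, "channel count must be even"
    half = len(features) // 2
    return [a * b for a, b in zip(features[:half], features[half:])]

# A 4-channel vector becomes a 2-channel gated output
print(simple_gate([1.0, 2.0, 3.0, 4.0]))  # [3.0, 8.0]
```

Note that the output has half as many channels as the input; in the full network the preceding convolution doubles the channel count so the block preserves width.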
Video Restoration Transformer (VRT)
VRT[2] is a multi-scale video Transformer built from Temporal Mutual Self-Attention (TMSA) plus parallel warping. TMSA operates on short clips (mutual attention for alignment/fusion, self-attention for feature extraction) and uses sequence shifting to enable cross-clip interaction. VRT has 18.3M parameters (reported for a 1280×720 LQ setting), and the authors note that it is comparatively heavy in memory and runtime versus recurrent designs.
Recurrent Video Restoration Transformer (RVRT)
RVRT[3] integrates parallel and recurrent strategies: it processes local clips in parallel inside a globally recurrent framework, and introduces Guided Deformable Attention (GDA) for clip-to-clip alignment/aggregation under flow guidance. RVRT has 13.6M parameters, and the authors note that it achieves a strong trade-off between effectiveness, memory, and speed.
This model can be considered state-of-the-art in the field of Synthetic Video Super-Resolution.
Super-Resolution
Several models for Super-Resolution were used to generate the corresponding distortion class. For models where possible, both 2x and 4x scaling options were used.
Hybrid Attention Transformer (HAT)
HAT[4] has 20.8M parameters while keeping a runtime similar to SwinIR. Architecturally, it is a Transformer SR model that combines window-based attention with additional “hybrid” attention components designed to activate more pixels for reconstruction. It can be viewed as an evolution of the Swin-style restoration Transformer line established by SwinIR.
This model can be considered state-of-the-art in the field of Image Super-Resolution.
Swin2SR
Swin2SR[5] has ~12M parameters and is positioned as a Swin-Transformer-based SR model aimed at strong quality with moderate size. Architecturally it adapts a hierarchical Swin-style backbone (windowed attention with cross-window interaction) for super-resolution, and the authors emphasize performance gains over earlier Transformer baselines in several settings.
BasicVSR++
BasicVSR++[6] is a recurrent VSR framework that redesigns BasicVSR with second-order grid propagation and flow-guided deformable alignment to better propagate and align information over time. BasicVSR++ has 7.3M parameters and is positioned as a top-performing VSR method while keeping a parameter count similar to BasicVSR.
RealBasicVSR
RealBasicVSR[7] has 6.3M parameters and focuses on real-world VSR by adding an input pre-cleaning module in front of a VSR backbone to handle complex degradations. It introduces a training strategy with dynamic refinement to better fit real data distributions and improve robustness.
This model can be considered state-of-the-art in the field of Real-World Video Super-Resolution.
Temporal Modulation Network (TMNet)
TMNet[8] has 12.26M parameters. It integrates a Temporal Modulation Module (TMM) into a video SR network to explicitly modulate temporal information for controllable space-time video super-resolution.
Real-ESRGAN
Real-ESRGAN[9] has 16.7M parameters. Method-wise, Real-ESRGAN extends ESRGAN-style real-world SR training with a high-order degradation modeling pipeline and uses a U-Net discriminator to improve perceptual realism and robustness.
BSRGAN
BSRGAN[10] has 16.7M parameters. BSRGAN is best known for its practical degradation model for real-world SR and serves as a representative real-world SR approach in later comparisons (SwinIR, for example, retrains with the same degradation model for fair comparison). It is commonly used as a real-world SR baseline due to strong qualitative performance under unknown degradations.
SwinIR-Real
SwinIR[11] is a Swin-Transformer-based restoration backbone with shallow feature extraction, deep extraction via Residual Swin Transformer Blocks (RSTB), and an HQ reconstruction head. It explicitly covers real-world SR as one of its evaluated settings. SwinIR has ~897K parameters for its x4 lightweight SR configuration.
Bicubic
Bicubic scaling was used as the Super-Resolution baseline; the FFmpeg implementation of bicubic scaling was used.
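For reference, bicubic scaling interpolates with a cubic convolution kernel; a minimal sketch of the Keys kernel family (FFmpeg's bicubic scaler uses a kernel of this family; a = -0.5, the Catmull-Rom variant, is shown here as a common choice) might look like:

```python
def bicubic_kernel(x, a=-0.5):
    """Keys cubic convolution kernel used in bicubic interpolation.
    a = -0.5 gives the Catmull-Rom variant; the kernel has support
    [-2, 2], so each output pixel blends 4x4 input pixels in 2D."""
    x = abs(x)
    if x < 1:
        return (a + 2) * x**3 - (a + 3) * x**2 + 1
    if x < 2:
        return a * x**3 - 5 * a * x**2 + 8 * a * x - 4 * a
    return 0.0
```

The kernel equals 1 at x = 0, vanishes at all other integers, and its four taps sum to 1 at any fractional offset, so flat regions are reproduced exactly.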
Deblurring & Denoising
Several models for Deblurring & Denoising were used to generate the corresponding distortion classes. For these models, video samples were selected that best cover the domains of each task. For the Deblurring task, videos with lower VQMT Blurring values were selected. For the Denoising task, the TOPIQ NR, PaQ-2-PiQ, and CLIPIQA+ metric values were scaled with MeanStd scaling and averaged; videos with lower resulting values were selected.
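Interpreting MeanStd scaling as per-metric z-score normalization, the selection score could be computed as follows (metric names and values are illustrative; this is a sketch of the described procedure, not the benchmark's actual code):

```python
from statistics import mean, stdev

def meanstd_scale(values):
    """Z-score a list of one metric's values: (v - mean) / std."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

def combined_score(metric_columns):
    """Scale each metric (e.g. TOPIQ NR, PaQ-2-PiQ, CLIPIQA+)
    independently, then average the scaled values per video.
    Videos with lower combined scores go to the Denoising task."""
    scaled = [meanstd_scale(col) for col in metric_columns]
    return [mean(per_video) for per_video in zip(*scaled)]
```

Scaling each metric before averaging keeps one metric's larger numeric range from dominating the combined score.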
CLIPDenoising
CLIPDenoising[12] is a CLIP-based image denoising approach that uses a frozen encoder and a learnable decoder in an encoder–decoder design. CLIPDenoising has ~19.5M parameters total, broken down as ~8.5M (frozen encoder) + ~11.0M (trainable decoder). It is positioned as a strong denoiser leveraging pretrained semantics (via the frozen encoder) while keeping the trainable part relatively lightweight.
This model can be considered state-of-the-art in the field of Generalizable Image Denoising.
Restormer
Restormer[13] is a Transformer backbone for image restoration that emphasizes channel-dimension interactions rather than patch-token spatial mixing. The configuration used has ~26.11M parameters. It is widely used as a strong denoising baseline; later models such as Xformer contrast themselves against it by adding stronger spatial–channel dual-branch modeling.
This model can be considered state-of-the-art in the field of Image Denoising & Deblurring.
PVDNet
PVDNet[14] is a recurrent video deblurring method built around blur-invariant motion estimation plus a pixel-volume representation to aggregate information across frames. The parameter count breaks down into BIMNet (~5.1M) and PVDNet (~5.4M), totaling ~10.5M for the full system. This design targets robust motion handling under blur while keeping the overall model relatively compact.
MIMO-UNet
MIMO-UNet[15] is a single U-shaped deblurring network designed to mimic coarse-to-fine cascades efficiently using multi-scale inputs (multi-input single encoder), multi-scale outputs (multi-output single decoder), and asymmetric feature fusion for merging cross-scale features. The model offers speed/efficiency gains versus stacked coarse-to-fine pipelines while improving accuracy on GoPro/RealBlur. MIMO-UNet has 6.81M parameters.
MPRNet
MPRNet[16] is a multi-stage progressive restoration network: multiple encoder–decoder stages refine the output sequentially, and the final stage uses an ORSNet refinement module built from ORB/CAB blocks. The model varies channel width by task (denoising/deblurring); the main configuration is in the ~20M-parameter range. Architecturally, the key distinguishing features are stage-wise refinement plus task-scaled width and dedicated final refinement.
HINet
HINet[17] introduces a Half Instance Normalization (HIN) block, applying instance normalization to only part of the feature channels to stabilize low-level restoration without over-normalizing. HINet itself is a two-subnetwork multi-stage design, and the paper reports winning the NTIRE 2021 Image Deblurring Challenge – Track 2 (JPEG Artifacts). HINet has 88.7M parameters.
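The HIN idea, instance-normalizing only half of the channels, can be sketched as follows (learnable affine parameters and the surrounding convolutions are omitted; a conceptual illustration, not the reference code):

```python
import numpy as np

def half_instance_norm(x, eps=1e-6):
    """Core of a HIN block on a (C, H, W) feature map:
    instance-normalize the first C/2 channels (per-channel
    mean/var over spatial dims), leave the remaining channels
    untouched, and concatenate the two halves back together."""
    c = x.shape[0] // 2
    first, second = x[:c], x[c:]
    mu = first.mean(axis=(1, 2), keepdims=True)
    var = first.var(axis=(1, 2), keepdims=True)
    normed = (first - mu) / np.sqrt(var + eps)
    return np.concatenate([normed, second], axis=0)
```

Keeping half of the channels un-normalized is the point: it preserves absolute intensity statistics that full instance normalization would wash out.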
DeepRFT
DeepRFT[18] is a plug-and-play FFT–ReLU frequency-selection block that can be inserted into standard residual blocks to inject kernel-level cues into deblurring backbones. In the NAFNet-32 + DeepRFT setup, the model has 17.8M parameters. The paper additionally describes adding a 1×1 convolution to modulate flexible frequency-selection thresholds and learning a joint frequency–spatial (dual-domain) representation.
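The frequency-selection idea can be illustrated as a transform–threshold–inverse branch (the learned 1×1 convolutions of the actual block are omitted; this is a conceptual sketch, not the paper's implementation):

```python
import numpy as np

def fft_relu_branch(x):
    """Toy frequency-selection branch: go to the frequency domain,
    apply ReLU to the real and imaginary parts (which zeroes some
    frequency components), and transform back to the pixel domain."""
    freq = np.fft.rfft2(x)
    freq = np.maximum(freq.real, 0.0) + 1j * np.maximum(freq.imag, 0.0)
    return np.fft.irfft2(freq, s=x.shape)
```

Because blur acts as a (frequency-domain) filter, selecting frequency components gives the network a direct handle on kernel-level information that spatial convolutions capture only indirectly.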
Uformer
Uformer[19] is a U-shaped (encoder–decoder) Transformer built around Locally-enhanced Window self-attention plus a multi-scale restoration modulator that adapts features across decoder scales. Uformer has ~50.9M parameters. Its LeWin block performs non-overlapping window self-attention to reduce complexity on high-resolution feature maps, while the restoration modulator is implemented as a learnable multi-scale spatial bias with marginal overhead.
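The non-overlapping window partitioning that keeps LeWin attention cost linear in image size can be sketched as follows (window size and tensor shapes are illustrative):

```python
import numpy as np

def window_partition(x, w):
    """Split an (H, W, C) feature map into non-overlapping w x w
    windows. Self-attention is then computed inside each window
    independently, so the cost grows linearly with H * W instead
    of quadratically as in global attention."""
    H, W, C = x.shape
    x = x.reshape(H // w, w, W // w, w, C)
    return x.transpose(0, 2, 1, 3, 4).reshape(-1, w, w, C)
```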
MIRNetv2
MIRNetv2[20] is a lightweight multi-scale restoration network designed to keep high-resolution spatial detail while injecting contextual multi-scale features. MIRNetv2 has ~5.9M parameters. The model is often used as a strong and efficient baseline. The core block uses parallel multi-resolution convolution streams with cross-scale information exchange and attention-based multi-scale feature aggregation, and also includes a non-local attention mechanism to capture broader context.
BM3D
BM3D[21] is a classical (non-neural) denoiser with 0 learned parameters: it groups similar patches into 3D stacks and applies collaborative Wiener filtering in a transform domain. It is still a gold-standard baseline for AWGN-style denoising. The algorithm is typically run in two stages, first producing a basic estimate via collaborative hard-thresholding and aggregation, then refining it by regrouping and applying collaborative Wiener filtering guided by the basic estimate.
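The first-stage collaborative filtering can be illustrated with a toy transform-domain hard-thresholding step on a stack of similar patches (real BM3D uses DCT/wavelet transforms, weighted aggregation, and the second Wiener stage described above; the FFT here is purely illustrative):

```python
import numpy as np

def collaborative_hard_threshold(patch_stack, lam):
    """Toy version of BM3D stage 1: transform a 3D group of
    similar patches jointly, zero the small coefficients
    (hard thresholding), and invert. Structure shared across the
    group survives; incoherent noise is suppressed."""
    coeffs = np.fft.fftn(patch_stack)
    coeffs[np.abs(coeffs) < lam] = 0.0
    return np.real(np.fft.ifftn(coeffs))
```

Grouping similar patches is what makes the thresholding effective: the 3D transform concentrates their shared content into few large coefficients, while noise spreads thinly across many small ones.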
References
1. Chen L., “Simple Baselines for Image Restoration,” in ECCV, 2022. GitHub: megvii-research/NAFNet
2. Liang J., “VRT: A Video Restoration Transformer,” arXiv, 2022. GitHub: JingyunLiang/VRT
3. Liang J., “Recurrent Video Restoration Transformer with Guided Deformable Attention,” in NeurIPS, 2022. GitHub: JingyunLiang/RVRT
4. Chen X., “Activating More Pixels in Image Super-Resolution Transformer,” in CVPR, 2023. GitHub: XPixelGroup/HAT
5. Conde M.V., “Swin2SR: SwinV2 Transformer for Compressed Image Super-Resolution and Restoration,” in ECCV Workshops, 2022. GitHub: mv-lab/swin2sr
6. Chan K.C.K., “BasicVSR++: Improving Video Super-Resolution with Enhanced Propagation and Alignment,” in CVPR, 2022. GitHub: ckkelvinchan/BasicVSR_PlusPlus
7. Chan K.C.K., “Investigating Tradeoffs in Real-World Video Super-Resolution,” in CVPR, 2022. GitHub: ckkelvinchan/RealBasicVSR
8. Xu G., “Temporal Modulation Network for Controllable Space-Time Video Super-Resolution,” in CVPR, 2021. GitHub: CS-GangXu/TMNet
9. Wang X., “Real-ESRGAN: Training Real-World Blind Super-Resolution with Pure Synthetic Data,” in ICCVW, 2021. GitHub: xinntao/Real-ESRGAN
10. Zhang K., “Designing a Practical Degradation Model for Deep Blind Image Super-Resolution,” in ICCV, 2021. GitHub: cszn/BSRGAN
11. Liang J., “SwinIR: Image Restoration Using Swin Transformer,” in ICCVW, 2021. GitHub: JingyunLiang/SwinIR
12. Cheng J., “Transfer CLIP for Generalizable Image Denoising,” in CVPR, 2024. GitHub: alwaysuu/CLIPDenoising
13. Zamir S.W., “Restormer: Efficient Transformer for High-Resolution Image Restoration,” in CVPR, 2022. GitHub: swz30/Restormer
14. Son H., “Recurrent Video Deblurring with Blur-Invariant Motion Estimation and Pixel Volumes,” in ACM TOG, 2021. GitHub: codeslake/PVDNet
15. Cho S.-J., “Rethinking Coarse-to-Fine Approach in Single Image Deblurring,” in ICCV, 2021. GitHub: chosj95/MIMO-UNet
16. Zamir S.W., “Multi-Stage Progressive Image Restoration,” in CVPR, 2021. GitHub: swz30/MPRNet
17. Chen L., “HINet: Half Instance Normalization Network for Image Restoration,” in CVPRW, 2021. GitHub: megvii-model/HINet
18. Mao X., “Intriguing Findings of Frequency Selection for Image Deblurring,” in AAAI, 2023. GitHub: DeepMed-Lab-ECNU/DeepRFT-AAAI2023
19. Wang Z., “Uformer: A General U-Shaped Transformer for Image Restoration,” in CVPR, 2022. GitHub: ZhendongWang6/Uformer
20. Zamir S.W., “Learning Enriched Features for Fast Image Restoration and Enhancement,” in TPAMI, 2022. GitHub: swz30/MIRNetv2
21. Dabov K., “Image Denoising by Sparse 3-D Transform-Domain Collaborative Filtering,” in IEEE TIP, 2007. GitHub: gfacciol/bm3d