The Methodology of the CrowdSAL Dataset and Benchmark

All implementations used for these calculations are available in our GitHub repository.
Data Collection
Our dataset consists of 5,000 clips at 1920×1080 (landscape) or 1080×1920 (portrait) resolution. Each video was viewed by more than 75 observers. The final saliency map was estimated as a Gaussian mixture centered at the fixation points, with a Gaussian standard deviation of 57.6 pixels. Under standard viewing conditions, with a viewing distance of 60 cm and a screen width of 35 cm / 1920 px, this value corresponds to approximately 1 degree of visual angle.
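As a sanity check, the 57.6 px standard deviation can be reproduced from the stated viewing geometry (the 60 cm distance and 35 cm / 1920 px screen width are the values given above):

```python
import math

# Assumed values from the text: 60 cm viewing distance,
# 35 cm-wide screen spanning 1920 px.
viewing_distance_cm = 60.0
screen_width_cm = 35.0
screen_width_px = 1920

px_per_cm = screen_width_px / screen_width_cm                 # ~54.9 px/cm
one_degree_cm = viewing_distance_cm * math.tan(math.radians(1.0))
sigma_px = one_degree_cm * px_per_cm                          # ~57.5 px
```

The small difference from 57.6 px comes only from rounding in the reported constants.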

As shown in the figure, our CrowdSAL dataset has a more diverse distribution of motion and spatial information parameters than DHF1K and Hollywood-2, which contain 1,000 and 1,707 videos, respectively. For comparison, we used the VQMT[1] implementation of SI/TI metrics.
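The SI/TI statistics can be sketched in the spirit of ITU-T P.910 (the basis of the VQMT implementation); this sketch substitutes `np.gradient` magnitude for the Sobel filter of the standard, so absolute values will differ slightly:

```python
import numpy as np

def si_ti(frames):
    """Approximate SI/TI of a clip given as a list of 2-D float luma frames.
    SI: max over frames of the std of the gradient magnitude.
    TI: max over frame pairs of the std of the frame difference."""
    si = max(float(np.hypot(*np.gradient(f)).std()) for f in frames)
    ti = max(float((b - a).std()) for a, b in zip(frames, frames[1:]))
    return si, ti
```

A clip of constant frames yields SI = TI = 0; moving edges raise both values.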
Metrics[2]
We compute two metrics using ground-truth Gaussians: Linear Correlation Coefficient (CC) and Similarity (SIM).

\[ \begin{aligned} &\mathrm{CC}(P,Q^{D}) = \frac{\sigma(P,Q^{D})}{\sigma(P) \times \sigma(Q^{D})} \\[4pt] &\mathrm{SIM}(P,Q^{D}) = \sum_i \min(P_i,Q^{D}_i), \quad \sum_i P_i = \sum_i Q^{D}_i = 1. \end{aligned} \]

We also compute two metrics using ground-truth fixations: Normalized Scanpath Saliency (NSS) and Area Under the Curve by Judd (AUC Judd). AUC Judd evaluates a saliency map’s predictive power by how many ground-truth fixations it captures in successive level sets.

\[ \begin{aligned} &\mathrm{NSS}(P,Q^{B}) = \frac{1}{N}\sum_{i} \bar P_i \times Q^{B}_i, \quad \bar P = \frac{P-\mu(P)}{\sigma(P)}, \quad N=\sum_i Q^{B}_i. \\[2pt] &\mathrm{AUC\text{ }Judd}(S,F) = \sum_{k=1}^{K-1} \frac{\mathrm{TPR}_{k+1}+\mathrm{TPR}_k}{2}\, \bigl(\mathrm{FPR}_{k+1}-\mathrm{FPR}_k\bigr), \\[2pt] &\mathrm{TPR}_k = \frac{\sum_i \mathbf{1}[P_i \ge \tau_k]\; Q^{B}_i}{\sum_i Q^{B}_i}, \quad \mathrm{FPR}_k = \frac{\sum_i \mathbf{1}[P_i \ge \tau_k]\,(1-Q^{B}_i)} {\sum_i (1-Q^{B}_i)}. \end{aligned} \]

Notations:
- \(i\) — pixel index,
- \(P\) — predicted saliency map,
- \(Q^{B}_i \in \{0,1\}\) — binary ground truth fixation map,
- \(Q^{D}_i \ge 0\) — continuous ground truth saliency map,
- \(\mu(M)\) — mean value of \(M\),
- \(\sigma(M)\) — standard deviation of \(M\),
- \(\sigma(A,B)\) — covariance between \(A\) and \(B\),
- \(\mathbf{1}[\cdot]\) — indicator function,
- \(K\) — number of distinct thresholds,
- \(\{\tau_k\}_{k=1}^K\) — sorted distinct values of \(P\).
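The four metrics follow directly from these definitions; a minimal NumPy sketch (a reference transcription of the formulas above, not the benchmark code):

```python
import numpy as np

def cc(p, qd):
    """Linear Correlation Coefficient between two saliency maps."""
    p = (p - p.mean()) / p.std()
    q = (qd - qd.mean()) / qd.std()
    return float((p * q).mean())

def sim(p, qd):
    """Similarity: histogram intersection of maps normalized to sum 1."""
    return float(np.minimum(p / p.sum(), qd / qd.sum()).sum())

def nss(p, qb):
    """Normalized Scanpath Saliency at binary fixation locations."""
    pbar = (p - p.mean()) / p.std()
    return float(pbar[qb.astype(bool)].mean())

def auc_judd(p, qb):
    """AUC Judd: trapezoidal area under the TPR/FPR curve, thresholding
    at the sorted distinct values of the prediction."""
    p, qb = p.ravel(), qb.ravel().astype(bool)
    taus = np.sort(np.unique(p))[::-1]   # descending, so FPR grows along the curve
    tpr = [((p >= t) & qb).sum() / qb.sum() for t in taus]
    fpr = [((p >= t) & ~qb).sum() / (~qb).sum() for t in taus]
    xs = np.array([0.0] + fpr)           # close the curve at (0, 0)
    ys = np.array([0.0] + tpr)
    return float(np.sum((ys[1:] + ys[:-1]) / 2 * np.diff(xs)))
```

A map scored against itself gives CC = SIM = 1, and a predictor whose highest values coincide with all fixations gives AUC Judd of 1.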
Optimal Sigmas Selection
The balance between the visible foreground region and the blur applied to the background affects how strongly users are encouraged to move the cursor. We conducted a greedy search on the validation benchmark, varying the diameter of the visible foveal region \(\sigma_{fg}\) and the Gaussian standard deviation of the background blur \(\sigma_{bg}\), both expressed as a percentage of the participant’s screen width. In total, we tried 17 combinations. The mean CC, SIM, and NSS values in the table below indicate that the optimal setting is \(\sigma_{bg} = 0.5\) and \(\sigma_{fg} = 20\). The error intervals are standard deviations.
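A minimal sketch of the foveated rendering described here, assuming a smooth Gaussian falloff for the visible window (the exact window shape used in the study is not specified in the text):

```python
import numpy as np

def _gaussian_blur(img, sigma):
    """Separable Gaussian blur with a truncated kernel (pure NumPy)."""
    r = max(int(3 * sigma), 1)
    x = np.arange(-r, r + 1)
    k = np.exp(-x**2 / (2 * sigma**2))
    k /= k.sum()
    tmp = np.apply_along_axis(np.convolve, 0, img, k, 'same')
    return np.apply_along_axis(np.convolve, 1, tmp, k, 'same')

def foveate(frame, cursor_x, cursor_y, sigma_fg_pct=20.0, sigma_bg_pct=0.5):
    """Blend the sharp frame with a blurred copy around the cursor.
    sigma_fg_pct is the foveal diameter and sigma_bg_pct the blur
    standard deviation, both as a percentage of the screen width."""
    h, w = frame.shape
    sigma_bg = sigma_bg_pct / 100.0 * w
    radius_fg = sigma_fg_pct / 100.0 * w / 2.0        # diameter -> radius, px
    blurred = _gaussian_blur(frame.astype(float), sigma_bg)
    yy, xx = np.mgrid[0:h, 0:w]
    d2 = (xx - cursor_x) ** 2 + (yy - cursor_y) ** 2
    mask = np.exp(-d2 / (2.0 * radius_fg ** 2))       # smooth foveal window
    return mask * frame + (1.0 - mask) * blurred
```

At the cursor the frame stays sharp; far from it, only the blurred background remains.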
Participants Validation
The first post-filtering stage uses honeypot videos: three clips chosen as the most challenging cases for our methodology. Using the validation benchmark, we determined quality thresholds for these videos and excluded participants whose responses fell below them. The distributions of the mean CC on each individual honeypot video show that, after filtering, the mouse-based saliency signal remains aligned with the original attention pattern. Excluded participants were still compensated, but their data were removed from the final dataset. Filtering on the validation videos retained 19,632 participants (~80%).
Frequency Analysis
The second post-filtering stage is based on mouse-movement frequency. The rationale is that a participant may appear to be watching the video while not actively moving the mouse. The distribution shows that most accepted views fall well below 100 Hz. We evaluated several frequency thresholds on the validation benchmark and found that 3 Hz provides the best trade-off, so we discard views recorded below 3 Hz and resample the remaining trajectories to 100 Hz using linear interpolation. Frequency filtering passed 381,960 successful views (~93.7%).
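This filter-and-resample step can be sketched as follows, assuming each view is stored as timestamps in seconds plus an (N, 2) array of cursor positions (a hypothetical storage format, not the dataset's actual one):

```python
import numpy as np

def filter_and_resample(t, xy, min_hz=3.0, target_hz=100.0):
    """Discard views recorded below min_hz; resample the rest to
    target_hz with linear interpolation, per the filtering rule."""
    t, xy = np.asarray(t, float), np.asarray(xy, float)
    rate = (len(t) - 1) / (t[-1] - t[0])   # mean recorded rate, Hz
    if rate < min_hz:
        return None                        # rejected by the frequency filter
    t_new = np.arange(t[0], t[-1], 1.0 / target_hz)
    return np.column_stack([np.interp(t_new, t, xy[:, 0]),
                            np.interp(t_new, t, xy[:, 1])])
```

A 2 Hz view is dropped, while a 10 Hz view is densified to a uniform 100 Hz track.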
Dataset Statistics
The average video length in the CrowdSAL dataset after all transformations is 553 frames (~18.4 seconds). The average time to complete our task in crowdsourcing is less than 15 minutes. In the graphs below, we report statistics for participants who successfully passed all filtering stages.
The accepted participant pool is not dominated by a single gender group. The study mainly reflects adult viewing behavior. Most accepted participants reported normal vision.
The recordings were collected under realistic everyday lighting conditions. The accepted sessions cover a variety of viewing setups. The 1-5 star ratings were collected from participants as post-task feedback after finishing each video.
The accepted data span a broad range of screen resolutions and physical screen sizes. The most common were 1920×1080 px displays with 34×19 cm screens.
Validation with Automatic Models
The table below compares model generalization across external saliency benchmarks using CC, SIM, NSS, and AUC Judd. The Human Infinite row estimates the inter-observer consistency ceiling by comparing one group of human annotations against another: for this estimate, the fixations in each video frame are split into two halves. The table shows that our method outperforms all automatic video saliency models in CC, SIM, and NSS. Moreover, in CC our method is closer to Human Infinite than to the second-best automatic method, and for SIM and NSS it remains noticeably closer to Human Infinite than most automatic methods. This gap indicates that the advantage of our method over automatic models is substantial.
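The split-half estimate behind the Human Infinite row can be sketched as follows, assuming the fixation pool of a frame is given as (x, y) points, the saliency maps are built as Gaussian mixtures as in our methodology, and CC is the comparison metric:

```python
import numpy as np

def cc(a, b):
    """Linear Correlation Coefficient between two maps."""
    a = (a - a.mean()) / a.std()
    b = (b - b.mean()) / b.std()
    return float((a * b).mean())

def saliency_from_fixations(points, shape, sigma):
    """Gaussian-mixture saliency map from (x, y) fixation points,
    evaluated densely (fine for a sketch; real code would blur a map)."""
    yy, xx = np.mgrid[0:shape[0], 0:shape[1]]
    smap = np.zeros(shape)
    for x, y in points:
        smap += np.exp(-((xx - x) ** 2 + (yy - y) ** 2) / (2 * sigma ** 2))
    return smap / smap.sum()

def human_consistency(points, shape, sigma=20.0, seed=0):
    """Randomly split the fixation pool in half and score the halves."""
    points = np.asarray(points, float)
    idx = np.random.default_rng(seed).permutation(len(points))
    a = saliency_from_fixations(points[idx[: len(idx) // 2]], shape, sigma)
    b = saliency_from_fixations(points[idx[len(idx) // 2:]], shape, sigma)
    return cc(a, b)
```

When the two halves come from the same underlying attention pattern, the score approaches 1, which is what makes it a useful ceiling.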
For the benchmark on the CrowdSAL dataset, we report model size (#Parameters) and speed (FPS, frames per second) for all models, measured on a single NVIDIA Tesla A100 80GB GPU.
References
- [1] MSU Video Quality Measurement Tool
- [2] Zoya Bylinskii, Tilke Judd, Aude Oliva, Antonio Torralba, and Frédo Durand. 2018. What do different evaluation metrics tell us about saliency models? IEEE Transactions on Pattern Analysis and Machine Intelligence 41, 3 (2018), 740–757.