The Methodology of the MSU Video Saliency Prediction Benchmark
Data Collection
Our dataset consists of 41 clips and contains:
 28 movie fragment clips
 10 sport stream clips
 3 live caption clips
The resolution of our clips is 1920×1080.
The duration of the clips is from 13 to 38 seconds.
50 observers (19–24 y. o.) watched each video with crossfade by adding a black frame between scenes in freeview task mode and the eyetracker collected their fixations. Crossfade ensures the independence of the received fixations between different clips. The SMI iViewXTM HiSpeed 1250 with a frequency of 500 Hz (~20 fixations per frame) was used.
The final groundtruth saliency map was estimated as a Gaussian mixture with centers at the fixation points. A standard deviation for the Gaussians equal to 120 was chosen (this value matches 8 angular degrees, which is known to be the sector of sharp vision).
According to the graph, we can conclude that our dataset represents a domain of a different distribution with a diverse amount of movement on evenly spaced scenes. For comparison, we used the VQMT implementation of SI/TI metrics.
Metrics
We hide these similar models by default, and you can view them using checkboxes.
We compute 3 metrics using Ground Truth Gaussians: Similarity (SIM), Linear Correlation Coefficient (CC)
and KullbackLeibler divergence (KLDiv).
Also, we compute 2 metrics using Ground Truth fixations: Normalized Scanpath Saliency (NSS)
and Area Under the Curve — Judd (AUC Judd): evaluates a saliency map’s predictive power by how many ground truth fixations it captures in successive level sets.
We run saliency models on NVIDIA GTX 1060. Usually, saliency models utilize only GPU for neural network inference, but some also use CPU. We run saliency algorithms on Intel Core i57300HQ @ 2.50GHz. All models were tested on Linux.
Domain Adaptation
Saliency maps predicted by various visualattention models can be significantly improved using sequences of simple postprocessing transformations: brightness correction and blending with the Center Prior.
The relevance of brightness correction is explained by the need to bring the obtained saliency maps to one form in the sense of the parameters of blurring fixations into Gaussians.
Such a difference may manifest itself simply because different models used different ways of blurring Ground Truth fixations for training (although the data itself could be the same).
Thus, brightness correction adapts the size of domain predictions.
Blending with Center Prior adapts the shape and location of domain predictions, giving the desired predictions the appearance of a Gaussian distribution and centering their location.
The need to use domain adaptation can be justified by differences in the average Gaussians of some other popular datasets (there will be original pictures with SAVAM average):
For our evaluation, we used a method^{[1]} that for any saliency model can find the exact globally optimal blending weight and brightnesscorrection function simultaneously in terms of MSE.
Let \(\textbf{SM}^i\) and \(\textbf{GT}^i\) be the respective predicted and Ground Truth saliency maps for the \(i^\text{th}\) frame,
and let \(\textbf{CP}\) be a precomputed Center Prior image. The blending weight is denoted by \(\beta\), and \(p \in \mathbb{R}^2\) is the pixel position. The cost function is then
\[C(\beta, M) = \sum_{i, p} (M(\textbf{SM}^i_p) + \beta\textbf{CP}_p  \textbf{GT}^i_p)^2,\]
where \(M : \mathbb{R}^+ \to \mathbb{R}^+\) is the brightnesscorrection function. Since we store saliency maps as 8bit images, we can represent
this function by the vector \(\textbf{m} \in \mathbb{R}^N\), \(N = 256\), such that \(M(s) = \textbf{m}_s \in \mathbb{R}^+\) for any saliency value \(s \in \mathbb{N}^+\), with
\(s < N\). Then we can consider \(C\) to be the quadratic function of realvalued arguments \(\beta\) and \(\textbf{m}\), so the solution of the
following quadratic programming problem should yield the optimal parameter values:
\[(\beta, \textbf{m}) = \underset{0 < \textbf{m}_i < \textbf{m}_{i+1}}{\underset{\beta > 0}{\operatorname{arg min}}} \sum_{i, p} (\textbf{m}_{\textbf{SM}^i_p} + \beta\textbf{CP}_p  \textbf{GT}^i_p)^2.\]
The constraints guarantee that the brightnesscorrection function \(\textbf{m} \in \mathbb{R}^N\) is monotone. This task has a canonical matrix form:
\[(\beta, \textbf{m}) = \underset{0 < \textbf{x}_i < \textbf{x}_{i + 1}, i > 1}{\underset{\textbf{x}_1 > 0}{\operatorname{arg min}}} \frac{1}{2}\textbf{x}^T\textbf{Hx} + \textbf{f}^T\textbf{x} + c,\]
where \(\textbf{x} = (\beta, \textbf{m}) \in \mathbb{R}^{N + 1}\) contains the target parameters; matrix \(\textbf{H} \in \mathbb{R}^{(N + 1) \times (N + 1)}\),
vector \(\textbf{f} \in \mathbb{R}^{N + 1}\) and scalar \(c\) define the optimization task. \(\textbf{H}\) is a Hermitian sparse matrix containing nonzero elements only in the first row, first column, and main diagonal.
Moreover, \(\textbf{H}\) is a positivedefinite matrix, so the quadratic programming task is convex and easily solved using any one of numerous approaches (we use the interiorpoint method).
The implementation of metrics calculating and domain adaptation is available at our github.
Subscribe to this benchmark's updates using the form and get notified when the paper is available.
References

MSU Benchmark Collection
 Video Upscalers Benchmark
 Video Deblurring Benchmark
 Video Frame Interpolation Benchmark
 HDR Video Reconstruction Benchmark
 SuperResolution for Video Compression Benchmark
 NoReference Video Quality Metrics Benchmark
 FullReference Video Quality Metrics Benchmark
 Video Alignment and Retrieval Benchmark
 Mobile Video Codecs Benchmark
 Video SuperResolution Benchmark
 Shot Boundary Detection Benchmark
 Deinterlacer Benchmark
 The VideoMatting Project
 Video Completion
 Codecs Comparisons & Optimization
 VQMT
 MSU Datasets Collection
 Metrics Research
 Video Quality Measurement Tool 3D
 Video Filters
 Other Projects