Overall framework
This work studies long-horizon prediction of raw gaze in videos. It models continuous gaze trajectories with high-resolution timestamps using an autoregressive diffusion model conditioned on a saliency-aware visual latent space. Compared with short-window gaze prediction methods, this improves long-range spatiotemporal accuracy and realism.
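To make the idea of autoregressive diffusion concrete, the following is a minimal sketch and not the method's actual implementation. It assumes gaze is generated chunk by chunk, that each chunk is drawn with standard DDPM-style ancestral sampling, and that each chunk is conditioned on the last generated gaze point and a per-chunk visual latent. All names here (toy_denoiser, sample_chunk, rollout), the linear noise schedule, and the chunking itself are hypothetical; the denoiser is an untrained placeholder, and timestamps are omitted for brevity.

```python
import math
import random


def toy_denoiser(x_t, t, prev_point, latent):
    """Hypothetical placeholder for a trained noise-prediction network.

    A real model would be learned and would use t; this stand-in ignores t
    and returns each point's offset from a latent-shifted anchor.
    """
    anchor = (prev_point[0] + 0.01 * latent[0],
              prev_point[1] + 0.01 * latent[1])
    return [(x - anchor[0], y - anchor[1]) for x, y in x_t]


def sample_chunk(prev_point, latent, chunk_len, betas, rng):
    """DDPM-style ancestral sampling of one chunk of (x, y) gaze points."""
    # Cumulative products of alpha_t = 1 - beta_t.
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)

    # Start from pure Gaussian noise.
    chunk = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(chunk_len)]

    for t in reversed(range(len(betas))):
        eps = toy_denoiser(chunk, t, prev_point, latent)
        scale = betas[t] / math.sqrt(1.0 - alpha_bars[t])
        inv = 1.0 / math.sqrt(1.0 - betas[t])
        # No noise is added at the final reverse step.
        sigma = math.sqrt(betas[t]) if t > 0 else 0.0
        chunk = [
            (inv * (x - scale * ex) + sigma * rng.gauss(0, 1),
             inv * (y - scale * ey) + sigma * rng.gauss(0, 1))
            for (x, y), (ex, ey) in zip(chunk, eps)
        ]
    return chunk


def rollout(start_point, latents, chunk_len=8, num_steps=20, seed=0):
    """Autoregressive rollout: each chunk conditions on the previous one."""
    rng = random.Random(seed)
    # Assumed linear beta schedule from 1e-4 to 0.02.
    betas = [1e-4 + (0.02 - 1e-4) * i / (num_steps - 1)
             for i in range(num_steps)]
    trajectory = [tuple(start_point)]
    for latent in latents:
        trajectory.extend(
            sample_chunk(trajectory[-1], latent, chunk_len, betas, rng))
    return trajectory
```

Calling rollout((0.5, 0.5), [(1.0, 0.0)] * 3) yields the start point followed by three 8-point chunks. The point of the sketch is the control flow: long horizons come from chaining short diffusion samples, with each sample seeing the tail of the trajectory generated so far.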