Overall framework
ViSAudio tackles end-to-end binaural spatial audio generation directly from silent video. It introduces the BiAudio dataset and a conditional flow matching architecture that combines dual audio branches with a conditional spacetime module to keep the generated audio spatially consistent.
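To make the conditional flow matching objective concrete, the following is a minimal NumPy sketch of one training-loss evaluation. It is an illustration under stated assumptions, not ViSAudio's implementation: the straight-line probability path, the two-channel latent layout, the per-frame conditioning tensor, and the names `cfm_training_pair`, `toy_velocity`, and `cfm_loss` are all hypothetical, and the toy linear velocity field merely stands in for the actual network.

```python
import numpy as np


def cfm_training_pair(x1, rng):
    """Sample (x_t, t, target velocity) for a batch of data samples x1.

    Assumes the straight-line path x_t = (1 - t) * x0 + t * x1 with
    Gaussian noise x0; the source does not state which path ViSAudio uses.
    """
    x0 = rng.standard_normal(x1.shape)
    # One time value per batch element, shaped to broadcast over x1.
    t = rng.uniform(size=(x1.shape[0],) + (1,) * (x1.ndim - 1))
    xt = (1.0 - t) * x0 + t * x1
    target = x1 - x0  # velocity of the straight-line path
    return xt, t, target


def toy_velocity(xt, t, cond, w):
    """Hypothetical stand-in for the conditional velocity network."""
    return w[0] * xt + w[1] * t + w[2] * cond


def cfm_loss(x1, cond, w, rng):
    """Mean squared error between predicted and target velocities."""
    xt, t, target = cfm_training_pair(x1, rng)
    pred = toy_velocity(xt, t, cond, w)
    return float(np.mean((pred - target) ** 2))


rng = np.random.default_rng(0)
batch, channels, frames = 4, 2, 16  # channels = 2 is an assumed left/right layout
x1 = rng.standard_normal((batch, channels, frames))  # placeholder audio latents
cond = rng.standard_normal((batch, 1, frames))       # placeholder video features
w = np.zeros(3)                                      # untrained toy parameters
loss = cfm_loss(x1, cond, w, rng)
```

In a real system the toy field would be replaced by a learned network conditioned on video features, and sampling would integrate the learned velocity from noise at t = 0 to data at t = 1.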