Contrastive Multi-Modal Video Transformer


Transformer networks have shown great promise in video classification and understanding tasks by reducing the dependency on recurrent networks, and instead using self-attention techniques. Recurrent techniques are often not suitable for videos/data with long-term temporal dependencies due to the vanishing gradient problem, as well as the inability to fully backpropagate due to computational power restrictions. By using self-attention, a neural network can learn long-term dependencies with lower computational requirements and higher accuracy. We aim to increase the performance of self-supervised video transformer networks by contrasting frame-level local representations with the video-level global representations and determining how contrasting different representations from different data modalities may increase accuracy.

View Slides