G60917.mp4
The g60917.mp4 clip is used to help AI systems learn "visual common sense", for example knowing that an object will fall if it is pushed off an edge [2, 5].

Common Research Uses
Learning temporal aspects of video via self-attention.
Applying transformer architectures to video recognition.
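
As a rough illustration of what "learning temporal aspects of video via self-attention" means, the sketch below lets every frame of a clip attend to every other frame. It is a minimal, pure-Python toy under stated assumptions: the per-frame feature vectors are hypothetical, queries, keys, and values are the raw features (no learned projections), and the function name `temporal_self_attention` is invented for this example. It is not the architecture of any particular published model.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def temporal_self_attention(frames):
    """Single-head self-attention across the time axis.

    frames: list of T per-frame feature vectors, each a list of D floats.
    Assumption: queries, keys, and values are the features themselves,
    so this shows only the attention-weighted mixing step.
    Returns (outputs, weights); weights[t][s] is how strongly frame t
    attends to frame s.
    """
    dim = len(frames[0])
    scale = 1.0 / math.sqrt(dim)
    outputs, weights = [], []
    for query in frames:
        # Scaled dot-product score between this frame and every frame.
        scores = [scale * sum(q * k for q, k in zip(query, key)) for key in frames]
        attn = softmax(scores)
        # Weighted average of all frame features.
        mixed = [sum(a * value[d] for a, value in zip(attn, frames)) for d in range(dim)]
        weights.append(attn)
        outputs.append(mixed)
    return outputs, weights


# Hypothetical 3-frame clip with 2-D features (made-up numbers).
clip_features = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
outputs, weights = temporal_self_attention(clip_features)
```

Each row of `weights` sums to 1, so every output frame is a convex combination of all frames in the clip. Real video transformers add learned projections, multiple heads, and positional information on top of this step.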

Context of the Video
The file is associated with Something-Something V2, which contains over 220,000 video clips [3]. In this dataset, "g60917.mp4" would correspond to a single clip annotated with an action label such as "Pushing [something] so that it falls off the table" or a similar interaction, depending on the specific version's indexing [1, 4].

The primary research paper associated with this dataset and its corresponding video files is The "Something Something" Video Database for Learning and Evaluating Visual Common Sense, by Raghav Goyal, Samira Ebrahimi Kahou, Vincent Michalski, and colleagues, including Roland Memisevic (2017) [2, 5].
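
The mapping from a file name to its label can be sketched as a simple dictionary lookup. The example below is only a sketch under assumptions: the annotation records, their "id" and "template" field names, the second clip ID, and the helper names `build_label_index` and `label_for_file` are all hypothetical, and the real annotation files for a given dataset version may be laid out differently.

```python
import json
import os

# Hypothetical annotation records (made-up sample, not real dataset entries).
SAMPLE_ANNOTATIONS = json.loads("""
[
  {"id": "g60917", "template": "Pushing [something] so that it falls off the table"},
  {"id": "g60918", "template": "Holding [something]"}
]
""")


def build_label_index(annotations):
    """Map each clip ID to its label template."""
    return {record["id"]: record["template"] for record in annotations}


def label_for_file(filename, index):
    """Return the label for a video file name, or None if the ID is unknown.

    The clip ID is assumed to be the file name without its extension,
    compared case-insensitively.
    """
    clip_id = os.path.splitext(os.path.basename(filename))[0].lower()
    return index.get(clip_id)


index = build_label_index(SAMPLE_ANNOTATIONS)
label = label_for_file("G60917.mp4", index)
```

Returning `None` for an unknown ID, rather than raising, reflects the point made above that the label attached to a given file name depends on the version's indexing.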