Google Research hosted the 3rd YouTube-8M Video Understanding Challenge, which asked participants to localize video-level labels to the precise times in a video where each label actually appears, and to do so at an unprecedented scale. Our (Team Ceshine) mixture of context-aware and context-agnostic segment classifiers won 7th place on the final leaderboard with a relatively low budget: $150 in GCP credit plus one additional local GTX 1070 card for the whole process.
There are on average 237 annotated segments per class, which is generally considered to be too few to train even moderately sophisticated models. Therefore, a transfer learning approach was adopted to avoid overfitting and improve generalization. The video label prediction task from the previous year’s YouTube-8M challenge was used to pre-train video-level models. Further fine-tuning on the segments dataset helps the model to more accurately pinpoint relevant frames in shorter segments.
Directly fine-tuned models are context-agnostic, as they have no information about the other parts of the video. However, for some classes, such context information can be used to make better predictions. To make use of this information, context-aware models are created by combining a video encoder and a segment encoder. The video and segment embedding vectors from the two encoders are concatenated together to predict class probabilities.
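The concatenation step above can be sketched as follows. This is a minimal illustration, not the actual model: the encoder architectures, embedding sizes, and classifier head are all assumptions (here the encoders are stand-in mean-pooling functions over pre-extracted frame features).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions; the real encoder sizes are not given in the post.
VIDEO_DIM, SEGMENT_DIM, N_CLASSES = 128, 128, 1000

def video_encoder(video_frames):
    # Stand-in encoder: mean-pool all frame features into a video embedding.
    return video_frames.mean(axis=0)

def segment_encoder(segment_frames):
    # Stand-in encoder: mean-pool the segment's frames into a segment embedding.
    return segment_frames.mean(axis=0)

# Linear classifier head over the concatenated [video; segment] embedding.
W = rng.normal(scale=0.01, size=(VIDEO_DIM + SEGMENT_DIM, N_CLASSES))
b = np.zeros(N_CLASSES)

def predict(video_frames, segment_frames):
    ctx = video_encoder(video_frames)      # context from the whole video
    seg = segment_encoder(segment_frames)  # local segment information
    joint = np.concatenate([ctx, seg])     # context-aware representation
    logits = joint @ W + b
    return 1.0 / (1.0 + np.exp(-logits))   # independent per-class sigmoids

video = rng.normal(size=(300, VIDEO_DIM))  # e.g. 300 frames of features
segment = video[100:105]                   # a short 5-frame segment
probs = predict(video, segment)            # one probability per target class
```

Because the video embedding is shared across all segments of the same video, it can be computed once and reused when scoring each segment.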
By using a mixture of context-aware and context-agnostic segment classification models, the different characteristics of the 1,000 target classes are better accommodated, which improves overall performance.
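One simple way to realize such a mixture is a per-class weighted average of the two models' predicted probabilities. The per-class weights and how they would be chosen (e.g. tuned on a validation set) are assumptions for illustration; the post does not specify the exact mixing scheme.

```python
import numpy as np

rng = np.random.default_rng(0)
n_classes = 1000

# Stand-in predictions from the two model families for one segment.
p_aware = rng.uniform(size=n_classes)     # context-aware model probabilities
p_agnostic = rng.uniform(size=n_classes)  # context-agnostic model probabilities

# Hypothetical per-class mixing weights in [0, 1]: classes that benefit from
# video context would get a weight closer to 1, context-free classes closer to 0.
w = rng.uniform(size=n_classes)

# Convex combination keeps the result a valid probability per class.
p_mix = w * p_aware + (1.0 - w) * p_agnostic
```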
The technologies used in this project: