Fine-tuning your answers: A bag of tricks for improving VQA models


January 8, 2022


Multimedia Tools And Applications (MTAP)


Roberto Arroyo
Sergio Álvarez Pardo
Aitor Aller
Luís M. Bergasa
Miguel E. Ortiz


In this paper, one of the most novel topics in Deep Learning (DL) is explored: Visual Question Answering (VQA). This research area uses three of the most important fields in Artificial Intelligence (AI) to automatically provide natural language answers for questions that a user can ask about an image. These fields are: 1) Computer Vision (CV), 2) Natural Language Processing (NLP) and 3) Knowledge Representation & Reasoning (KR&R). Initially, a review of the state of art in VQA and our contributions to it are discussed. Then, we build upon the ideas provided by Pythia, which is one of the most outstanding approaches. Therefore, a study of the Pythia’s architecture is carried out with the aim of presenting varied enhancements with respect to the original proposal in order to fine-tune models using a bag of tricks. Several training strategies are compared to increase the global accuracy and understand the limitations associated with VQA models. Extended results check the impact of the different tricks over our enhanced architecture, jointly with additional qualitative results.