Towards Fine-Tuning of VQA Models in Public Datasets

Date

November 19, 2020

Source

Workshop of Physical Agents (WAF)

Authors

Miguel E. Ortiz
Luis M. Bergasa
Roberto Arroyo
Sergio Alvarez Pardo
Aitor Aller

Abstract

This paper studies the Visual Question Answering (VQA) topic, which combines Computer Vision (CV), Natural Language Processing (NLP) and Knowledge Representation & Reasoning (KR&R) in order to automatically provide natural language responses to questions asked by users about images. A review of the state of the art for this technology is first carried out. Among the different approaches, we select the model known as Pythia to build upon, since it is one of the most popular and successful methods in the public VQA Challenge. Recently, an exhaustive breakdown of the Pythia code was carried out by Facebook AI Research (FAIR). We choose to use this updated framework after confirming that both implementations have analogous characteristics. We introduce the different modules of the FAIR implementation and explain how to train our model, proposing some improvements over the baseline. Different fine-tuned models are trained, achieving an accuracy of 66.22% in the best case on the test set of the public VQA-v2 dataset. A comparison of the quantitative results for the most important experiments, together with some qualitative results, is discussed. This experimentation is performed with the aim of eventually applying it to eCommerce and store observation use cases for VQA in further research.
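
For context on the reported 66.22% figure, the VQA-v2 benchmark uses a "soft" accuracy that compares a predicted answer against 10 human annotations. The sketch below illustrates that metric under simplifying assumptions (it omits the official answer-string normalization, and the function name and example answers are purely illustrative), rather than reproducing the paper's evaluation code.

```python
from itertools import combinations

def vqa_accuracy(predicted: str, human_answers: list[str]) -> float:
    """Soft VQA accuracy for one question: a prediction scores
    min(#matching annotators / 3, 1), averaged over every subset that
    leaves one of the 10 human annotators out (the standard protocol)."""
    subset_scores = []
    for subset in combinations(human_answers, len(human_answers) - 1):
        matches = sum(answer == predicted for answer in subset)
        subset_scores.append(min(matches / 3.0, 1.0))
    return sum(subset_scores) / len(subset_scores)

# Example: 4 of the 10 annotators agree with the model's answer, so
# every leave-one-out subset still keeps at least 3 matches -> score 1.0.
answers = ["red"] * 4 + ["dark red"] * 3 + ["maroon"] * 3
print(vqa_accuracy("red", answers))  # 1.0
```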