Deep Learning of Visual and Textual Data for Region Detection Applied to Item Coding

Date

July 1, 2019

Source

Iberian Conference on Pattern Recognition and Image Analysis (IbPRIA)

Authors

Roberto Arroyo
Javier Tovar
Francisco J. Delgado
Emilio Almazan
Alejandro de la Calle

Abstract

In this work, we propose a deep learning approach that combines visual appearance and text information in a Convolutional Neural Network (CNN), with the aim of detecting regions of different textual categories. We define a novel visual representation of the semantic meaning of text that allows seamless integration into a standard CNN architecture. This representation, referred to as a text-map, is combined with the actual image to provide a much richer input to the network. Text-maps are colored with different intensities depending on the relevance of the words recognized in the image. More specifically, these words are first extracted using Optical Character Recognition (OCR) and then colored according to their probability of belonging to a textual category of interest. The presented solution is especially relevant in the context of item coding for supermarket products, where different types of textual categories must be identified (e.g., ingredients or nutritional facts). We evaluated our approach on the proprietary item coding dataset of Nielsen Brandbank, which comprises more than 10,000 training images and 2,000 test images. The reported results demonstrate that our method, which exploits both visual and textual data, outperforms state-of-the-art algorithms based only on appearance, such as the standard Faster R-CNN. These improvements are reflected in precision and recall, which increase by 42 and 33 points, respectively.
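
The following Python sketch illustrates one way a text-map like the one described above could be rendered from OCR output and stacked with the image before feeding a CNN. The function names, the OCR input format, and the word-to-category scoring function are assumptions made for illustration only; they are not the authors' actual implementation.

```python
import numpy as np

def build_text_map(image_shape, ocr_words, category_prob):
    # Render a single-channel map where each recognized word's bounding box
    # is filled with an intensity proportional to the (assumed) probability
    # that the word belongs to the textual category of interest
    # (e.g., ingredients or nutritional facts).
    h, w = image_shape[:2]
    text_map = np.zeros((h, w), dtype=np.float32)
    for word, (x0, y0, x1, y1) in ocr_words:
        p = float(category_prob(word))  # assumed probability in [0, 1]
        text_map[y0:y1, x0:x1] = np.maximum(text_map[y0:y1, x0:x1], p)
    return text_map

def stack_input(image_rgb, text_map):
    # Concatenate the text-map as an extra channel, so a standard CNN
    # detector receives an H x W x 4 input instead of the usual H x W x 3.
    return np.concatenate([image_rgb.astype(np.float32) / 255.0,
                           text_map[..., None]], axis=-1)
```

In this sketch, `ocr_words` would be a list of `(word, (x0, y0, x1, y1))` pairs produced by an OCR engine, and `category_prob` a scoring function mapping a word to the probability of the target category; both are hypothetical placeholders for whatever components the actual pipeline uses.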