Large Language Models Meet Extreme Multi-label Classification: Scaling and Multi-modal Framework
Date
November 2025
Source
AAAI’26, the 40th Annual AAAI Conference on Artificial Intelligence (main conference track)
Authors
Diego Ortego
Marlon Rodríguez
Mario Almagro
Kunal Dahiya
David Jiménez
Juan C. SanMiguel
Summary of Our Research
Unlocking the Power of Foundation Models in Extreme Multi-Label Classification (XMC)
Foundation models have transformed AI, but their potential in XMC, where queries must be matched with multiple items from an extremely large label space, remains underexplored.
Our research introduces ViXML, a novel framework that combines large-scale decoder models with visual information to boost accuracy without sacrificing efficiency. By fusing image embeddings with text representations, ViXML achieves state-of-the-art performance, outperforming text-only approaches by up to +8.21% in P@1 (precision at the top-ranked prediction) on the largest benchmark. This work demonstrates that incorporating images alongside text can deliver significant gains, paving the way for more powerful and scalable solutions to real-world product recommendation tasks.
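The summary describes fusing image embeddings with a language model's text representation and scoring the result against an extremely large label space. The sketch below is a minimal, hypothetical illustration of that general pattern, not the ViXML implementation: the module name, dimensions, and the concatenation-plus-projection fusion are assumptions chosen for illustration only.

```python
# Hypothetical sketch of multi-modal scoring for XMC; NOT the ViXML method.
# It only illustrates the general idea: fuse a text embedding with an image
# embedding into one query vector, then rank an extreme label space by
# similarity. All names and sizes here are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiModalScorer(nn.Module):
    def __init__(self, text_dim: int, image_dim: int, embed_dim: int, num_labels: int):
        super().__init__()
        # Project concatenated text+image features into a shared space.
        self.fuse = nn.Linear(text_dim + image_dim, embed_dim)
        # One learned vector per label; in XMC num_labels can reach millions,
        # so real systems pair this with approximate nearest-neighbor search.
        self.label_embeddings = nn.Embedding(num_labels, embed_dim)

    def forward(self, text_emb: torch.Tensor, image_emb: torch.Tensor) -> torch.Tensor:
        # text_emb: (batch, text_dim), e.g. pooled states from a decoder LLM.
        # image_emb: (batch, image_dim), e.g. from a frozen vision encoder.
        query = self.fuse(torch.cat([text_emb, image_emb], dim=-1))
        query = F.normalize(query, dim=-1)
        labels = F.normalize(self.label_embeddings.weight, dim=-1)
        # Cosine-similarity scores over the full label space: (batch, num_labels).
        return query @ labels.T


# Toy usage: rank 10,000 labels for 2 queries and keep the top 5 per query.
scorer = MultiModalScorer(text_dim=4096, image_dim=768, embed_dim=512, num_labels=10_000)
scores = scorer(torch.randn(2, 4096), torch.randn(2, 768))
top_scores, top_labels = scores.topk(k=5, dim=-1)
print(top_labels)  # indices of the 5 highest-scoring labels per query
```

Under this framing, P@1 simply measures how often the single highest-scoring label is relevant to the query.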