Large Language Models Meet Extreme Multi-label Classification: Scaling and Multi-modal Framework

Date

November 2025

Source

AAAI’26, the 40th Annual AAAI Conference on Artificial Intelligence (main conference track)

Authors

Diego Ortego

Marlon Rodríguez

Mario Almagro

Kunal Dahiya

David Jiménez

Juan C. SanMiguel

Summary of Our Research

Unlocking the Power of Foundation Models in Extreme Multi-Label Classification (XMC)
Foundation models have transformed AI, but their potential in XMC, where queries must be matched with multiple items from an extremely large label space, remains underexplored.

Our research introduces ViXML, a novel framework that pairs large-scale decoder language models with visual information to boost accuracy without sacrificing efficiency. By fusing image embeddings with text, ViXML achieves state-of-the-art performance, outperforming text-only approaches by up to +8.21% in P@1 on the largest benchmark. This demonstrates that incorporating images alongside text delivers significant gains, paving the way for more powerful and scalable solutions to real-world product recommendation tasks.
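To make the core idea concrete, here is a minimal PyTorch sketch of one way a multi-modal query could be scored against a large label space: a text embedding (e.g. pooled from a decoder LLM) and a precomputed image embedding are projected into a shared space, fused, and ranked against label embeddings by cosine similarity. This is an illustrative toy, not the ViXML implementation; the class name, additive fusion, and all dimensions are assumptions for demonstration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiModalXMCScorer(nn.Module):
    """Toy dual-encoder scorer: fuses a text embedding with a
    precomputed image embedding, then ranks labels by cosine
    similarity against a (potentially very large) label matrix.
    Hypothetical sketch; not the authors' architecture."""

    def __init__(self, text_dim: int, image_dim: int,
                 joint_dim: int, num_labels: int):
        super().__init__()
        # Project both modalities into a shared joint space.
        self.text_proj = nn.Linear(text_dim, joint_dim)
        self.image_proj = nn.Linear(image_dim, joint_dim)
        # Label embeddings stand in for encoded label text/metadata.
        self.label_emb = nn.Embedding(num_labels, joint_dim)

    def forward(self, text_emb: torch.Tensor,
                image_emb: torch.Tensor) -> torch.Tensor:
        # Simple additive fusion (one of many possible choices).
        query = self.text_proj(text_emb) + self.image_proj(image_emb)
        query = F.normalize(query, dim=-1)
        labels = F.normalize(self.label_emb.weight, dim=-1)
        # Cosine-similarity scores over the whole label space.
        return query @ labels.T

# Usage: rank the top-5 labels for a batch of two queries.
model = MultiModalXMCScorer(text_dim=768, image_dim=512,
                            joint_dim=256, num_labels=10_000)
text_emb = torch.randn(2, 768)   # e.g. pooled LLM query embeddings
image_emb = torch.randn(2, 512)  # e.g. frozen vision-encoder outputs
scores = model(text_emb, image_emb)
top5 = scores.topk(5, dim=-1).indices
print(top5.shape)  # torch.Size([2, 5])
```

Using precomputed, frozen image embeddings (as in the sketch) is one way to add visual information without increasing the cost of the language-model forward pass, which matches the summary's emphasis on accuracy gains without sacrificing efficiency.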