## Overview
![[retrieval-capabilities.png]]
>[!summary]
> There is a fundamental limit to the capacity of vector-based embeddings: for large corpora, standard sparse (non-dense) techniques such as BM25 scale and perform far better.
>[!question]
> What is the theoretical limit of embedding-based retrieval? Can it exceed sparse techniques such as BM25?
>[!idea]
> Establish a theoretical model and design a principled dataset based on it to answer the question.
## 🔮Insights
>[!insight]
> Scaling depends on the embedding size, and even at very large sizes (4096 dimensions), embeddings scale poorly. BM25 and other sparse techniques remain far more accurate.
>[!limitation]
> In practice, not all document combinations are useful for retrieval, so the effective capacity limit is probably higher than the theoretical one.
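The capacity limit above is combinatorial: a corpus of `n` documents has `C(n, k)` possible top-`k` answer sets, but a fixed-dimension dot-product retriever can only realize some of them. A minimal toy sketch (my own illustration with made-up 1-d embeddings, not the paper's construction) makes this visible in the extreme case `d = 1`:

```python
import numpy as np
from itertools import combinations

# Hypothetical 1-d document embeddings; d=1 is chosen to make the
# capacity limit obvious (the paper studies much larger d).
docs = np.array([0.1, 0.5, -0.3, 0.9])
n, k = len(docs), 2

# Sweep many scalar queries and record which top-2 sets a
# dot-product retriever can actually produce.
realizable = set()
for q in np.linspace(-1.0, 1.0, 2000):  # 2000 points, avoids q == 0 exactly
    scores = q * docs                   # dot product in 1 dimension
    top2 = frozenset(np.argsort(-scores)[:k].tolist())
    realizable.add(top2)

all_subsets = {frozenset(c) for c in combinations(range(n), k)}
print(len(realizable), "of", len(all_subsets), "top-2 sets realizable")
# → 2 of 6 top-2 sets realizable
```

In 1-d, the query's sign fixes the document ordering, so only 2 of the 6 possible top-2 sets are ever reachable; higher dimensions raise this bound but never remove it, which is the gap the paper measures.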
## 🧭 Topic Compass
### Where Does X come from?
### What is similar to X?
### What competes with X?
Sparse techniques such as BM25
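For reference, BM25 is a term-frequency scoring function rather than a learned vector model, which is why it sidesteps the embedding-dimension limit. A minimal sketch of Okapi BM25 (standard formula; the tokenized toy corpus is my own):

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    """Score each tokenized document against the query with Okapi BM25."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    # Document frequency of each query term across the corpus.
    df = {t: sum(t in d for d in docs) for t in query_terms}
    scores = []
    for d in docs:
        tf = Counter(d)
        s = 0.0
        for t in query_terms:
            idf = math.log((N - df[t] + 0.5) / (df[t] + 0.5) + 1)
            denom = tf[t] + k1 * (1 - b + b * len(d) / avgdl)
            s += idf * tf[t] * (k1 + 1) / denom
        scores.append(s)
    return scores

# Toy corpus: documents that do / do not contain the query terms.
docs = [["sparse", "retrieval", "scales"],
        ["dense", "embeddings", "have", "limits"],
        ["bm25", "is", "a", "sparse", "retrieval", "baseline"]]
print(bm25_scores(["sparse", "retrieval"], docs))
```

Documents sharing query terms score positively and the term-free document scores exactly zero; no fixed-dimension bottleneck sits between query and corpus.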
### Where can X lead To?
Rethinking vector-based semantic search as a whole, the paradigm that has been pushed as the new default since 2023
## 📖 References
### **Paper**
url: https://arxiv.org/abs/2508.21038v1
![[2509.04664v1.pdf]]