TL;DR – Voyage is a team of leading AI researchers, dedicated to enabling teams to build better RAG applications. Today, we’re releasing a new state-of-the-art embedding model and API, which already beats public models, like OpenAI’s text embeddings, with more to come soon. If you’re excited about custom/fine-tuned embeddings with further enhanced retrieval accuracy, please reach out to contact@voyageai.com for early access.

Quality embeddings drive RAG performance

Have you ever heard folks discussing the importance of embedding models in RAG? And right alongside that, some not-so-great feedback about OpenAI's ada embedding model? If you're curious about why, let's break it down.

Suppose you’re building a chatbot using RAG for a specific domain, e.g., finance. Given a query, e.g., “what is the average price of Apple, Inc., in the last 3 months”, the chatbot first retrieves some relevant docs using an embedding model and a vector DB, and then puts these docs and the query into GPT-4 to generate a response.
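To make the retrieve-then-generate loop concrete, here is a minimal sketch in Python. The `embed` function below is a toy bag-of-words stand-in for illustration only; a real system would call an embedding model API here, and the retrieved docs plus the query would then be passed to an LLM such as GPT-4.

```python
import math
from collections import Counter

def embed(text):
    # Toy bag-of-words "embedding" -- a stand-in for a real embedding model.
    return Counter(text.lower().split())

def cosine(a, b):
    # Cosine similarity between two sparse token-count vectors.
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

docs = [
    "AAPL stock price history for Apple, Inc. over the last quarter",
    "Organic apples on sale this week at the grocery store",
]

def retrieve(query, docs, k=1):
    # Rank docs by similarity to the query and return the top k.
    q = embed(query)
    ranked = sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:k]

top = retrieve("average price of Apple, Inc. stock in the last 3 months", docs)
# `top` (plus the query) would be fed to the LLM to generate the answer.
```

Even this toy retriever shows the core mechanic: whichever document the embedding ranks highest is what the LLM sees, so retrieval quality bounds answer quality.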

Now, the efficacy of the bot will rest on the accuracy and relevance of the docs it retrieves.

Should it retrieve the stock price records of Apple, Inc., GPT-4 is more than smart enough to synthesize an accurate answer; even GPT-3.5 can perhaps do so. But if it grabs the price tags for apples at Whole Foods? That's a classic situation where GPT-4 hallucinates (or sometimes says “I don’t know”).

What determines the retrieval relevance/accuracy? Embeddings. As the representations or indices of the docs, they are responsible for differentiating "Apple, Inc." from the fruit and recognizing that stock prices relate more to "Apple Inc." than Whole Foods does.

This technical post discusses the effect of embeddings on RAG, with a more detailed analysis in a case study of the chat.langchain bot, which uses Voyage’s embedding models.

Key point: RAG's success hinges on the relevance of retrieved docs, which depends on the quality of the embeddings.

Voyage trains best-in-class embedding models

Embedding models are Transformers, and training them requires many pieces: model architecture, data collection, pre-processing and selection, suitable loss functions and optimizers, and efficient implementations.

Unlike LLMs, which are trained with a next-word prediction loss, embedding models are trained with “contrastive learning”, which involves augmenting the data with “positive pairs” (pairs of data that are semantically related) and using them as the training signal. Voyage’s team has 5 researchers who have done extensive research on embeddings over the past 5+ years and published many cutting-edge papers on various components of embedding training, e.g., papers 1, 2, 3, 4, and 5, to name a few. Voyage’s embeddings leverage novel architectures, faster optimizers (e.g., Sophia), a massive dataset with systematic collection and pre-processing (e.g., DoReMi), and proprietary methods for contrastive learning.
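Voyage’s exact training recipe is proprietary, but the generic building block behind most contrastive embedding training is an in-batch-negatives loss (often called InfoNCE): each query is pulled toward its positive pair while the other examples in the batch act as negatives. A minimal NumPy sketch, purely illustrative:

```python
import numpy as np

def info_nce_loss(queries, positives, temperature=0.05):
    """Generic in-batch-negatives contrastive loss (InfoNCE).

    Row i of `positives` is the positive pair for row i of `queries`;
    every other row in the batch serves as a negative.
    """
    # L2-normalize so the dot product is cosine similarity.
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = q @ p.T / temperature  # (B, B) similarity matrix
    # Cross-entropy against the diagonal: index i is the label for row i.
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))
```

Training drives this loss down, which pushes semantically related pairs together in embedding space and unrelated ones apart; production recipes differ in how positive pairs are mined and how negatives are selected.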

State-of-the-art retrieval accuracy on MTEB and real-world industry data

The most commonly used benchmark for evaluating embeddings is the set of retrieval tasks on the HuggingFace MTEB leaderboard. The benchmark is unfortunately a bit overused these days. Thus, for comprehensiveness, we also built nine new datasets that cover a range of real-world industry domains, such as technical documentation, reviews, and news, which we call Real-World Industry Domains (RWID). Following the convention in MTEB, we use the standard NDCG@10 metric on RWID.
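For readers unfamiliar with the metric, NDCG@10 scores the top 10 retrieved documents by their graded relevance, discounting lower ranks logarithmically and normalizing by the best possible ordering. A small self-contained sketch:

```python
import math

def ndcg_at_k(relevances, k=10):
    """NDCG@k for one query.

    `relevances` lists the graded relevance of the retrieved docs in
    ranked order (higher = more relevant). Returns a value in [0, 1].
    """
    def dcg(rels):
        # Discounted cumulative gain: rank i is discounted by log2(i + 2).
        return sum(r / math.log2(i + 2) for i, r in enumerate(rels[:k]))

    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0
```

A score of 1.0 means the retriever ranked the documents exactly as an oracle would; putting a relevant document lower in the list costs progressively less the further down it lands, which is why top-rank accuracy dominates the metric.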

The takeaway from the table below:

  • Voyage is the state-of-the-art on both MTEB and RWID.
  • BGE-large is much weaker than Voyage and OpenAI on the RWID datasets, but is a close second on MTEB, which suggests it may have overfit to MTEB.

Domain-specific or company-specific embeddings

Real-world scenarios are typically more challenging than academic benchmarks because they are often “out-of-domain” relative to the training data. Each industry has its unique terminology and knowledge base, just as every enterprise does. Furthermore, every individual user has distinct styles and preferences. Voyage's embeddings outperform others right out of the box (as seen in the table above); however, further customization and adaptation can yield higher quality or reduced costs.

Voyage offers embedding models tailored for coding and finance, with more domains on the horizon. We also provide embeddings fine-tuned on small, unlabeled company-specific datasets, achieving a consistent 10-20% accuracy boost for early pilot customers such as LangChain, OneSignal, Druva, and Galpha.  

Interested in early access and an accuracy boost? Email contact@voyageai.com. Follow us on Twitter and/or LinkedIn for more updates!