RAG Information Overload

Retrieval Augmented Generation (RAG) is getting easier by the minute. I can’t keep up with the daily influx of new tools in the space. My initial experience with ChatGPT was not positive - I couldn’t believe how useless the tool was. The “hallucinations” were what really threw me off - the propensity for the LLM to make up random information was just too high.

RAG changed all of this for me. With the ability to ground the AI in my own sources of truth, it suddenly became an indispensable tool. I really was blown away by the power of RAG to shift the functional utility of LLMs.

Today, I noticed something funny - our customer support AI seemed to have forgotten its training. It was now getting questions “wrong” that it previously had an amazing track record of answering.

The culprit? Information overload.

Originally we only trained pSai on the polySpectra documentation website and the product pages from our e-commerce store. That was in part because I wanted it to just be able to answer basic product questions, and in part because I ran into a technical difficulty getting it to scrape the entire polySpectra website when I first set it up. Recently, I figured out how to train it on the entire polySpectra.com website, which at first seemed like a good thing.

Unexpectedly, more became less. This extra information was the RAG that broke the camel’s back. There were just enough conflicting sources of truth in the AI’s verified sources to confuse it. Where before it was doing an amazing job, it is now giving wrong answers.
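
To make the failure mode concrete, here is a toy sketch of how one extra source can put conflicting facts into the context the LLM sees. This is not how pSai is built: the word-overlap scoring stands in for a real embedding search, the source labels ("docs", "store", "blog") are hypothetical, and the widget facts are invented purely for illustration.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Doc:
    source: str  # hypothetical label for where the text was scraped from
    text: str


def tokenize(text):
    # Lowercase and split into alphanumeric words.
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def score(query, doc):
    # Toy relevance: number of words shared by the query and the document.
    # A real RAG pipeline would typically use embedding similarity instead.
    return len(tokenize(query) & tokenize(doc.text))


def retrieve(query, docs, k=2, allowed_sources=None):
    # Optionally restrict retrieval to a curated set of sources,
    # then return the k best-scoring documents.
    pool = [d for d in docs if allowed_sources is None or d.source in allowed_sources]
    ranked = sorted(pool, key=lambda d: score(query, d), reverse=True)
    return [d for d in ranked[:k] if score(query, d) > 0]


# Invented example corpus: the "blog" entry contradicts the "docs" entry.
CORPUS = [
    Doc("docs", "The widget warranty period is 24 months."),
    Doc("store", "The widget ships in a recyclable box."),
    Doc("blog", "Launch announcement: the widget warranty period is 12 months."),
]

QUERY = "What is the widget warranty period?"

everything = retrieve(QUERY, CORPUS)
curated = retrieve(QUERY, CORPUS, allowed_sources={"docs", "store"})

print([d.source for d in everything])  # both warranty claims reach the LLM
print([d.source for d in curated])     # only the maintained claim reaches the LLM
```

With the whole corpus indexed, the two contradictory warranty statements both land in the retrieved context, and the model has no way to know which one is current. Restricting retrieval to the curated sources removes the conflict before the model ever sees it.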

Without the sources of truth, LLMs are pretty useless customer support agents. With just enough information, they are surprisingly good. With too much, they become unhelpful again.

The role of the humans in this situation is clearly to maintain a single source of truth. It makes me wonder how many humans we’ve confused with our website over the years…much more to distill and refine.

It also makes me wonder what my “context window” is. How many things do I get confused about because I have access to too many sources of information?