Summary
Infinitive has finished the first phase of implementing a Retrieval-Augmented Generation (RAG) application to help onboard new employees. Infinitive is a data and AI consultancy headquartered in Ashburn, Virginia, outside Washington, DC. Like most consulting firms, we are continually hiring new employees to keep pace with our business growth. To help new employees understand Infinitive’s internal policies, we decided to create a RAG-based application that allows natural-language queries of our employee policy documents. This blog describes how we approached development of the application and how we solved the challenges we encountered along the way.
RAG development approach
Infinitive used a metrics-based development approach to create this RAG application; we use the same approach both for internal application development and for projects we conduct for paying clients. The approach revolves around quickly building a scaled-back RAG application and using a series of accuracy and completeness metrics to guide how we scale it up. Those metrics include the following (a sketch of how to compute them appears after the list):
- Retrieval Precision: Retrieval precision is the ratio of relevant documents retrieved to the total number of documents retrieved. It is an indicator of the accuracy of the retrieval system.
- Retrieval Recall: Retrieval recall is the ratio of relevant documents retrieved to the total number of relevant documents in the dataset. It measures the system’s ability to find all relevant documents.
- Retrieval F1 Score: The F1 score is the harmonic mean of precision and recall. It provides a balance between the two metrics, giving a single score that considers both false positives and false negatives.
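To make these metrics concrete, here is a minimal sketch of how all three can be computed for a single query. The document IDs and the `retrieval_metrics` helper are illustrative, not part of our production code; in practice, the relevant set comes from a hand-labeled evaluation set.

```python
# Minimal sketch: retrieval precision, recall, and F1 for one query.
# `retrieved` and `relevant` are illustrative sets of document IDs.
def retrieval_metrics(retrieved: set, relevant: set) -> dict:
    true_positives = len(retrieved & relevant)
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

# Example: 3 of the 4 retrieved documents are relevant, out of 5 relevant overall.
print(retrieval_metrics({"d1", "d2", "d3", "d7"}, {"d1", "d2", "d3", "d4", "d5"}))
# -> {'precision': 0.75, 'recall': 0.6, 'f1': 0.666...}
```

Averaging these per-query scores across an evaluation set gives the system-level numbers used to guide scaling decisions.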
Our early scores, measured against a reduced document set, left much to be desired. Two focus areas substantially improved our metrics and responses:
- We adjusted RAG parameters and configurations as we added documents to the knowledge base.
- We changed our chunking approach and added a re-ranker (sketched after this list).
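As an illustration of the second change, here is a sketch of overlapping-window chunking combined with a cross-encoder re-ranker. The splitter parameters, the re-ranker model, the file name, and the query are assumptions for the example, not our exact configuration.

```python
# Sketch (not our exact configuration): overlapping chunks plus a
# cross-encoder re-ranker that reorders retrieved chunks by relevance.
from langchain_text_splitters import RecursiveCharacterTextSplitter
from sentence_transformers import CrossEncoder

policy_text = open("employee_handbook.txt").read()  # hypothetical document

# Split into ~1000-character chunks with 200 characters of overlap.
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = splitter.split_text(policy_text)

query = "How many PTO days do new employees receive?"
candidates = chunks[:20]  # stand-in for the top hits from vector search

# Score each (query, chunk) pair and sort best-first.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
scores = reranker.predict([(query, chunk) for chunk in candidates])
reranked = [chunk for _, chunk in sorted(zip(scores, candidates), key=lambda p: -p[0])]
```

The overlap keeps sentences that straddle a chunk boundary retrievable from either side, and the re-ranker compensates for embedding-only retrieval by scoring the query against each candidate chunk directly.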
Building out the application – scaling up
Once we had a base application with acceptable precision and recall, we built out the full production application. We developed and implemented the data pipelines required to keep documents flowing, wrote a UI, and integrated the application with our existing security processes. We continued to add employee policy documents and worked to parse documents that included more than just text. Throughout, we kept measuring precision, recall, and latency.
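As a sketch of what that instrumentation can look like (the index handle and parameter names are assumptions, with `index` standing in for a Databricks vector search index), each retrieval call can be timed and logged alongside the quality metrics:

```python
import logging
import time

logger = logging.getLogger("rag_app")

def timed_retrieval(index, query: str, k: int = 5):
    """Run a vector search and log its latency (illustrative sketch)."""
    start = time.perf_counter()
    results = index.similarity_search(query_text=query, num_results=k)
    latency_ms = (time.perf_counter() - start) * 1000
    logger.info("retrieval latency %.1f ms for query %r", latency_ms, query)
    return results
```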
Using Databricks and Lessons Learned
We used Databricks technology throughout the application, and it worked quite well. Some of our key learnings are below:
- Databricks simplifies the process of creating a RAG solution.
- Unity Catalog, a centralized metadata store that provides unified access to data, is a great benefit.
- A vector search endpoint serves the embeddings for a vector search index, so applications can query it much like an API (see the sketch after this list).
- The complexity of a RAG implementation depends on its inputs and desired features.
- Different file types have different data processing requirements.
- Out-of-the-box functions from Databricks demos rely on specific data formatting that can be tricky to achieve.
- The RAG chain itself, including chat history and other features, needs to be programmatically linked to the LLM.
- The level of data prep and pipeline engineering required should not be underestimated.
- Many RAG demos focus on model building and downplay the need for data processing; the quality of the data you feed into a RAG implementation greatly impacts the results.
- There needs to be a plan to “productionize” the data pipelines feeding the RAG model.
- The right chunking strategy boosts performance and lets you customize output.
- There are numerous LLMs, chunking methods, and other components to try out and fine-tune.
- There are multiple types of vector search indexes: managed embeddings, self-managed embeddings, and direct index.
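To make the endpoint and index types concrete, here is a minimal sketch using the Databricks Vector Search SDK to create a Delta Sync index with managed embeddings and query it through the endpoint. Every catalog, table, endpoint, and embedding-model name below is a placeholder.

```python
from databricks.vector_search.client import VectorSearchClient

client = VectorSearchClient()

# Managed embeddings: Databricks computes embeddings from a text column and
# keeps the index in sync with the source Delta table. (Self-managed
# embeddings supply a precomputed vector column instead; a direct index is
# written to via the API rather than synced from a table.)
index = client.create_delta_sync_index(
    endpoint_name="policy_docs_endpoint",          # placeholder names
    index_name="main.hr.policy_chunks_index",
    source_table_name="main.hr.policy_chunks",
    pipeline_type="TRIGGERED",
    primary_key="chunk_id",
    embedding_source_column="chunk_text",
    embedding_model_endpoint_name="databricks-gte-large-en",
)

# The endpoint then serves the index much like an API.
results = index.similarity_search(
    query_text="How do I enroll in the benefits plan?",
    columns=["chunk_id", "chunk_text"],
    num_results=5,
)
```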
Conclusion
Building RAG-based applications is considerably different from building typical systems. The process is much more about mixing and matching the right components than writing large amounts of code. One big consideration for RAG-based applications is the format and quality of the documents being turned into embeddings. Well-constructed, well-written documents simplify the process. However, many internal documents in enterprises are not particularly well organized or well written. That’s where different chunking approaches, query-rewriting modules, re-rankers, and autocut modules can improve precision and recall. The key is to monitor the metrics throughout the process and use these optional capabilities only as necessary.