Summary
AI is taking the business world by storm, and companies of all sizes find themselves in a sink-or-swim position amid the flood of automation. Business owners are asking a whirlwind of questions: “What is AI?” “How do we use it?” and, most importantly, “How do we build AI-driven solutions?”
Infinitive, a data and AI consultancy headquartered in Ashburn, VA, just outside Washington, DC, recently developed and implemented a Retrieval-Augmented Generation (RAG) application that streamlines onboarding for new employees by automating responses to common HR questions.
Building RAG-based applications requires a significantly different approach than building traditional systems. Instead of writing large amounts of code, the process relies on strategically selecting and combining the right components. A key factor in the success of a RAG application is the quality and format of the documents used as input, which are converted into embeddings for retrieval. Well-written documents make the process much easier; in many enterprises, however, internal documents are not structured with a RAG use case in mind.
This blog walks through how Infinitive built and deployed its RAG application using Databricks Apps, highlighting the challenges encountered and the solutions implemented along the way. If you want more information about Infinitive’s RAG solution, download our free eBook on our step-by-step RAG implementation here.
RAG Development Approach
Retrieval-Augmented Generation (RAG) is a relatively new yet powerful architecture that enhances large language models (LLMs) by enabling them to retrieve relevant information from data sources such as company documents. In Infinitive’s case, those data sources are internal documents containing HR policies: vacation policies, benefits information, training, and other employee-related content. Below is a breakdown of Infinitive’s approach, from data sourcing through retrieval and response generation.
- Data Sourcing and Preparation
- The first step in building a RAG model was sourcing and preparing relevant data: in this case, PDFs of Infinitive’s HR policies. These documents contain the information employees need for HR-related questions. Once the data was sourced, a Python PDF reader was used to extract the text. For image-based PDFs, OCR tools could be used, but that wasn’t needed here. The extracted text was then cleaned to remove unnecessary formatting. (A minimal extraction sketch appears after this list.)
- Chunking the Data
- After cleaning the data, the next step is usually chunking the content into smaller, manageable pieces. This divides large documents into relevant sections, such as the vacation-time or approval-process sections of a vacation policy. However, since Infinitive’s PDFs were already well organized into sections, chunking wasn’t necessary. If needed, tools like spaCy or NLTK could segment the text, keeping chunks small enough for fast retrieval but large enough to provide context; the extraction sketch after this list includes an illustrative splitter.
- Building the Retrieval System
- Next, the retrieval system was set up to quickly identify the most relevant data when a user asks a question. Using the databricks-bge-large-en embeddings model, the system scores each chunk for similarity to the user’s query, so the most relevant data is retrieved efficiently (see the retrieval sketch after this list).
- Integrating LangChain for Response Generation
- Once relevant chunks are retrieved, LangChain formats them into a structured prompt for databricks-dbrx-instruct, a Databricks-provided model designed for instruction-based tasks. The model uses the provided parameters and context (the specific information retrieved from the HR policies) to generate accurate responses. If the context doesn’t contain the answer, such as when a user asks about the weather, which isn’t part of the HR policies, the model responds with “I don’t know” rather than hallucinating an answer. LangChain ties the retrieval system to the model so that answers stay contextually relevant and accurate; a generation sketch follows this list.
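To make the sourcing and chunking steps concrete, here is a minimal sketch assuming the pypdf reader and LangChain’s text splitter; the file name and chunk sizes are illustrative, not Infinitive’s actual values.

```python
# A minimal sketch of extracting and cleaning text from a text-based PDF.
import re

from pypdf import PdfReader
from langchain.text_splitter import RecursiveCharacterTextSplitter


def extract_text(pdf_path: str) -> str:
    """Pull raw text from a text-based PDF (no OCR needed)."""
    reader = PdfReader(pdf_path)
    text = "\n".join(page.extract_text() or "" for page in reader.pages)
    # Clean up stray whitespace left behind by the PDF layout.
    return re.sub(r"[ \t]+", " ", text).strip()


# Chunking was unnecessary for Infinitive's pre-sectioned PDFs, but a
# splitter like this would divide longer documents into retrievable pieces.
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
chunks = splitter.split_text(extract_text("vacation_policy.pdf"))  # hypothetical file
```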
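The retrieval step can be sketched with the databricks-vectorsearch client, assuming a vector search index built over the HR policy chunks; the endpoint and index names here are hypothetical placeholders.

```python
# A sketch of querying a Databricks Vector Search index for relevant chunks.
from databricks.vector_search.client import VectorSearchClient

client = VectorSearchClient()
index = client.get_index(
    endpoint_name="hr_vector_search",          # hypothetical endpoint name
    index_name="main.hr.policy_chunks_index",  # hypothetical index name
)

# Vector search embeds the query, scores each chunk by similarity, and
# returns the closest matches to use as retrieval context.
results = index.similarity_search(
    query_text="How many vacation days do new employees receive?",
    columns=["chunk_id", "chunk_text"],
    num_results=3,
)
```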
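And for the generation step, a sketch using LangChain’s ChatDatabricks integration against the databricks-dbrx-instruct serving endpoint; the prompt wording and the placeholder chunk are illustrative, not Infinitive’s actual template.

```python
# A sketch of the LangChain generation step against a Databricks endpoint.
from langchain_community.chat_models import ChatDatabricks
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate

# Chunk texts returned by the retrieval step (placeholder value here).
retrieved_chunks = ["Full-time employees accrue 15 vacation days per year."]

# The prompt instructs the model to answer only from the retrieved context,
# and to say "I don't know" when the context lacks the answer.
prompt = ChatPromptTemplate.from_template(
    "Answer the question using only the HR policy context below. "
    'If the context does not contain the answer, say "I don\'t know."\n\n'
    "Context:\n{context}\n\nQuestion: {question}"
)

llm = ChatDatabricks(endpoint="databricks-dbrx-instruct", temperature=0.0)

# LCEL pipes the prompt into the model and parses the reply to a string.
chain = prompt | llm | StrOutputParser()

answer = chain.invoke(
    {
        "context": "\n\n".join(retrieved_chunks),
        "question": "How many vacation days do new employees receive?",
    }
)
print(answer)
```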
Refining the RAG Model
Once the core components of the RAG application were in place, Infinitive focused on ensuring that the retrieval system worked as intended, efficiently and accurately pulling the most relevant data in response to user queries. To achieve this, Infinitive employed a metrics-based development approach: quickly build a scaled-back version of the retrieval system, measure its performance with key accuracy and completeness metrics, and continuously improve it before scaling further. Three metrics anchored this loop (a short sketch of computing them follows the list below):
- Retrieval Precision: This measures how many of the retrieved documents are relevant to the query, ensuring the system returns accurate results.
- Retrieval Recall: This measures how well the system captures all relevant documents in the dataset, ensuring that no important information is missed.
- Retrieval F1 Score: The F1 score, the harmonic mean of precision and recall, combines both into a single metric, balancing the need for accuracy and completeness.
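As a rough sketch, these three metrics can be scored for a single test query as follows; the chunk IDs are hypothetical, and a real evaluation would average the scores over many queries.

```python
# Score retrieval quality for one query given the retrieved chunk IDs and
# the ground-truth relevant chunk IDs.
def retrieval_metrics(retrieved: set[str], relevant: set[str]) -> dict[str, float]:
    true_positives = len(retrieved & relevant)
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Example: 3 of 4 retrieved chunks are relevant; 1 relevant chunk was missed.
print(retrieval_metrics({"c1", "c2", "c3", "c4"}, {"c1", "c2", "c3", "c5"}))
# {'precision': 0.75, 'recall': 0.75, 'f1': 0.75}
```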
Early scores, using a reduced document set, left much to be desired. Two focus areas significantly improved the metrics and responses of Infinitive’s model:
- Adjusting RAG parameters and configurations, such as the amount of retrieved context supplied to the LLM, as source documents were added to the base model.
- Sourcing pre-chunked PDFs containing HR policies and feeding them into the RAG pipeline. If the PDFs had not been pre-chunked, the chunking mechanisms outlined above could have been used to the same effect.
Building Out the Application – Scaling Up
Once Infinitive had a base application with acceptable precision and recall, the full production application was built out. The team developed and implemented the data pipelines required to keep the documents flowing. Infinitive continued to add employee policy documents and worked to parse documents that included more than just text. Throughout this process, the team kept measuring precision, recall, and latency.
Deployment on Databricks Apps
After building the production model, Infinitive faced another decision: where to deploy it. Deploying an AI model online raises many concerns, such as allocating personnel for maintenance, integrating the front end with the model, and securing anything that exposes company information online. To address these concerns without adding heavy overhead to support the application, Infinitive deployed the model on Databricks Apps.
Integrating Infinitive’s model with Databricks Apps was quick and seamless, requiring just a few clicks and a simple search for the notebook where the model was built. Hosting AI chatbots on Databricks Apps means Databricks handles the hosting, saving companies the time and costs that would otherwise go toward full-stack engineers for development, deployment, and maintenance. Infinitive doesn’t have to worry about crashes, bugs, data loss, or the other issues that come with hosting a website in-house.
Once the model was deployed, Infinitive wanted to ensure that only employees could access the RAG application, for data security purposes. Databricks Apps provides granular access controls and user authentication, and integrates with Unity Catalog, Databricks’ governance tool, to keep data secure.
Using Databricks and Lessons Learned
Infinitive used Databricks technology throughout the application, and it worked quite well. Some of their key learnings are:
- Databricks simplifies the process of creating a RAG solution.
- Unity Catalog, as a centralized metadata store that provides unified access to data, offers great benefits.
- The vector search endpoint serves embeddings to the vector search index and is queried much like an API.
- The complexity of RAG implementation depends on inputs and desired features.
- Different file types have unique data processing requirements, such as scanned PDFs versus text-based PDFs.
- Out-of-the-box functions from Databricks demos require specific data formatting, which can be challenging to achieve.
- The RAG chain, including chat history and other features, must be programmatically linked to the LLM.
- The level of data prep and pipeline engineering required shouldn’t be underestimated.
- Many RAG demos focus on model building and downplay the need for data processing. The quality of the data fed into a RAG implementation greatly impacts the results.
- A plan is needed to “productionize” the data pipelines feeding data to the RAG model.
- Well-chosen chunking strategies boost performance and let you customize output.
- There are numerous LLMs, chunking methods, and other techniques to try out and fine-tune.
- There are several types of vector search indexes: managed embeddings, self-managed embeddings, and direct indexes. A managed-embeddings index, for example, can be created as sketched below.
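For instance, a managed-embeddings Delta Sync index can be created with the databricks-vectorsearch client, where Databricks computes embeddings from a text column using a model serving endpoint; all names below are hypothetical placeholders.

```python
# A sketch of creating a vector search index with managed embeddings.
from databricks.vector_search.client import VectorSearchClient

client = VectorSearchClient()

# Databricks keeps the index in sync with the source Delta table and
# computes embeddings with the named model serving endpoint.
index = client.create_delta_sync_index(
    endpoint_name="hr_vector_search",              # hypothetical endpoint
    index_name="main.hr.policy_chunks_index",      # hypothetical index
    source_table_name="main.hr.policy_chunks",     # hypothetical Delta table
    pipeline_type="TRIGGERED",
    primary_key="chunk_id",
    embedding_source_column="chunk_text",
    embedding_model_endpoint_name="databricks-bge-large-en",
)
```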
Conclusion
Infinitive can now use natural language to communicate with its data, without the headaches of maintaining a website or the fear of data leaks. To deploy your company’s models on Databricks Apps, visit Databricks Apps | Databricks. To learn more about RAG and how to build your own model, read Infinitive’s RAG eBook and browse our other eBooks here.