Infinitive Uses Retrieval Augmented Generation (RAG) F1 Scores to Guide Development

Summary

Retrieval Augmented Generation (RAG) is a relatively new architecture that allows private data to be accessed through natural language queries using Large Language Model (LLM) technology. Applications like customer service have benefited from this architecture. Swedish fintech company Klarna reports that its RAG-based customer service application does the work of 700 human customer service agents, reportedly driving a $40M profit improvement for Klarna in 2024.

Equally important, today’s RAG applications set the foundation for tomorrow. A RAG application built today will grow in capability as the underlying AI progresses.

OpenAI’s five stages of artificial intelligence (AI) are:

  • Stage 1: Chatbots: AI with conversational language, like what is seen in present-day chatbots
  • Stage 2: Reasoners: AI that can solve problems at a human level
  • Stage 3: Agents: AI systems that can take actions on a user’s behalf
  • Stage 4: Innovators: AI that can help with invention
  • Stage 5: Organizations: AI that can perform the work of an organization

Last Thursday (September 12), OpenAI released OpenAI o1 in preview, the first step toward Stage 2 (Reasoners). As future stages are released, the power of RAG infrastructures will only grow.

RAG application development is more about assembling the right components than writing code. Some components (like the chunking module) are required. Other components are optional but help improve recall and accuracy. The goal is to get the level of accuracy required for the application using the minimum number of components.
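To make the component idea concrete, here is a minimal sketch of one required component, a chunking module. The function name and parameters (chunk_size, overlap) are illustrative assumptions, not taken from any particular library:

    # A minimal sketch of fixed-size chunking with overlap. The function name
    # and parameters are illustrative, not from any specific library.
    def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
        """Split a document into overlapping chunks for embedding and retrieval."""
        if overlap >= chunk_size:
            raise ValueError("overlap must be smaller than chunk_size")
        step = chunk_size - overlap
        return [text[start:start + chunk_size] for start in range(0, len(text), step)]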

Infinitive uses the F1 score as a guideline for deciding which, if any, optional components to use.

What is the F1 Score?

The F1 score is a measure used to evaluate the performance of a Retrieval-Augmented Generation (RAG) application, especially on tasks like answering questions or retrieving relevant information. It combines two important aspects: precision and recall. Precision measures how many of the results provided by the RAG application are correct, while recall measures how many of the correct results were successfully retrieved by the application. The F1 score balances these two, providing a single number that reflects both accuracy (precision) and completeness (recall). An F1 score ranges from 0 to 1; a score of 1 means perfect precision and recall, and a high score indicates the RAG application is giving mostly correct answers while finding most of the relevant information. In simple terms, the F1 score helps developers and users understand how well the RAG application is doing its job.

Calculating the F1 Score:

The F1 score is the harmonic mean of two subsidiary calculations – precision and recall.

Precision is the ratio of correctly retrieved items to the total number of items retrieved.

Precision = True Positives / (True Positives + False Positives)

Recall is the ratio of correctly retrieved items to the total number of relevant items.

Recall = True Positives / (True Positives + False Negatives)

F1 is the harmonic mean of precision and recall:

F1 = 2 × (Precision × Recall) / (Precision + Recall)
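In Python, with the counts defined above, the calculation looks like this (a minimal sketch; the worked numbers are illustrative):

    def f1_score(true_positives: int, false_positives: int, false_negatives: int) -> float:
        """F1 as the harmonic mean of precision and recall, per the formulas above."""
        retrieved = true_positives + false_positives   # everything the app returned
        relevant = true_positives + false_negatives    # everything it should have returned
        if retrieved == 0 or relevant == 0:
            return 0.0
        precision = true_positives / retrieved
        recall = true_positives / relevant
        if precision + recall == 0:
            return 0.0
        return 2 * (precision * recall) / (precision + recall)

    # Worked example: 8 correct retrievals, 2 incorrect, 4 relevant items missed.
    # Precision = 8/10 = 0.80, recall = 8/12 ≈ 0.67, so F1 ≈ 0.73.
    print(round(f1_score(8, 2, 4), 2))  # 0.73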

Understanding the F1 Score:

For a Retrieval-Augmented Generation (RAG) application, F1 scores can vary depending on the complexity of the task, the quality of the data, and the specific domain in which the application is used. However, we can provide a general guideline for interpreting F1 scores:

  • Poor F1 Score (0.0 – 0.5): An F1 score in this range indicates that the RAG application is not performing well. It means either the precision (correctness of the retrieved results) or recall (ability to find relevant results) is quite low. This could suggest that the system is retrieving a lot of incorrect information or missing a lot of relevant information, making it unreliable for users.
  • Average F1 Score (0.5 – 0.7): An F1 score in this range suggests moderate performance. The RAG application has a balance between precision and recall, but there is still room for improvement. It’s capturing some relevant information correctly but may still provide incorrect results or miss some important information.
  • Good F1 Score (0.7 – 1.0): An F1 score in this range indicates strong performance. The application is generally good at finding the right information (high recall) and making sure the information it provides is accurate (high precision). A score closer to 1.0 means the RAG system is highly reliable, providing mostly correct and comprehensive responses.

In practice, what is considered a “good” F1 score can depend on the specific application and industry. For example, in a high-stakes domain like healthcare, a higher F1 score (e.g., above 0.8) might be necessary, while in more general tasks, an F1 score of around 0.7 might be acceptable.
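For reference, these bands can be expressed as a simple helper (a sketch using the guideline thresholds above, with boundary values resolved upward):

    def interpret_f1(score: float) -> str:
        """Map an F1 score to the rough quality bands described above."""
        if not 0.0 <= score <= 1.0:
            raise ValueError("F1 scores range from 0 to 1")
        if score < 0.5:
            return "poor"
        if score < 0.7:
            return "average"
        return "good"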

Running an F1 Test:

The approach for determining the F1 score is based on “old school” data preparation: test prompts are developed along with their expected results, creating an evaluation set. In the Databricks-based architecture that Infinitive uses for internal RAG development, Mosaic AI is used to execute these evaluation sets and calculate F1.
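The shape of such an evaluation set is simple. The sketch below is framework-agnostic Python, not the Mosaic AI API; it pairs each test prompt with an expected answer and scores responses with a token-overlap F1, one common way to apply the metric to question answering. The prompt, expected answer, and rag_app callable are all hypothetical:

    from collections import Counter

    # Illustrative evaluation set: hand-built prompt/expected pairs.
    evaluation_set = [
        {"prompt": "What is the refund window?", "expected": "30 days from purchase"},
        # ... more prompt/expected pairs developed during data preparation
    ]

    def token_f1(predicted: str, expected: str) -> float:
        """Token-overlap F1 between a generated answer and the expected answer."""
        pred_tokens = predicted.lower().split()
        exp_tokens = expected.lower().split()
        if not pred_tokens or not exp_tokens:
            return 0.0
        common = sum((Counter(pred_tokens) & Counter(exp_tokens)).values())
        if common == 0:
            return 0.0
        precision = common / len(pred_tokens)
        recall = common / len(exp_tokens)
        return 2 * precision * recall / (precision + recall)

    def evaluate(rag_app, eval_set) -> float:
        """Average token-level F1 of a RAG application over an evaluation set.
        rag_app is any callable mapping a prompt string to an answer string."""
        scores = [token_f1(rag_app(item["prompt"]), item["expected"]) for item in eval_set]
        return sum(scores) / len(scores)

Here, rag_app stands in for whatever callable returns the application’s answer to a prompt; in Infinitive’s architecture, Mosaic AI plays the role of executing the evaluation set and computing the score.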

Coming in next week’s blog

See how Infinitive uses optional RAG components to improve the F1 score.