In AI-driven systems like RAG, document loading quality is a key factor in improving relevance and coherence and in reducing errors. This article is simply a reminder to pay more attention to document loading quality and to spend a bit more time on processing before running the heavy gears.
The rise of generative AI in the last couple of years has changed the way people interact with intelligent systems. This is especially obvious in the breakthroughs achieved in natural language processing with large language models (LLMs). People immediately started looking for ways to improve the output quality of these models and for solutions to their limitations. One approach that tackles a major limitation of LLMs is Retrieval-Augmented Generation (RAG). You have probably seen this term a lot in the online AI community, but you might still wonder: what is RAG?
What exactly is RAG?
As the name implies, RAG is a way to help LLMs give more tailored answers to domain-specific queries, or to tackle questions they were not trained on, by retrieving relevant information from a knowledge base and providing it as context to the LLM. A simplified version consists of loading a document, chunking its content into smaller pieces, then embedding these chunks in a vector store. When a user asks a question, the embedding of that question is compared via similarity search with the content of the vector store, and the most relevant pieces of the document help the LLM answer the user.
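To make the load → chunk → embed → retrieve loop concrete, here is a minimal toy sketch. A bag-of-words counter stands in for a real embedding model, and a plain list stands in for a vector store; a production pipeline would use an actual embedding model and vector database instead.

```python
import math
from collections import Counter

# Toy embedding: bag-of-words term counts (a stand-in for a real
# embedding model such as a sentence transformer).
def embed(text):
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def chunk(document, size=8):
    # Naive fixed-size chunking by word count.
    words = document.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def retrieve(question, chunks, k=1):
    # Similarity search: rank chunks by cosine similarity to the question.
    q = embed(question)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]

document = (
    "The quarterly report shows revenue grew by ten percent. "
    "The company also opened two new offices in Europe. "
    "Employee headcount stayed flat compared to last year."
)
chunks = chunk(document)
top = retrieve("How much did revenue grow?", chunks)
```

The retrieved chunk (the one mentioning revenue) would then be prepended to the user's question as context in the LLM prompt.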
Here is a diagram that can help you understand more about RAG systems. For more details, we created this step-by-step implementation guide.
As you can see, retrieving documents and feeding them to the LLM is the backbone of the entire operation. Therefore, bad documents and inaccurate extraction will not only degrade relevance search, but will also result in mediocre knowledge parsing, especially for tabular data and documents with special formats (e.g., receipts, scientific papers and magazines, mathematical formulas, figures).
Areas of Impact:
Relevance: High-quality documents help improve the relevance of generated content.
Coherence: The retrieval of well-structured and coherent documents leads to more logically consistent generated outputs.
Factuality: Higher-quality documents can reduce the risk of factual errors (hallucinations) in the generated content.
Case Studies:
The following section is for the more skeptical readers. If you are not already convinced that document quality and processing are crucial for your project, these case studies should persuade you to pay more attention to your data extraction and document processing.
Personal project: RAG system for enhancing mathematical question answering
This case study is related to a specific problem that I encountered while developing an agentic RAG system that helps students with their mathematical questions.
Objective: The idea was to augment Llama 3.1 with domain-specific knowledge from math textbooks.
Problem: Loading the textbook using generic PDF loaders captured the semantics of mathematical expressions poorly. Traditional OCR on images of textbook pages did not help either. This degraded the chunk embeddings and therefore retrieval: the retrieved documents were not deemed relevant by an LLM semantic relevance grader.
Solution: Improvements were achieved using a custom Fast R-CNN mathematical formula detector trained on FormulaNet to identify the coordinates of each expression in the textbook page image. Each formula was cropped and fed into a LaTeX generation engine (pix2tex), while the remaining plain text was extracted with EasyOCR. The two engines work together to provide efficient and accurate document loading (even of handwritten content), which improved retrieval and answer generation.
Research case study: Enhancing Image Retrieval in RAG Systems (RAG Beyond Text)
Problem: Traditional image retrieval in RAG systems relying on OCR-LLM techniques struggles with OCR-incompatible images and cases where image and text do not semantically align.
Approach: They developed a methodology that leverages positional information from multi-modal documents (text and images) and uses advanced retrieval and prompt engineering to improve image retrieval. This system maintains the integrity of both text and image data in responses.
Results:
Achieved State of The Art (SoTA) performance across simple and complex queries.
Outperformed models like GPT-4 Vision in retrieving OCR-incompatible images, particularly in customized scientific diagrams.
Successfully retrieved images important for query completeness, even when the text did not align with the image content.
Industrial case study: Dealing with tabular data in financial documents
In Retrieval-Augmented Generation (RAG) workflows, handling documents containing tables, such as earnings reports, poses significant challenges due to poor retrieval and inaccurate value generation. This case study focuses on how improving the quality of document preprocessing, extraction, and formatting can enhance table retrieval and reduce errors in answers derived from tables.
Problem: Vector search algorithms often fail to pinpoint the correct tables in documents with many similar or related tables. In addition, LLMs misinterpret values within tables due to inconsistent formatting, leading to incorrect answers, particularly in complex tables with nested structures.
Solution: Enhancing Document Quality for RAG
Precise Table Extraction:
Improved document quality by using a reliable extraction tool (Unstructured.io) to cleanly extract all tables, ensuring no loss of information.
Contextual Enrichment:
Improved table context by generating rich, descriptive metadata for each table using an LLM. This added context helped the RAG system retrieve the correct table and understand its relevance within the document.
Format Standardization:
Addressed formatting inconsistencies by converting tables into a standardized markdown format. Uniform formatting helped prevent LLM confusion during the retrieval and generation stages.
Unified Embedding:
Combined the enriched context and standardized tables into single “table chunks” for optimized vector storage and more precise retrieval during queries.
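The steps above can be sketched in a few lines. The markdown conversion and chunk assembly are straightforward; `describe_table` is a hypothetical stand-in for the LLM call that would generate rich descriptive metadata in a real pipeline.

```python
# Sketch of the "table chunk" idea: a standardized markdown table plus
# generated context, stored as one unit for embedding and retrieval.

def to_markdown(headers, rows):
    # Convert a table to uniform markdown to avoid formatting confusion.
    lines = ["| " + " | ".join(headers) + " |",
             "| " + " | ".join("---" for _ in headers) + " |"]
    for row in rows:
        lines.append("| " + " | ".join(str(v) for v in row) + " |")
    return "\n".join(lines)

def describe_table(headers, rows):
    # Hypothetical stand-in: a real pipeline would prompt an LLM here
    # for a rich, document-aware description of the table.
    return f"Table with columns {', '.join(headers)} and {len(rows)} rows."

def build_table_chunk(headers, rows):
    # Contextual enrichment + standardized table = one "table chunk".
    return describe_table(headers, rows) + "\n\n" + to_markdown(headers, rows)

chunk = build_table_chunk(
    ["Quarter", "Revenue ($M)"],
    [["Q1", 12.4], ["Q2", 13.1]],
)
```

Embedding the description together with the table means a query like "Q2 revenue" can match on the contextual text even when the raw table cells alone would not score well.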
Results:
Increased Retrieval Accuracy: The improved document quality through contextualization and formatting led to a more reliable selection of the correct tables, particularly in complex documents with multiple tables.
Reduced Generation Errors: By standardizing table formats, LLMs better interpreted table values, resulting in more accurate responses.
Impact on RAG Performance: The structured, high-quality document processing directly improved RAG system outcomes, particularly in finance-related documents like earnings reports where accurate table data is crucial.
Document processing techniques
Intelligent Document Processing (IDP)
In the realm of document processing, Intelligent Document Processing (IDP) stands out as a transformative technology.
Intelligent Document Processing (IDP) goes beyond traditional data extraction methods by incorporating advanced technologies to understand and process unstructured data. Unlike basic OCR, which simply converts images of text into machine-readable text, IDP can classify documents, extract relevant information, and validate data accuracy. This makes it an ideal solution for handling complex documents such as invoices, contracts, and forms, where key details like dates, amounts, and names need to be accurately extracted without human intervention.
Document Classification
How It’s Done: Document classification involves categorizing documents into predefined types, such as invoices, purchase orders, or receipts. This is achieved using machine learning models trained on labeled datasets.
Technologies and Architectures:
Convolutional Neural Networks (CNNs): Used for image-based document classification, CNNs can analyze the visual layout and features of a document to determine its type.
Natural Language Processing (NLP): Techniques like BERT (Bidirectional Encoder Representations from Transformers) are used to understand the text content and context within documents, aiding in accurate classification.
Amazon Comprehend: This AWS service uses NLP to extract insights and relationships in text, helping classify documents based on their content.
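For intuition, here is a minimal classification sketch. Keyword heuristics stand in for a trained model (a CNN over page images or a BERT text classifier would replace this scoring function in practice); the label set and keywords are illustrative assumptions.

```python
# Minimal document-classification sketch: score each predefined type by
# keyword hits and return the best match. A trained CNN or BERT model
# would replace this heuristic in a real IDP system.
KEYWORDS = {
    "invoice": ["invoice", "amount due", "bill to"],
    "purchase_order": ["purchase order", "po number"],
    "receipt": ["receipt", "change due", "cashier"],
}

def classify(text):
    text = text.lower()
    scores = {
        label: sum(kw in text for kw in kws)
        for label, kws in KEYWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unknown"
```

The structure is the same either way: map a document to one of a fixed set of types, with an "unknown" fallback for low-confidence cases.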
Data Extraction
How It’s Done: Data extraction involves identifying and pulling out relevant data fields from documents. This can include text, tables, and other structured or unstructured data.
Technologies and Architectures:
Optical Character Recognition (OCR): Tools like Tesseract and ABBYY FineReader convert scanned images of text into machine-readable text.
Amazon Textract: This AWS service automatically extracts text and data from scanned documents, including forms and tables.
Deep Learning Models: Techniques like Long Short-Term Memory (LSTM) networks and Transformers are used to extract and understand complex data patterns within documents.
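Once OCR has produced machine-readable text, field extraction can start as simply as pattern matching. The sketch below pulls a date and a total out of invoice-like text with regular expressions; the field names and patterns are illustrative assumptions, and services like Textract or layout-aware deep models handle the messier real-world cases.

```python
import re

# Sketch of field extraction from already-OCR'd text using regular
# expressions; real pipelines use Textract or layout-aware models
# for forms and tables.
def extract_fields(text):
    date = re.search(r"\b(\d{4}-\d{2}-\d{2})\b", text)
    total = re.search(r"Total:?\s*\$?([\d,]+\.\d{2})", text)
    return {
        "date": date.group(1) if date else None,
        "total": total.group(1) if total else None,
    }

fields = extract_fields("Invoice date: 2024-03-15\nTotal: $1,249.99")
```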
Data Validation
How It’s Done: Data validation ensures the accuracy and consistency of the extracted data by checking it against predefined rules or external databases.
Technologies and Architectures:
Rule-Based Systems: Simple validation rules can be implemented using regular expressions or custom scripts to check data formats and values.
Machine Learning Models: Supervised learning models can be trained to recognize valid data patterns and flag anomalies.
Amazon SageMaker: This AWS service provides tools to build, train, and deploy machine learning models for data validation tasks.
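A rule-based validator, as mentioned above, can be as simple as a table of per-field patterns applied to each extracted record. The rules below are illustrative assumptions matching the extraction sketch's fields.

```python
import re

# Rule-based validation sketch: check extracted fields against simple
# format rules before they enter downstream systems. An ML model could
# replace or supplement these rules to flag subtler anomalies.
RULES = {
    "date": re.compile(r"^\d{4}-\d{2}-\d{2}$"),
    "total": re.compile(r"^[\d,]+\.\d{2}$"),
}

def validate(record):
    errors = []
    for field, pattern in RULES.items():
        value = record.get(field)
        if value is None or not pattern.match(value):
            errors.append(field)
    return errors  # empty list means the record passed
```

Records that fail validation can be routed to human review rather than silently entering the knowledge base.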
Continuous Learning
How It’s Done: Continuous learning involves the system improving its accuracy and efficiency over time by learning from new data and feedback.
Technologies and Architectures:
Reinforcement Learning: This approach allows models to learn from their interactions with the environment, improving their performance based on feedback.
Active Learning: This technique involves retraining models on new data that the system is uncertain about, improving its accuracy over time.
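The active-learning step can be sketched as uncertainty sampling: select the unlabeled examples the current model is least sure about, send them for human labeling, and retrain. The example IDs and probabilities below are illustrative.

```python
# Active-learning sketch (uncertainty sampling): pick the examples whose
# predicted probability is closest to 0.5, i.e. where the model is least
# certain, for human labeling and retraining.
def select_uncertain(predictions, k=2):
    # predictions: list of (example_id, probability_of_positive)
    ranked = sorted(predictions, key=lambda p: abs(p[1] - 0.5))
    return [example_id for example_id, _ in ranked[:k]]

batch = select_uncertain(
    [("doc1", 0.97), ("doc2", 0.52), ("doc3", 0.08), ("doc4", 0.41)]
)
```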
A post-processing technique
Re-ranking
This technique deserves an entire blog post of its own; luckily, we made one here to help you gain a more in-depth understanding. For now, here is a quick overview:
How It’s Done: Re-ranking is a technique used to refine the initial set of retrieved documents, ensuring that the most relevant documents are prioritized for further processing or generation.
Technologies and Architectures:
Cross-Encoders: These models jointly encode the query and each candidate document, providing a more precise relevance score by considering the interactions between them. BERT-based re-rankers are commonly used for this purpose.
Graph Neural Networks (GNNs): Techniques like Graph Attention Networks (GAT) can model the relationships between entities in the query and the documents, enhancing the relevance of the re-ranked results.
Bi-Encoders and Cross-Encoders: In the retrieval phase, bi-encoders (e.g., Sentence-BERT) are used to retrieve candidate documents by encoding both the query and documents into vector embeddings. In the re-ranking phase, cross-encoders reassess the relevance of each document by jointly processing the query and document.
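The two-stage shape described above can be sketched with toy scoring functions. Token overlap stands in for the cheap bi-encoder stage, and a (still toy) joint query-document score stands in for the cross-encoder; a real system would use something like Sentence-BERT for retrieval and a BERT cross-encoder for re-ranking.

```python
# Two-stage retrieve-then-rerank sketch. cheap_score mimics a fast
# bi-encoder pass over all documents; precise_score mimics a slower
# cross-encoder that jointly considers the query and each candidate.
def cheap_score(query, doc):
    return len(set(query.lower().split()) & set(doc.lower().split()))

def precise_score(query, doc):
    # Toy "joint" score: token overlap plus a bonus for exact phrase match.
    bonus = 2 if query.lower() in doc.lower() else 0
    return cheap_score(query, doc) + bonus

def retrieve_and_rerank(query, docs, k=3, top=1):
    # Stage 1: cheap retrieval of k candidates.
    candidates = sorted(docs, key=lambda d: cheap_score(query, d),
                        reverse=True)[:k]
    # Stage 2: precise re-ranking of the shortlist only.
    reranked = sorted(candidates, key=lambda d: precise_score(query, d),
                      reverse=True)
    return reranked[:top]

docs = [
    "earnings call transcript for Q2",
    "the quarterly earnings report shows revenue growth",
    "growth in revenue was discussed at the board meeting",
]
best = retrieve_and_rerank("revenue growth", docs)
```

The key design point survives the toy setting: the expensive scorer only ever sees the small candidate set, which is what makes cross-encoder re-ranking affordable at query time.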
Conclusion
By now, you should be convinced that spending a little more time gathering good-quality documents and data for your RAG system might be the solution you need. Processing those documents is equally important, from organization to structured data extraction, special-format handling, and robust ingestion pipelines. Suggested read: Advanced RAG system: beyond essentials.