Unlocking the Potential of PDF Data with Custom Chunking in Amazon Bedrock
This post is co-written with Kristina Olesova, Zdenko Estok, and Selimcan Sakar from Accenture.
In today’s data-driven world, organizations often face the challenge of extracting structured information from unstructured PDF documents. These PDFs can contain a myriad of elements, such as images, tables, headers, and text formatted in various styles, making it difficult to parse and analyze the data efficiently.
Additionally, the performance of chatbots and other natural language processing (NLP) applications depends heavily on the chunking strategy employed. Improper chunking can lead to loss of context, resulting in hallucinations and inaccurate responses. Also, the performance of language models is further influenced by the chunk size, where smaller chunks provide more granular information but struggle with generalization, whereas larger chunks might miss important details.
This post explores how Accenture used the customization capabilities of Knowledge Bases for Amazon Bedrock to incorporate their data processing workflow and custom logic to create a custom chunking mechanism that enhances the performance of Retrieval Augmented Generation (RAG) and unlock the potential of your PDF data.
Solution Overview
The Accenture team created a knowledge base with the financial results of Accenture for every quarter from 2020–2024. This document contained images, tables, text stored in different formats, and other noise elements.
In this use case, we wanted to extract granular information contained in the tables and also preserve the good generalization capabilities of foundation models (FMs) to respond to general questions about financial results.
After testing, we found that the search mechanism wasn’t able to correctly retrieve the information for the years and quarters specified in the prompt. The following screenshot shows an example where the query was for information from the first quarter of 2023, but the search mechanism returned information from the first quarter of 2020.
We couldn’t extract the correct chunk of data using different search strategies or by changing the number of retrieved chunks. After more vigorous testing, we identified struggles with parsing the tabular information and retrieving the correct data. Because the issues were related to the inability of the search algorithm to select the correct chunks, we decided to change the chunking strategy and try the new features in Amazon Bedrock.
The architectural flow of the updated solution is as follows:
- Begin by creating a data source with all the data stored in Amazon Simple Storage Service (Amazon S3) or another database. This can include custom PDFs with tables, forms, and other complex elements.
- Run Amazon Textract on the PDFs stored in your data source. Amazon Textract is a highly accurate service that can extract text, tables, and other data from virtually any document.
- Create chunks based on the extractions from paragraphs in the Amazon Textract output. For every chunk, include additional metadata such as chapter titles and document names to preserve context.
- Embed the chunked files into vectors using the console for Knowledge Bases for Amazon Bedrock. Select no chunking while creating a vector representation of chunks.
- Set up the system prompt, search strategies, number of chunks, and metadata filtering if applicable and ask the user for a question.
- Use the vector-search feature of Amazon OpenSearch Service to select the most similar embedded chunks to the user query (prompt).
- Call a FM from Amazon Bedrock on the chunks provided by OpenSearch Service and get the answer.
The steps in the workflow are orchestrated using AWS Lambda, as shown in the following diagram:
The chunking mechanism uses Amazon Textract to detect paragraphs, tables, images, chapter titles, and other PDF layout elements to improve the chunking (without splitting the text in the middle of a sentence or paragraph), eliminate noise, and provide more context for metadata generation. We can use this metadata directly during filtration or as a hint in a prompt template to improve the accuracy of the generated response. Using the specified logic for every PDF element, we can take the correct actions depending on the category of the element.
The main PDF elements are as follows:
- Tables – Tables are the most difficult layout elements in a PDF. The information can be correctly extracted only when headers and column names are correctly identified. This is difficult to achieve with fixed size chunking because there is no way to guarantee that headers will be present in the chunk, together with all the row information. We can use table detection to extract a table and save it in a CSV file, or even directly use it in a database as a data source for agents.
- Images – If the text contains images connected to user instructions, the images can be detected and tagged during preprocessing. Later, these images can be stored in Amazon S3 and displayed in a chat window using relevant tags.
- Page numbers, headers, and footers – This text information doesn’t bring any valuable information for RAG models, and it can confuse them significantly. Moreover, storing page headers and footers can take up significant space in the vector database and incur significant cost with negligible benefits.
- Chapter titles and subtitles – In many documents, chapter titles describe the context of the chapter. This information can help us tag the chunks using metadata, or directly include this information in the filtering process, thereby improving the accuracy and speed of extraction.
Use custom chunking with Knowledge Bases in Amazon Bedrock
In this section, we demonstrate how to use the proposed custom chunking solution.
Note: Keep in mind that the content and code provided are for informational purposes only. You should do an independent assessment before running anything in response to the information that follows.
This involves the following steps:
- Specify the custom metadata for every financial document that you want to include in the analysis. For this post, we specified the information for quarter, fiscal year, company, and other fields:
metadata = {
"metadataAttributes": {
"document_name": document_name.split(".pdf")[0],
"fiscal_year": fiscal_year,
"quarter":quarter,
"main_topic": "",
"secondary_topic": " ",
"format": "Text"
}
}
- Split the PDF files into multiple images or single PDF files. It’s important to have high resolution to properly distinguish all the characters within the files.
- Invoke Amazon Textract to detect the layout items and table items:
def textract_data(self,output):
image = Image.open(output)
document = self.extractor.analyze_document(
file_source=image,
features=[TextractFeatures.LAYOUT,TextractFeatures.TABLES],
save_image=True
)
new_layout=self.save_table(document)
self.save_text(new_layout)
- Save the table information. In this example, we’re using Anthropic’s Claude models, which are able to correctly parse files in CSV format. Export all the tables detected as a CSV, and save the table names and specified table format as additional metadata:
def save_table(self, document):
table_count = 0
if document.tables:
for layout in document.layouts:
if layout.layout_type in 'LAYOUT_TITLE':
self.metadata["metadataAttributes"]["main_topic"] = layout.text
elif layout.layout_type == 'LAYOUT_SECTION_HEADER':
self.metadata["metadataAttributes"]["secondary_topic"] = layout.text
elif layout.layout_type == 'LAYOUT_TABLE':
table = document.tables[table_count]
df_table = table.to_pandas()
self.metadata["metadataAttributes"]["format"] = "Table"
t_file=self.tables_directory + f'/{self.document_name}_table_p{self.page_number}_t{table_count}.csv'
with open(t_file,'w') as csv_file:
csv_file.write(df_table.to_csv(index=False, header=False))
with open(t_file + ".metadata.json",'w') as json_file:
json.dump(self.metadata, json_file)
table_count = table_count + 1
- Further processing is required for information other than tables and images. We create metadata tags containing the information about main chapter titles and subtitles. This information can help you boost performance using metadata filtering or during vector search using a system prompt. For every chunk of data, specify within the metadata to which chapter and subchapter it belongs. Ideally, you should always have one chunk of data for every subchapter, but this isn’t always possible. Many subchapters are too long to be parsed with one chunk. In such cases, you can split the text after the paragraph and use the same metadata for another chunk:
for layout in document:
if layout.layout_type in 'LAYOUT_TITLE':
self.metadata["metadataAttributes"]["main_topic"] = layout.text
elif layout.layout_type == 'LAYOUT_SECTION_HEADER': // split text at the beginning of every subchapter
self.create_chunk() //save the previous chunk in chunk_dic
for chunk in self.chunk_dic: // save all of the chunks for a given chapter
self.metadata["metadataAttributes"]["format"] = "Text"
with open(chunk["output_path"], 'w') as text_file: //create a TXT file with the specified text
text_file.write(chunk["text"] + str(chunk['metadata']))
with open(chunk["output_path"] + ".metadata.json", 'w') as json_file: //create metadata file for the given chunk
json.dump(chunk['metadata'], json_file)
self.subtitle = []
self.chunk_dic = []
self.metadata["metadataAttributes"]["secondary_topic"] = layout.text
elif layout.layout_type in ['LAYOUT_LIST', 'LAYOUT_TEXT']:
if (len(self.new_chunk + layout.text) > chunk_max) and (len(self.new_chunk) > chunk_min): // if the text within the chapter is too big split it at the end of the paragraph
self.create_chunk()
self.new_chunk = self.new_chunk + layout.text
The benefit of this method is that, even if the text continues on the next page, this mechanism is able to assign it to the correct chunk (if the text is within the limited vector space). This helps prevent splitting the text in the middle of a sentence, which can often lead to hallucinations.
- After the text is split, create two files for every chunk:
- A .txt chunk file together with the metadata string.
- A
metadata.json
file that can be used with the knowledge base metadata and filtering.
- When the split is complete, upload the files to Amazon S3 and continue with creating the knowledge base using the no chunking option.
When using the custom chunk option, keep in mind the maximum size of possible chunks. If the text chunk is too large, the vectorization of the files will fail, and the file won’t be available for the knowledge base.
Benefits of Custom Chunking
Custom chunking offers the following benefits:
- Context preservation – By chunking text based on chapters or subchapters, you can make sure that the context of each section remains relevant throughout the chunk, resulting in more accurate vector representations and reducing noise.
- Flexible chunk sizes – Custom chunking allows you to dynamically adjust the chunk sizes, addressing the challenge of selecting the optimal chunk size for different use cases.
- Improved retrieval performance – With custom chunking and the advanced retrieval capabilities of Amazon Bedrock such as metadata filtering, you can significantly enhance the performance of your retrieval frameworks, enabling faster and more accurate insights.
- Seamless integration – Amazon Bedrock seamlessly integrates with other AWS services, such as Amazon S3 and Amazon Textract, providing a streamlined solution for data extraction, organization, and analysis.
Metadata Filtering Compared to System Prompts
Metadata filtering is a powerful feature that significantly enhances the search algorithm’s performance. By using metadata filtering to specify fiscal years and quarters, we achieved notable improvements in response accuracy. Currently, the Amazon Bedrock console requires users to have prior knowledge of metadata filter names and their corresponding values. As of this writing, direct specification of these filters through prompts isn’t supported. Consequently, in practical applications, users would benefit from guidance or hints to assist them in selecting appropriate filter values.
The following figure shows an example of enabling metadata filtering for the same model and chunking logic. In the first question, using only the prompts, the search algorithm failed to provide chunks from the correct documents. In the second question, we filtered by fiscal year (2023) and quarter (Q3). The output of the search algorithm was just one chunk, but the correct one.
Performance Comparison
We compared fixed chunking, custom chunking, and custom chunking with prompts. For vectorization, we used the Amazon Titan Embeddings Text v1 model for custom chunking, baseline, and metadata filtering. We performed additional knowledge base testing with Cohere. We performed all the testing with the Claude Sonnet 3 model and hybrid search, with a maximum retrieved result of 20.
We tested the performance of the models on several tasks:
- Table information – Information only extractable from tables.
- Long questions – Summarizing chapters using multiple chunks. This is a difficult task for models with a small embedding window.
- Year-specific questions – The answers are very short and clear, but the correct extraction relies on the capability of vector search to determine the time span from the user question and extract the chunk corresponding to a given time span.
We evaluated the performance manually by checking factually against the information generated by the model with the source data. The following screenshots show some example questions and answers generated on two different knowledge bases for the year_sensitive
class.
The first example uses custom chunking with an Amazon Titan Embeddings model.
The next example uses Cohere with fixed chunking.
We used the prompt template feature released in April 2024 to focus on detailed information regarding the fiscal years and quarters. This information was the same as it was in the metadata JSON file, and it gives the models some guidelines about what information is important for extracting the valid chunks. The following is an example of the system prompt:
User:
You are a question answering agent specializing in companies' financial statements and reviews. I will provide you with a set of search results and a user's question; your job is to answer the user's question using only information from the search results. Before answering the question, think step by step and verify your response based on the metadataAttributes provided in {} brackets. If provided in the user’s question, always check that the fiscal_year and quarter match with the values provided. In case of the user asking specific questions about the financial outcome of a specific group (such as revenues or net income) focus on search results that have "Table" specified in the format tag in metadataAttributes. To improve the results, you can verify the values of main and secondary topics. The values should be related to the user’s questions.
Here are the search results in numbered order:
$search_results$
Here is the user's question:
$query$
$output_format_instructions$
Assistant:
The adjusted prompt template improved the accuracy of the results. For the knowledge base created with an Amazon Titan Embeddings model and fixed chunking, the accuracy of extracted results increased to 70 percent accuracy. This number served as a baseline for our evaluation.
After switching from fixed chunking to custom chunking with Amazon Titan, the accuracy of retrieved results increased by 17 percent.
Interestingly, Cohere led to similar results as using custom chunking with regards to response accuracy but showed slightly less precise richness in summarization (long answers