Amazon Q Business – Framework for Evaluating Accuracy

Contents

Revolutionizing Enterprise Operations with Generative AI like Amazon Q Business
Different evaluation methods for generative AI solutions
Identify use cases that are suitable for Amazon Q Business
Choose a representative dataset that reflects employees’ day-to-day queries
Generate a list of queries
Select metrics for evaluation
Scoring system
Solution architecture
Improve Amazon Q Business response after the evaluation
Conclusion
About the Authors

Revolutionizing Enterprise Operations with Generative AI like Amazon Q Business

Generative artificial intelligence (AI), particularly in the form of Retrieval Augmented Generation (RAG) solutions, is rapidly demonstrating its potential to revolutionize enterprise operations. RAG models combine the strengths of information retrieval systems with advanced natural language generation, enabling more contextually accurate and informative outputs. From automating customer interactions to optimizing backend operation processes, these technologies are not just supporting businesses but actively transforming them.

Until now, enterprises could spend hours deciding which RAG solution would best meet the needs of their use cases, a difficult decision because of the many elements to consider. This evaluation also needs to be repeated periodically because RAG technology evolves rapidly, which further limits how quickly enterprises can deliver transformative generative AI experiences to their workforces. Now generally available, Amazon Q Business is a generative AI assistant that empowers employees with your company’s proprietary knowledge and data, and it can significantly simplify building generative AI applications. It provides a fully managed RAG approach, so you can build a generative AI-powered chatbot quickly without managing, operating, or experimenting with large language models (LLMs) yourself.

Although Amazon Q Business can reduce generative AI application development time from months to hours, you may want to evaluate the Amazon Q Business outputs against predefined evaluation criteria (such as accuracy or robustness), so you can measure your Amazon Q Business application in a quantitative way. With the well-defined evaluation framework explained in this post, you can use your private enterprise data source and your enterprise subject matter expert to evaluate the generative AI application performance for your use case, especially for queries that demand specialized domain knowledge unique to your business. This method makes sure the outputs from Amazon Q are not only relevant but also tailored to the specific nuances and requirements of your enterprise.

Different evaluation methods for generative AI solutions

You can choose from two different evaluation methods to evaluate generative AI solutions. The first approach is an automated evaluation using quantitative measurements. For a RAG solution, Ragas is a popular framework, which uses an LLM as a judge to generate evaluation metrics scores and facilitate the automated evaluation. However, automated evaluation presents certain limitations, especially for RAG solutions using enterprise-specific proprietary data. These metrics often fail to capture the full complexity of human-like language generation, lacking the ability to assess semantic understanding and the contextual nuances unique to a specific domain.

The second approach is a human in the loop (HITL) evaluation. This is more suitable for tasks that require a deep understanding of the domain, because humans can understand context, subtleties, and nuances better than automated metrics. More importantly, human evaluators can provide detailed feedback, which generative AI solutions such as Amazon Q Business can act on through various built-in enhancement approaches. Lastly, HITL brings qualitative assessments and human judgment that automated evaluation metrics lack.

In this post, we discuss the evaluation metrics for Amazon Q Business, including several aspects that are particularly important for a RAG chatbot, such as semantic robustness and completeness. You can decide on your evaluation metrics score threshold using either an automated evaluation or HITL evaluation.

Identify use cases that are suitable for Amazon Q Business

Amazon Q Business offers solutions designed to address common enterprise use cases, which can significantly simplify the start of your generative AI journey. Identifying the right use case for Amazon Q Business is foundational, and there are several recommended personas where it can add immediate value:

Customer/IT support – Reduce IT resolution times for frequently asked questions and support queries by retrieving accurate information from your enterprise’s knowledge base.
Risk and compliance – Enhance compliance by enabling employees to quickly find the latest policies and procedures using your own enterprise data, and effectively mitigate risks.
Developers – Streamline the evaluation of technical debt and code quality across extensive knowledge bases spanning multiple repositories and content collaboration and management workspaces to address developers’ questions.
Human resources – Expedite the onboarding process for new hires by efficiently finding the most pertinent documentation, thereby reducing the burden associated with navigating and sifting through voluminous content.
Operations – Accelerate equipment maintenance and troubleshooting for plant operators and technicians with quick access to detailed information, such as user manuals, QuickStart guides, technical manuals, maintenance logs, safety data sheets, and more.

Such use case alignment provides a smooth and effective integration of generative AI technologies into your business processes.

Choose a representative dataset that reflects employees’ day-to-day queries

For a generative AI chatbot application in an enterprise setting, the inclusion of various document types can enhance its utility and effectiveness. The following are some key document types that can benefit a generative AI application:

Standard operating procedures – These documents provide step-by-step instructions that can help Amazon Q guide users through specific processes or tasks.
Product manuals and guides – Product-related documents enable Amazon Q to provide detailed information and troubleshooting assistance to users.
FAQs and help articles – FAQs and structured help documents allow Amazon Q to quickly provide answers to common questions, improving user experience and efficiency.
Policy documents – Company policy documents help Amazon Q inform employees about company regulations, compliance requirements, and ethics guidelines.
Training materials – Training documents can be used by Amazon Q to assist in employee onboarding, continuous education, and skill development.
Contracts and agreements – For legal and administrative functions, Amazon Q can reference specific clauses and terms from contracts to support negotiations or compliance checks.
Meeting minutes and reports – Summarized content of meetings and reports can help Amazon Q provide quick updates and insights about business operations or project statuses.
Emails and communications – Analyzing archived emails and other communications can enable Amazon Q to extract and summarize information, provide context for ongoing discussions, or retrieve historical data.
Financial documents – Financial statements, budgets, and audit reports can be sources for Amazon Q to answer queries related to financial performance and planning.

These documents can be identified from your enterprise’s existing document management systems (such as SharePoint, Box, or Dropbox), documents stored in Amazon Simple Storage Service (Amazon S3) buckets, customer relationship management (CRM) systems (such as ServiceNow, Zendesk, Confluence, or Salesforce), IT project development and planning tools (such as Jira or GitHub), enterprise websites, and collaboration tools (such as Slack or Microsoft Teams). Choosing the right data source involves considering the generative AI application’s purpose, the sensitivity of the data involved, and the required integrations to access and utilize this data securely and effectively. Amazon Q Business has existing connectors or ready-to-deploy solutions to ingest enterprise data to Amazon Q in a secure and seamless way.

Generate a list of queries

Because Amazon Q Business mainly uses a RAG mechanism, it’s important to design evaluation queries that exercise a RAG approach to knowledge discovery and analysis. There are two different Q&A methods with generative AI: closed book Q&A, where the model answers only from what it learned during training, and open book Q&A, where the model answers using reference material supplied at query time. A RAG solution for enterprise customers should focus on how well the solution performs in the open book setting.

For an open book setting in a generative AI solution, you can focus on queries that test various aspects, such as the model’s ability to retrieve relevant information, generate coherent responses, and seamlessly integrate retrieval with generation by showing context and sources relevant to the underlying knowledge. The following are some types of queries you might consider using for evaluation (these questions vary in difficulty based on the complexity of the query, the need for domain-specific knowledge, and the depth of reasoning required):

Simple:

Factual accuracy – Queries that require the model to retrieve and synthesize factual information accurately. For example, “What are the current product lines of company XXX?”

Hard:

Domain-specific knowledge – Questions focused on specific domains, especially in legal documents, medicine, or technology, to evaluate the model’s effectiveness in specialized areas. For example, “Who is the best contact person in the QA department for product return procedures?”
Contextual understanding – The answer must be inferred from the underlying corpus and can’t be retrieved directly from the text. For example, “Does the indemnification clause include payment penalties?”

Challenging (even harder questions):

Complex question answering – Questions that involve layers of inference, testing the model’s ability to handle complexity and context. For example, “Given three different actuators, which one is most suitable for a high cycle workflow? Please provide reasoning.”
Ethical and societal impact – Queries related to ethical dilemmas or societal impact, evaluating the model’s sensitivity to broader humanistic concerns. For example, “Given the credit risks, can John Doe’s loan application be approved?”

Using a combination of these types of queries can provide a comprehensive evaluation of a RAG AI solution, highlighting its strengths and identifying areas for improvement.
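As an illustration of how such a query list might be organized, the following minimal sketch tags each query with its category and difficulty tier so you can check that the set covers all tiers. The `EvalQuery` class, its field names, and the `difficulty_mix` helper are hypothetical and are not part of Amazon Q Business; the example queries are taken from the list above.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalQuery:
    """One evaluation query. Field names are illustrative assumptions."""
    text: str
    category: str    # for example, "factual accuracy"
    difficulty: str  # "simple", "hard", or "challenging"


QUERIES = [
    EvalQuery("What are the current product lines of company XXX?",
              "factual accuracy", "simple"),
    EvalQuery("Does the indemnification clause include payment penalties?",
              "contextual understanding", "hard"),
    EvalQuery("Given three different actuators, which one is most suitable "
              "for a high cycle workflow? Please provide reasoning.",
              "complex question answering", "challenging"),
]


def difficulty_mix(queries):
    """Count queries per difficulty tier to check that the set is balanced."""
    return Counter(q.difficulty for q in queries)


print(dict(difficulty_mix(QUERIES)))
```

Keeping the tier on each query also makes it possible to report scores per difficulty level later, rather than only in aggregate.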

Select metrics for evaluation

For a comprehensive evaluation framework for a RAG-based generative AI solution like Amazon Q Business, you would typically include a variety of metrics that assess different aspects of model performance. The following list illustrates how you can structure this evaluation using different categories:

Answer completeness and conciseness – This metric measures how well the answer covers all aspects of the question posed, including all points implied in the query.
Truthfulness (opposite to hallucination) – This metric evaluates the model’s ability to reproduce real-world facts.
Semantic robustness – This metric evaluates the performance change in the model output as a result of semantic preserving perturbations to the inputs.
Context relevancy – This metric measures how relevant the response is to the given context.
Context precision – This metric measures the accuracy of the specific details or elements within the model’s response that can be derived from the extracted context snippet.
Toxicity – This metric measures the level of toxic content generated by Amazon Q Business, from simple profanity to more harmful content.
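To make the semantic robustness metric more concrete, the following sketch applies a few meaning-preserving perturbations to a query and compares each answer against the baseline answer. Everything here is an assumption made for illustration: the `ask` callable stands in for whatever sends a query to your application, the perturbations are deliberately simple, and word-set (Jaccard) overlap is only a crude stand-in for a real semantic similarity measure or a human judgment.

```python
def perturb(query: str) -> list[str]:
    """Return meaning-preserving variants of a query (illustrative, not exhaustive)."""
    return [
        query.lower(),              # casing change
        query.upper(),              # casing change
        "  ".join(query.split()),   # extra whitespace between words
        query.rstrip("?.!"),        # dropped end punctuation
    ]


def token_overlap(a: str, b: str) -> float:
    """Jaccard overlap of lowercase word sets; a crude proxy for similarity."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)


def robustness_score(ask, query: str) -> float:
    """Average similarity between the baseline answer and perturbed-query answers."""
    baseline = ask(query)
    variants = perturb(query)
    return sum(token_overlap(baseline, ask(v)) for v in variants) / len(variants)


def stub_ask(query: str) -> str:
    """Hypothetical stand-in for the application under test."""
    return "The warranty period is two years."


print(robustness_score(stub_ask, "How long is the warranty period?"))
```

A score close to 1.0 suggests the answers barely change under perturbation; lower scores point to queries worth reviewing by hand.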

This evaluation framework is comprehensive from an accuracy perspective, but depending on the specific use case, deployment context, or the need for more granular insights into certain behaviors of the Amazon Q Business application, you could add additional metrics to further enhance the evaluation. Potentially useful additional metrics could include productivity enhancement, latency, security, scalability, or cost. Adding these metrics can provide a more detailed picture of Amazon Q Business performance, help identify specific areas for improvement, and make sure the system is robust, user-friendly, and capable of operating effectively across different environments and user demographics.

If you’re using a human evaluation process, you can use a scorecard template to help the evaluator assess an Amazon Q application in a systematic way.
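One possible shape for such a scorecard is sketched below. The metric names mirror the list in this section, but the `ScorecardRow` class, the 1 to 5 rating scale, and the free-text notes field are assumptions made for illustration rather than a template prescribed by Amazon Q Business.

```python
from dataclasses import dataclass, field

# Metric names taken from the evaluation metrics listed in this section.
METRICS = (
    "completeness",
    "truthfulness",
    "semantic_robustness",
    "context_relevancy",
    "context_precision",
    "toxicity",
)


@dataclass
class ScorecardRow:
    """One evaluator's ratings for a single query and response pair."""
    query: str
    response: str
    scores: dict = field(default_factory=dict)  # metric name -> rating (assumed 1-5 scale)
    notes: str = ""

    def rate(self, metric: str, score: int) -> None:
        """Record a rating, rejecting unknown metrics and out-of-range values."""
        if metric not in METRICS:
            raise ValueError(f"unknown metric: {metric}")
        if not 1 <= score <= 5:
            raise ValueError("score must be between 1 and 5")
        self.scores[metric] = score

    def is_complete(self) -> bool:
        """True once every metric has been rated."""
        return all(metric in self.scores for metric in METRICS)


row = ScorecardRow(query="What is the return policy?", response="(response text)")
row.rate("truthfulness", 4)
print(row.is_complete())
```

Validating ratings as they are entered keeps scorecards from different evaluators comparable when they are aggregated.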

Scoring system

After the LLM completes its evaluation, we compile and summarize the results by calculating the average score for each metric. Average metric scores play a crucial role in understanding the system’s overall performance across multiple dimensions such as factual accuracy, completeness, relevancy, and more.

Using an average score offers the following benefits:

Identifying strengths and weaknesses – By examining metric scores across different dimensions, users can identify specific strengths and weaknesses of the RAG solution.
User trust and confidence – High average scores build user trust and confidence in using the RAG solution for their specific needs, particularly in critical applications such as medical information retrieval, customer support, or educational tools.
Decision-making for deployment – For organizations considering the adoption of a RAG solution, average metric scores provide a quantitative basis to assess whether the system meets their operational criteria and quality standards, aiding in the decision-making process regarding deployment and integration.

A second scoring approach is to go one step further and average the per-metric scores into a single overall score for Amazon Q Business. The benefit of this approach is that it simplifies the evaluation to one quantitative metric.
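The arithmetic behind both views, an average per metric and a single overall average, can be sketched as follows. The input format (one dictionary of metric scores per evaluated query) and the equal weighting of metrics in the overall score are assumptions. The sketch also assumes every metric is oriented so that a higher score is better, which means a toxicity rating would need to be inverted before averaging.

```python
from statistics import mean


def per_metric_average(scorecards):
    """Average each metric across all evaluated queries.

    scorecards: list of dicts mapping metric name -> numeric score.
    """
    collected = {}
    for card in scorecards:
        for metric, score in card.items():
            collected.setdefault(metric, []).append(score)
    return {metric: mean(values) for metric, values in collected.items()}


def overall_score(metric_averages):
    """Collapse per-metric averages into one number (equal weights assumed)."""
    return mean(list(metric_averages.values()))


cards = [
    {"truthfulness": 4, "completeness": 5},
    {"truthfulness": 2, "completeness": 3},
]
averages = per_metric_average(cards)
print(averages)
print(overall_score(averages))
```

The single overall score is convenient for tracking changes over time, while the per-metric averages remain more useful for diagnosing where the application falls short.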

Solution architecture

Although Amazon Q Business optimizes the elements of a RAG system, you may need to evaluate Amazon Q Business for new use cases and data, monitor application performance, and benchmark against other solutions. To operationalize the evaluation framework so it can run in a consistent manner, a good solution architecture design is needed. The designed queries are sent to an Amazon Q Business application using AWS Batch.
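The core loop of such a batch job might look like the following sketch. It is not the architecture’s actual implementation: `ask` is a hypothetical callable that wraps whichever client call sends a query to your Amazon Q Business application, and `fake_ask` is a stub so the example runs without AWS access. Failures are recorded per query so that one error does not abort the whole run.

```python
import json


def run_evaluation(queries, ask):
    """Send each query through `ask` and collect one record per query."""
    records = []
    for query in queries:
        try:
            records.append({"query": query, "response": ask(query), "error": None})
        except Exception as exc:  # record the failure and continue with the batch
            records.append({"query": query, "response": None, "error": str(exc)})
    return records


def to_jsonl(records):
    """Serialize records as JSON Lines, one record per line, for later scoring."""
    return "\n".join(json.dumps(record) for record in records)


def fake_ask(query):
    """Stub standing in for the real application call (assumption)."""
    if not query.strip():
        raise ValueError("empty query")
    return f"(stub answer to: {query})"


records = run_evaluation(["What is the return policy?", " "], fake_ask)
print(to_jsonl(records))
```

Writing the query, response, and any error to a line-oriented file keeps each run reproducible and lets either an automated judge or a human evaluator score the same outputs afterwards.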

Improve Amazon Q Business response after the evaluation

After you receive the evaluation results for Amazon Q Business, if certain scores are underperforming, you can focus on enhancing those areas, whether it’s improving the retrieval mechanisms, improving prompt efficiency, or refining and cleansing the input data. You can use the insights gained to refine and improve the application’s responses through admin control, relevancy boosting to improve retrieval mechanisms, and prompting to get more accurate responses from the LLM.

Conclusion

In this post, we discussed how to develop an evaluation framework for Amazon Q Business, from selecting a use case and preparing data to using metrics that support a human in the loop evaluation. We also provided a solution architecture to scale the evaluation in a more consistent manner. Ready to transform your business with AI? Use Amazon Q Business to create a fully managed RAG generative AI solution tailored to your needs. Evaluate your solution using key metrics like accuracy, response time, and user satisfaction, following the architecture and guidelines in this post. Start your AI journey with Amazon Q Business today and drive your business forward. Visit Amazon Q Business to learn more.

About the Authors

Julia Hu, Amit Gupta, and Neil Desai are experts in the field of Artificial Intelligence and Machine Learning, with a combined experience of over 50 years. They are passionate about leveraging AI technologies to solve real-world problems and drive innovation in enterprise operations.
