
A New Approach to Medical Question-Answering with Knowledge Graphs

Why read this blog post?

Want some inspiration for your current LLM project? Do you struggle to get the right answers from ChatGPT? Do you find GenAI (generative artificial intelligence) brilliant for general-knowledge questions yet so incompetent at domain-specific tasks? This blog post will show you an interesting methodology for getting the best out of structured information and LLM capabilities. And it’s not RAG!

Based on the paper: Knowledge Graph Based Agent for Complex, Knowledge-Intensive QA in Medicine.

This blog post aims to simplify the paper, make it more accessible, and reduce its reading time.

Introduction

The rapid development of LLMs has taken the world by storm over the last two years, and artificial intelligence has held a huge share of people’s attention ever since. Although LLMs are trained on vast knowledge bases, they do not perform very well on reasoning tasks and knowledge-intensive queries. Shortcomings like these are especially apparent in fields like medicine and medical question-answering. Medical reasoning involves making diagnostic and therapeutic decisions while also understanding the pathology of diseases. Keep in mind, too, that medical answers should bear no faults or mistakes due to the sensitive nature of the task. That motivated a team of researchers from Harvard, Imperial and Pfizer to develop a Knowledge Graph-based agentic model for medical question answering.

The Nature of Medical Challenges

  • Medicine is a heavily specialized field that relies on cold, hard factual and scientific knowledge.
  • Yet answering medical questions also requires nuanced contextualization and interpretation of the modifying factors present in the case at hand.
  • Answers should therefore be factually correct and contextually relevant.
  • Compared to other fields of expertise, medical reasoning is vertical.
Horizontal vs vertical reasoning
  • To keep it simple, horizontal reasoning deduces facts by applying general principles to specific cases. It is most common in physics and mathematics.
  • Vertical reasoning, on the other hand, relies mainly on analogies to create models and exemplars. It is most common in biomedical research and medicine.

“In clinical practice, the patient serves as an exemplar, with generalizations drawn from many overlapping disease models and similar patient populations”

  • A single medical question requires consideration of semantic dependencies across multiple medical concepts at the same time.
  • “Biomedical scientists do not rely on a single approach to reasoning; instead, they use various strategies, including rule-based, prototype-based, and case-based reasoning.”

Problems with current Solutions

Reasoning with LLMs

Reasoning directly with off-the-shelf LLMs is the solution that requires the least work and offers the lowest latency. For medical question answering, however, latency is not as important as accuracy.

Main limitations:

  • Lack of a grounded knowledge base.
  • Absence of multi-source and multi-strategy reasoning.
  • Inability to reason over the granular subtleties that matter in a medical context.

The same problems persist even with fine-tuned Large Language Models.

Introducing Chain-of-Thought (CoT) prompting (here is a quick read on Chain-of-Thought) improved LLMs’ reasoning capabilities. But in the face of a knowledge-intensive task, LLMs + CoT can’t do the job either.
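For illustration, here is a minimal zero-shot CoT sketch; the question and the `call_llm` helper are hypothetical stand-ins of ours, not from the paper:

```python
# A minimal zero-shot Chain-of-Thought sketch. `call_llm` is a hypothetical
# helper standing in for whatever client you use (OpenAI, a local model, ...).
question = "A patient on warfarin is prescribed amoxicillin. What should be monitored?"

plain_prompt = question
cot_prompt = question + "\nLet's think step by step."  # the classic zero-shot CoT trigger

# answer = call_llm(cot_prompt)  # returns a reasoning chain followed by the final answer
```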

RAG (Retrieval Augmented Generation) solutions

RAG “which follows a Retrieve-then-Answer paradigm” seems to be the go-to solution in these situations. (here is a quick read, RAG implementation: step-by-step guide). I want to have exact scientific answers and semantic understanding of questions provided by LLMs , Why not simply use a RAG and retrieve knowledge from a multi-source medical corpus? Shouldn’t that be enough? Well, NO here is why:

  • As mentioned by the authors, document quality and retrieved-information quality have a deep impact on the accuracy of the generated answer. We wrote an entire blog post on this subject.
  • The data repositories and knowledge bases most used in this context are incomplete and contain incorrect information.
  • There is a lack of post-retrieval verification.

Self-RAG and Adaptive-RAG show improvements over basic RAG systems, but answer quality always remains dependent on the quality of the retrieved knowledge.

KG-models

What are Knowledge Graphs?
(Image source: https://community.atlassian.com/t5/Confluence-questions/Knowledge-graph/qaq-p/1565284)

KGs represent knowledge in a graph format, where entities (nodes) are connected by relationships (edges). This structure allows for the representation of complex interrelations and hierarchies between different concepts. Relationships are semantic in nature.
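To make that concrete, here is a minimal sketch of a KG as (head, relation, tail) triplets; the drug and symptom entities are illustrative examples of ours, not from any particular medical KG:

```python
# A knowledge graph as (head, relation, tail) triplets, using networkx.
import networkx as nx

triplets = [
    ("Aspirin", "treats", "Headache"),
    ("Aspirin", "has_side_effect", "Gastric ulcer"),
    ("Headache", "is_symptom_of", "Migraine"),
]

kg = nx.MultiDiGraph()
for head, relation, tail in triplets:
    kg.add_edge(head, tail, relation=relation)  # entities are nodes, relations are labeled edges

# Traverse the outgoing relations of an entity
for _, tail, data in kg.out_edges("Aspirin", data=True):
    print(f"Aspirin --{data['relation']}--> {tail}")
```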

For more information, here is a Blog Post explaining KG in-depth.

What are Knowledge Graph models?

KG-models are algorithms and neural-network models applied to Knowledge Graphs to retrieve information, answer questions, and analyze the relationships within a KG. They are the oldest of the methods discussed here.

Their limitations are due to:

  • Inability to handle unseen nodes.
  • Incomplete knowledge within the graphs.
  • A focus on semantic dependencies that overlooks the rich structural information.
  • Retrieval based solely on the presence of direct associations (edges): “For instance, concepts representing two proteins with distinct biological roles may not be directly connected in the KG, even though these proteins share similar biological representations”

Modern Agentic landscape

The modern agentic AI landscape is transforming question-answering tasks by enabling autonomous systems that intelligently manage complex queries across various domains.

Key innovations include:

  • Knowledge Graph Integration: (see the explanation of knowledge graphs above) AI agents leverage structured knowledge to enhance reasoning and ensure factual accuracy, making them adept at navigating intricate information landscapes.
  • LLMs for Contextual Understanding: Large language models enhance agents’ ability to interpret and respond to nuanced questions, allowing for better handling of ambiguous or complex inquiries.
  • ReAct (Reason + Act): This framework, which deserves a post of its own (luckily we made one just for you), empowers agents to reason through questions and take action based on their reasoning, facilitating dynamic interactions and iterative improvements in responses; see the sketch after this list.
  • Dynamic Task Execution: Agents autonomously retrieve, synthesize, and apply relevant information from diverse sources, adapting in real time to the context of each query.
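To make the ReAct idea concrete, here is a minimal sketch of such a loop; `call_llm` and `run_tool` are hypothetical helpers, and real agents add tool schemas, parsing guards, and stop criteria:

```python
# A minimal ReAct-style loop: the model alternates Thought -> Action,
# and the agent feeds tool results back as Observations.
def react_agent(question: str, max_steps: int = 5) -> str:
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        # Stop generation before the model invents an Observation on its own
        step = call_llm(transcript + "Thought:", stop=["Observation:"])
        transcript += "Thought:" + step
        if "Final Answer:" in step:
            return step.split("Final Answer:")[-1].strip()
        # Execute the Action line (e.g. "Action: lookup[warfarin interactions]")
        action = step.split("Action:")[-1].strip()
        transcript += f"\nObservation: {run_tool(action)}\n"
    return "No final answer within the step budget."
```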

By integrating these capabilities, agentic systems redefine question-answering, providing more accurate, context-aware, and adaptive responses that go beyond traditional approaches.

Novel methodology introduced

You have finally arrived at the beefy material!

Objectives

We want to develop a solution that:

  • “Consider complex associations between several medical concepts at the same time”
  • “Systematically integrate multi-source knowledge”
  • “Effectively verify and ground the retrieved information”

Approach

The authors propose an LLM-powered agent framework relying on four key actions.

Generate Action

Given a question set Q, this action takes a question stem q as input and, depending on the question type, generates triplets T = (h, r, t), where h is the head (node), r the relation (edge), and t the tail (node), e.g. (HSPA8, interacts, DHDDS), from medical concepts related to the question, extracted via the LLM.

The authors distinguished between choice-aware questions (questions with answer candidates) and non-choice-aware questions.
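A minimal sketch of what a Generate action could look like; the prompt wording, JSON output format, and `call_llm` helper are our own illustration, not the paper’s exact prompt:

```python
import json

# Prompt template for triplet generation (our own wording, not the paper's).
GENERATE_PROMPT = """Extract the medical concepts in the question below and
output knowledge triplets as a JSON list of [head, relation, tail] items.
{choices_block}Question: {question}
Triplets:"""

def generate_triplets(question: str, choices: list[str] | None = None) -> list[tuple[str, str, str]]:
    # Choice-aware questions also pass their answer candidates to the LLM
    choices_block = f"Candidates: {', '.join(choices)}\n" if choices else ""
    raw = call_llm(GENERATE_PROMPT.format(question=question, choices_block=choices_block))
    # Assumes the model returns valid JSON, e.g. [["HSPA8", "interacts", "DHDDS"]]
    return [tuple(t) for t in json.loads(raw)]
```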

Review Action

This is where the fun starts, and where the authors were the most creative. This action is basically going to act as a judge assessing the correctness of the triplets generated in the previous action.

To perform this task well, a fine-tuning method was adopted to ensure the model is able to ground the generated triplets in the Knowledge Graph.

Triplets are first mapped to UMLS codes (Unified Medical Language System), a standardized set of health and biomedical vocabularies and standards. This acts as a first layer of filtering, rejecting any hallucinations and made-up terms from the LLM. For the matched entities, pretrained embeddings are retrieved from the Knowledge Graph. Wait, triplets have embeddings now? Well yes; this is a somewhat old technique that enables capturing semantic relationships. Here they used a KG with embeddings generated by TransE (Translating Embeddings, Bordes et al., 2013).
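As a refresher on how TransE-style embeddings behave, here is a tiny sketch; the vectors are random stand-ins for the pretrained ones retrieved from the KG:

```python
import numpy as np

d = 128  # KG embedding dimension
rng = np.random.default_rng(0)
h, r, t = rng.normal(size=(3, d))  # stand-ins for pretrained entity/relation vectors

def transe_score(h: np.ndarray, r: np.ndarray, t: np.ndarray) -> float:
    # TransE learns vectors such that head + relation ≈ tail, so a smaller
    # distance ||h + r - t|| means the KG finds the triplet more plausible.
    return float(np.linalg.norm(h + r - t))

print(transe_score(h, r, t))
```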

Fine-tuning hasn’t started yet, but don’t lose patience; we’ll get there. Remember the initial triplets generated in the Generate action? The authors prompted an LLM to generate a description dictionary of the triplet relations in natural language, with the relation r as key and a description template as value. This ensures that the LLM is able to grasp the structural embeddings. The description is then tokenized and embedded, and the embedded description tokens are aligned with the triplet embeddings (three vectors of dimension d, the KG embedding dimension, which are projected during alignment to match the token embedding dimension).
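A small sketch of what such a description dictionary might look like; the relation names and templates are our own illustration:

```python
# Relation-description dictionary: relations as keys, natural-language
# templates as values (the templates here are illustrative).
relation_descriptions = {
    "interacts": "The protein {head} physically interacts with the protein {tail}.",
    "treats": "The drug {head} is used to treat the condition {tail}.",
}

head, rel, tail = ("HSPA8", "interacts", "DHDDS")
description = relation_descriptions[rel].format(head=head, tail=tail)
# -> "The protein HSPA8 physically interacts with the protein DHDDS."
# This text is then tokenized and embedded alongside the triplet's KG embeddings.
```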

Alignment is basically an attention head followed by a LayerNorm layer and two fully connected layers. The aligned embeddings are prepended as a prefix to the embeddings of the prompt tokens, and the whole sequence is fed into a trainable LLM that gets fine-tuned with LoRA (Low-Rank Adaptation) on a next-token prediction loss. Only the structural and description embeddings are frozen. It is interesting to note the beauty of the alignment here: it captures the semantic dependencies while incorporating the hard-coded structural information of the KG.
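Here is a rough PyTorch sketch of an alignment module matching that description; the layer sizes, cross-attention wiring, and residual connection are our guesses, not the paper’s exact architecture:

```python
import torch
import torch.nn as nn

class TripletAligner(nn.Module):
    def __init__(self, kg_dim: int = 128, token_dim: int = 4096, n_heads: int = 4):
        super().__init__()
        # Cross-attention: triplet embeddings attend to the embedded description tokens
        self.attn = nn.MultiheadAttention(kg_dim, n_heads, kdim=token_dim, vdim=token_dim, batch_first=True)
        self.norm = nn.LayerNorm(kg_dim)
        self.fc1 = nn.Linear(kg_dim, token_dim)
        self.fc2 = nn.Linear(token_dim, token_dim)

    def forward(self, triplet_vecs: torch.Tensor, desc_tok_embs: torch.Tensor) -> torch.Tensor:
        # triplet_vecs: (batch, 3, kg_dim) -- the h, r, t structural embeddings
        # desc_tok_embs: (batch, seq_len, token_dim) -- embedded description tokens
        attended, _ = self.attn(triplet_vecs, desc_tok_embs, desc_tok_embs)
        x = self.norm(attended + triplet_vecs)       # residual + LayerNorm
        return self.fc2(torch.relu(self.fc1(x)))    # project into token-embedding space

aligner = TripletAligner()
prefix = aligner(torch.randn(1, 3, 128), torch.randn(1, 24, 4096))
# prefix: (1, 3, 4096) -- prepended to the prompt-token embeddings of the LoRA-tuned LLM
```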

The fine-tuned Review architecture is now able to perform two interesting tasks:

  • It allows for knowledge graph completion of missing nodes or edges (relations), hence augmenting the KG (a use not mentioned in the paper).
  • It allows the LLM to integrate the structural embeddings of the KG, thereby augmenting the LLM. This is what the authors used to make the LLM output either True or False for each triplet, indicating its correctness.

Revise Action

Nothing special in this section. Triplets deemed true are stored in a set V; those that are incorrect or that don’t exist in the KG are rejected into a set F. The Revise action modifies the false triplets until they pass the Review action or exceed the maximum number of iterations.
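A minimal sketch of this Review-then-Revise loop; `review` and `revise` stand in for the fine-tuned reviewer’s True/False verdict and the revision LLM call:

```python
def review_revise(triplets: list[tuple], max_iters: int = 3) -> tuple[list, list]:
    V, F = [], []  # verified / failed triplets
    for triplet in triplets:
        for _ in range(max_iters):
            if review(triplet):        # fine-tuned reviewer outputs True/False
                V.append(triplet)
                break
            triplet = revise(triplet)  # LLM proposes a corrected triplet
        else:
            F.append(triplet)          # never passed review within the budget
    return V, F
```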

Answer Action

An LLM is prompted to use the most suitable set of true triplets to generate the answer.

Really important comment: this is not RAG. The system is not retrieving information from a graph based on a query; the KG is used in this approach for reviewing purposes. It is the LLM that generates the triplets, and it is not constrained to the KG.

Results

New dataset used: A benchmark called MedDDx is introduced, focusing on differential diagnosis (DDx). It includes 1,769 multi-choice QA samples divided into three difficulty levels: Basic, Intermediate, and Expert.

KGAREVION performance: KGAREVION outperforms baseline models, showing a 4.8% improvement in average accuracy on multiple-choice medical QA tasks. It excels in complex scenarios with multiple medical concepts and semantically similar answer candidates.

Open-ended reasoning: KGAREVION performs better without predefined answer choices, showcasing enhanced capabilities in realistic medical scenarios. It demonstrates strong open-ended reasoning skills, improving on complex tasks compared to multi-choice setups.

Review and Revise actions: The model’s accuracy improves by 3-9% when integrating a Review process, particularly benefiting more complex datasets like MedDDx. The Revise action also aids in correcting errors, refining performance.

Robustness to order sensitivity: KGAREVION is significantly less affected by the order and indexing of candidate answers, reducing accuracy loss compared to pure LLMs, showcasing its robustness in multi-choice QA setups.

Conclusion

I hope reading this blog post inspired you with new ideas and new approaches for tackling your next project. This is not a cookie-cutter method to improve question-answering performance; rather, treat it as a tool to add to your arsenal.