Want some inspiration for your current LLM project? Do you struggle to get the right answers from ChatGPT? Do you find GenAI (generative artificial intelligence) brilliant for general-knowledge questions yet incompetent at domain-specific tasks? This blog post will show you an interesting methodology for getting the best out of structured information and LLM capabilities. And it’s not RAG!
Based on the paper: Knowledge Graph Based Agent for Complex, Knowledge-Intensive QA in Medicine.
This blog post tries to simplify the paper, make it more accessible and reduce its reading time.
The huge development of LLMs has taken the world by storm over the last two years, and artificial intelligence has grabbed a huge chunk of people’s attention since. Although LLMs are trained on vast knowledge bases, they do not perform very well on reasoning tasks and knowledge-intensive queries. Shortcomings like these are especially apparent in fields like medicine and medical question answering. Medical reasoning involves making diagnostic and therapeutic decisions while also understanding the pathology of diseases. Keep in mind that medical answers should bear no faults or mistakes, given the sensitive nature of the task. That motivated a team of researchers from Harvard, Imperial College London, and Pfizer to develop a knowledge-graph-based agentic model tackling medical question answering.
“In clinical practice, the patient serves as an exemplar, with generalizations drawn from many overlapping disease models and similar patient populations”
Prompting an off-the-shelf LLM is the solution that requires the least amount of work and provides the smallest latency. But for medical question answering, latency of the AI model is not as important as accuracy.
Main limitation:
The same problem persists even with fine-tuned large language models.
Introducing Chain-of-Thought (CoT) prompting (here is a quick read on Chain-of-Thought) improved LLMs’ reasoning capabilities. But in the face of a knowledge-intensive task, LLMs + CoT can’t do the job either.
RAG, “which follows a Retrieve-then-Answer paradigm”, seems to be the go-to solution in these situations (here is a quick read, RAG implementation: step-by-step guide). I want exact scientific answers combined with the semantic understanding of questions that LLMs provide, so why not simply use RAG and retrieve knowledge from a multi-source medical corpus? Shouldn’t that be enough? Well, no. Here is why:
Self-RAG and Adaptive-RAG can show improvements over basic RAG systems, but answer quality always depends on the quality of the retrieved knowledge.
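To make the contrast with the agentic approach clearer later on, here is a minimal sketch of the Retrieve-then-Answer paradigm. The keyword-overlap retriever and the `llm_answer` stub are placeholders of my own, not the paper’s pipeline.

```python
# A minimal sketch of the Retrieve-then-Answer paradigm (not the paper's pipeline).
# The scoring function and llm_answer() helper are hypothetical placeholders.

def retrieve(question: str, corpus: list[str], k: int = 3) -> list[str]:
    """Rank corpus passages by naive keyword overlap with the question."""
    q_terms = set(question.lower().split())
    ranked = sorted(corpus, key=lambda doc: -len(q_terms & set(doc.lower().split())))
    return ranked[:k]

def llm_answer(prompt: str) -> str:
    """Stand-in for a call to any LLM API."""
    raise NotImplementedError

def rag_answer(question: str, corpus: list[str]) -> str:
    # Retrieve, stuff the context into the prompt, then answer.
    context = "\n".join(retrieve(question, corpus))
    prompt = f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    return llm_answer(prompt)
```

If the retriever misses the relevant passage, or the corpus itself is noisy, the answer degrades no matter how strong the LLM is, which is exactly the dependency described above.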
KGs represent knowledge in a graph format, where entities (nodes) are connected by relationships (edges). This structure allows for the representation of complex interrelations and hierarchies between different concepts. Relationships are semantic in nature.
For more information, here is a Blog Post explaining KG in-depth.
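As a toy illustration of that structure, a medical KG can be stored as a set of (head, relation, tail) triplets, for example with networkx. The entities and relations below are examples for illustration, not an excerpt from the graph used in the paper.

```python
import networkx as nx

# Toy medical knowledge graph: nodes are entities, labeled edges are semantic relations.
triplets = [
    ("HSPA8", "interacts_with", "DHDDS"),
    ("DHDDS", "associated_with", "Retinitis pigmentosa 59"),
    ("Retinitis pigmentosa 59", "is_a", "Retinal disease"),
]

kg = nx.MultiDiGraph()
for head, relation, tail in triplets:
    kg.add_edge(head, tail, relation=relation)

# Query the neighborhood of a concept.
for _, tail, data in kg.out_edges("DHDDS", data=True):
    print("DHDDS", data["relation"], tail)
```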
KG models are algorithms and neural network models applied to knowledge graphs to retrieve information, answer questions, and analyze the relationships within a KG. They are the oldest of the approaches discussed here.
Limitations are due to:
The modern agentic AI landscape is transforming question-answering tasks by enabling autonomous systems that intelligently manage complex queries across various domains.
Key innovations include:
By integrating these capabilities, agentic systems redefine question-answering, providing more accurate, context-aware, and adaptive responses that go beyond traditional approaches.
You have finally arrived at the beefy material!
We want to develop a solution that:
- “Consider complex associations between several medical concepts at the same time”
- “Systematically integrate multi-source knowledge”
- “Effectively verify and ground the retrieved information”
The authors proposed an LLM-powered agent framework relying on 4 key Actions.
Given a question set Q, this action takes a question stem q as input and, depending on the question type, generates a triplet T = (h, r, t), where h is the head (node), r is the relation (edge), and t is the tail (node), e.g. (HSPA8, interacts, DHDDS), from the medical concepts that the LLM extracts as relevant to the question.
The authors distinguished between choice-aware questions (questions with answer candidates) and non-choice-aware questions.
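A rough sketch of what the Generate action could look like when wrapped around an LLM call. The prompt wording, the `call_llm` stub, and the output parsing are illustrative assumptions, not the authors’ actual prompts.

```python
from typing import NamedTuple

class Triplet(NamedTuple):
    head: str
    relation: str
    tail: str

def call_llm(prompt: str) -> str:
    """Stand-in for any LLM API call."""
    raise NotImplementedError

def generate_triplets(question: str, choices: list[str] | None = None) -> list[Triplet]:
    """Generate action: extract medical concepts and emit candidate (h, r, t) triplets."""
    if choices:  # choice-aware question: condition generation on the candidate answers
        prompt = (
            f"Question: {question}\nCandidate answers: {', '.join(choices)}\n"
            "For each candidate, list the medical knowledge triplets (head, relation, tail) "
            "connecting the question concepts to that candidate, one per line."
        )
    else:        # non-choice-aware question: triplets come from the question stem alone
        prompt = (
            f"Question: {question}\n"
            "List the medical knowledge triplets (head, relation, tail) needed to answer it, one per line."
        )
    raw = call_llm(prompt)
    return [
        Triplet(*line.strip("() ").split(", "))
        for line in raw.splitlines()
        if line.count(", ") == 2
    ]
```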
This is where the fun starts, and where the authors were the most creative. This action is basically going to act as a judge assessing the correctness of the triplets generated in the previous action.
To better perform this task, a fine-tuning method was adopted to ensure the model is able to ground the generated triplets in the knowledge graph.
Triplets are first mapped to UMLS codes (Unified Medical Language System, a standardized set of health and biomedical vocabularies). This acts as a first layer of filtering, rejecting hallucinations and made-up terms produced by the LLM. For the matched entities, pretrained embeddings are retrieved from the knowledge graph. Wait, triplets have embeddings now? Well yes, this is a somewhat older technique that captures semantic relationships. Here they used a KG with embeddings generated by TransE (Translating Embeddings, Bordes et al., 2013).
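The intuition behind TransE is that a relation acts as a translation in embedding space: for a true triplet, head + relation should land close to tail. A minimal sketch, with random vectors standing in for the pretrained KG embeddings:

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 128

# Stand-ins for pretrained TransE embeddings looked up after UMLS grounding.
emb = {name: rng.normal(size=dim) for name in ["HSPA8", "DHDDS", "interacts_with"]}

def transe_distance(h: np.ndarray, r: np.ndarray, t: np.ndarray) -> float:
    """TransE plausibility: smaller ||h + r - t|| means a more plausible triplet."""
    return float(np.linalg.norm(h + r - t))

d = transe_distance(emb["HSPA8"], emb["interacts_with"], emb["DHDDS"])
print(f"distance = {d:.2f}")  # lower distance -> more plausible under TransE
```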
Fine-tuning hasn’t started yet, but don’t lose patience, we’ll get there. Remember the initial triplet generated in the Generate action? Well, they prompted an LLM to generate a description dictionary of the triplet relations in natural language, with the relation r as keys and description templates as values. This ensures that the LLM is able to grasp the structural embeddings. The description is then tokenized and embedded. The embedded description tokens get aligned with the triplet embeddings (3 vectors of dimension d, the KG embedding dimension, which get projected in the alignment phase to match the token-embedding dimension).
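As a concrete picture of that step, the relation-description dictionary might look something like the following; the templates here are invented for illustration, since the real ones are generated by an LLM.

```python
# Hypothetical relation-description dictionary: relation types as keys,
# natural-language templates as values.
relation_descriptions = {
    "interacts_with": "{head} physically or functionally interacts with {tail}.",
    "associated_with": "{head} is associated with the condition {tail}.",
    "treats": "{head} is used as a treatment for {tail}.",
}

# The filled-in description is then tokenized and embedded, and aligned with the
# TransE embeddings of (head, relation, tail) before being fed to the LLM.
print(relation_descriptions["interacts_with"].format(head="HSPA8", tail="DHDDS"))
```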
Alignment is basically an attention head followed by a LayerNorm layer and two fully connected layers. The aligned embeddings are prepended as a prefix to the embeddings of the prompt tokens. These embeddings are fed into a trainable LLM that is fine-tuned with LoRA (Low-Rank Adaptation) on a next-token-prediction loss. Only the structural and description embeddings are frozen. It is interesting to notice the beauty of alignment here: it captures the semantic dependencies while including the hard-coded structured information of the KG.
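Read literally, that alignment block could be sketched roughly as follows in PyTorch. The layer sizes, the single-head attention setup, and the module naming are assumptions of mine, not the authors’ exact implementation.

```python
import torch
import torch.nn as nn

class TripletAligner(nn.Module):
    """Projects KG triplet embeddings into the LLM token-embedding space.

    Sketch of the described alignment block: one attention head, a LayerNorm,
    and two fully connected layers. Dimensions are illustrative assumptions.
    """

    def __init__(self, kg_dim: int = 128, token_dim: int = 4096):
        super().__init__()
        self.proj = nn.Linear(kg_dim, token_dim)   # lift KG dim to token dim
        self.attn = nn.MultiheadAttention(token_dim, num_heads=1, batch_first=True)
        self.norm = nn.LayerNorm(token_dim)
        self.mlp = nn.Sequential(
            nn.Linear(token_dim, token_dim), nn.ReLU(), nn.Linear(token_dim, token_dim)
        )

    def forward(self, triplet_emb: torch.Tensor, desc_emb: torch.Tensor) -> torch.Tensor:
        # triplet_emb: (batch, 3, kg_dim) for (h, r, t); desc_emb: (batch, seq, token_dim)
        x = self.proj(triplet_emb)
        x, _ = self.attn(x, desc_emb, desc_emb)    # attend over the description tokens
        x = self.norm(x)
        return self.mlp(x)                         # prefix embeddings for the LLM

aligner = TripletAligner()
prefix = aligner(torch.randn(1, 3, 128), torch.randn(1, 12, 4096))
print(prefix.shape)  # torch.Size([1, 3, 4096]) -> prepended to the prompt token embeddings
```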
The fine-tuned Review architecture is now able to perform two interesting tasks:
Nothing special in this section. Triplets deemed true are stored in a set V. Those that are not correct, or that do not exist in the KG, are rejected into a set F. The Revise action modifies the false triplets until they pass the Review action or exceed the maximum number of iterations.
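In pseudocode, the Review/Revise loop might look like this; the `review` and `revise` functions are placeholders for the fine-tuned Review model and the LLM-based Revise prompt, and the iteration cap is an assumption.

```python
MAX_ITERS = 3  # assumed cap on revision rounds

def review(triplet) -> bool:
    """Stand-in for the fine-tuned Review model: True if the triplet is grounded in the KG."""
    raise NotImplementedError

def revise(triplet):
    """Stand-in for the Revise action: prompt the LLM to correct a rejected triplet."""
    raise NotImplementedError

def review_and_revise(candidates):
    verified, rejected = [], []          # the V and F sets
    for triplet in candidates:
        for _ in range(MAX_ITERS):
            if review(triplet):
                verified.append(triplet)
                break
            triplet = revise(triplet)    # try again with the revised triplet
        else:
            rejected.append(triplet)     # still failing after MAX_ITERS revisions
    return verified, rejected
```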
An LLM is prompted to use the most suitable set of true triplets to generate the answer.
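And the final Answer action, again with placeholder prompt wording rather than the paper’s exact prompt:

```python
def call_llm(prompt: str) -> str:
    """Stand-in for any LLM API call."""
    raise NotImplementedError

def answer(question: str, verified_triplets: list[tuple[str, str, str]]) -> str:
    """Answer action: ground the final answer in the verified triplets only."""
    facts = "\n".join(f"({h}, {r}, {t})" for h, r, t in verified_triplets)
    prompt = (
        f"Verified medical knowledge:\n{facts}\n\n"
        f"Question: {question}\nUse only the knowledge above to answer."
    )
    return call_llm(prompt)
```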
Really important comment: this is not RAG. The system is not retrieving information from a graph based on the query; the KG is used in this approach for reviewing purposes. It is the LLM that generates the triplets, and it is not constrained to the KG.
New dataset used: A benchmark called MedDDx is introduced, focusing on differential diagnosis (DDx). It includes 1,769 multi-choice QA samples divided into three difficulty levels: Basic, Intermediate, and Expert.
KGAREVION performance: KGAREVION outperforms baseline models, showing a 4.8% improvement in average accuracy on multiple-choice medical QA tasks. It excels in complex scenarios with multiple medical concepts and semantically similar answer candidates.
Open-ended reasoning: KGAREVION performs better without predefined answer choices, showcasing enhanced capabilities in realistic medical scenarios. It demonstrates strong open-ended reasoning skills, improving on complex tasks compared to multi-choice setups.
Review and Revise actions: The model’s accuracy improves by 3-9% when integrating a Review process, particularly benefiting more complex datasets like MedDDx. The Revise action also aids in correcting errors, refining performance.
Robustness to order sensitivity: KGAREVION is significantly less affected by the order and indexing of candidate answers, reducing accuracy loss compared to pure LLMs, showcasing its robustness in multi-choice QA setups.
I hope reading this blog post helped inspire you with new ideas and new approaches for tackling your next project. This is not a cookie-cutter method to improve question-answering performance; rather, treat it as a tool to add to your arsenal.