Question answering over multiple documents using LLM

June 4, 2023



Large language models (LLMs) have become increasingly popular and numerous applications built on them have emerged. Personally, I built LearnAnything with my friend James during the last winter break, leveraging OpenAI's GPT-3 model (the latest deployment uses GPT-4) to help people generate learning paths on any topic.

Although it's relatively simple to build something seemingly cool using LLMs, their real power emerges when we augment them with external tools and knowledge. For example, we can integrate search engines and external knowledge resources so that LLMs can access real-time information and cite more reliable sources in their answers. Furthermore, some work incorporates computer vision models to perform visual reasoning tasks [1, 2, 3] or uses LLMs to generate complex programs [4].

This increasing complexity in LLM pipelines has motivated a discussion on task composability, detailed in Chip Huyen's blog post on LLMs in production. For this post, I built a simple LLM pipeline that uses OpenAI's GPT model to answer questions based on information from my own notes, using LangChain and Chroma.

Conceptual Walkthrough

The task can be formulated as an information retrieval1 problem consisting of two parts:

  1. Part 1 (retrieving) - Fetch any relevant notes based on the input question.
  2. Part 2 (answering) - Pass those retrieved notes along with the question to the language model of your choice to generate a response. Here, an LLM like GPT-4 is used to extract and summarize answers using information from multiple notes instead of simply returning partial quotes. This can almost be seen as an "information aggregation layer".
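Stripped of the libraries, the two parts can be sketched in plain Python. Retrieval is faked here with simple word-overlap scoring, and the LLM call is replaced by prompt construction; `score`, `retrieve`, and `build_prompt` are illustrative stand-ins, not LangChain APIs:

```python
def score(question: str, note: str) -> int:
    """Toy relevance score: number of shared lowercase words."""
    return len(set(question.lower().split()) & set(note.lower().split()))

def retrieve(question: str, notes: dict[str, str], k: int = 2) -> list[str]:
    """Part 1 (retrieving): return the names of the k most relevant notes."""
    ranked = sorted(notes, key=lambda name: score(question, notes[name]), reverse=True)
    return ranked[:k]

def build_prompt(question: str, notes: dict[str, str], names: list[str]) -> str:
    """Part 2 (answering): aggregate the retrieved notes into one prompt for the LLM."""
    context = "\n\n".join(f"[{name}]\n{notes[name]}" for name in names)
    return f"Answer using only the notes below.\n\n{context}\n\nQuestion: {question}"

notes = {
    "WER": "word error rate evaluates speech recognition models",
    "BLEU": "BLEU score evaluates machine translation models",
    "Chroma": "Chroma is an open-source vector database",
}
top = retrieve("What is word error rate?", notes)
prompt = build_prompt("What is word error rate?", notes, top)
```

A real pipeline replaces the word-overlap score with embedding similarity and sends the assembled prompt to an LLM, but the retrieve-then-answer shape stays the same.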

Some challenges arise in both parts:

  1. How do we ingest the notes (all in Markdown) into a format that supports fast querying in part 1?
    1. Solution: encode all the notes into an embedding space (I chose OpenAI's embedding API) and then index them using Chroma, a fast open-source vector database (also called an embedding database).
  2. How do we design the prompt in part 2 so that the LLM generates reliable answers?
    1. Solution: use LangChain's built-in chain RetrievalQAWithSourcesChain.


Here's a look at the final result. All we need is a few lines of code.


{'question': "What's the difference between word error rate (WER) and BLEU score?",
 'answer': ' Word Error Rate (WER) is a metric used to evaluate speech recognition models, while BLEU Score is a metric used to evaluate machine translation models. WER is computed using the Levenshtein Distance Algorithm, while BLEU Score is computed using a brevity penalty and n-gram precision.\n',
 'sources': 'Obsidian_DB/02-SlipBox/, Obsidian_DB/02-SlipBox/BLEU, Obsidian_DB/02-SlipBox/Levenshtein Distance, Obsidian_DB/02-SlipBox/ML Model'}

From here, we can see that the code is able to answer a semantic search query (i.e. the question) using information from multiple different notes. Instead of reading through the notes one by one, I can now simply query them with questions written in natural language.

To further test the latency, I ran 20 different queries; on average it took 12.6 seconds to generate an answer. Most of that time was spent on inference against OpenAI's GPT API endpoint.
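That average can be reproduced with a small timing helper like the one below; `avg_latency` is my own helper, not part of LangChain, and the lambda stands in for `index.query_with_sources`:

```python
import time

def avg_latency(fn, queries):
    """Average wall-clock seconds per call of fn over the given queries."""
    total = 0.0
    for q in queries:
        start = time.perf_counter()
        fn(q)
        total += time.perf_counter() - start
    return total / len(queries)

# Stand-in for the real query function, which would hit the OpenAI API.
mean_seconds = avg_latency(lambda q: q.upper(), ["query one", "query two"])
```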

Code Walkthrough

Loading Documents (line:6-7)

loader = DirectoryLoader("Obsidian_DB/", glob="**/*.md", show_progress=True)
docs = loader.load()
  • Notes are copied directly from my Obsidian vault, since Obsidian manages notes offline in a local folder. You can also export your notes from Notion or your note-taking app of choice. Markdown works just fine; there is no need to convert the files into plain text.
  • LangChain supports loading data from many different formats (JSON, CSV, Email, EPub, etc.) and sources (Hacker News, HuggingFace datasets, Wikipedia, etc.). For more details, refer to its collection of document loaders. We use DirectoryLoader to load all the .md files in the Obsidian_DB/ directory, as shown in line 6. DirectoryLoader is a BaseLoader, which you can pass into VectorstoreIndexCreator to create an index using Chroma.
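DirectoryLoader's file discovery is essentially a recursive glob. Independent of LangChain, the same selection can be done with the standard library (the throwaway vault below is only for demonstration):

```python
from pathlib import Path
import tempfile

# Build a throwaway vault with a couple of files to select from.
vault = Path(tempfile.mkdtemp())
(vault / "sub").mkdir()
(vault / "note.md").write_text("# A note")
(vault / "sub" / "nested.md").write_text("# A nested note")
(vault / "image.png").write_bytes(b"")

# Equivalent of glob="**/*.md": recurse and keep only Markdown files.
md_files = sorted(p.relative_to(vault).as_posix() for p in vault.glob("**/*.md"))
```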

Warning: make sure you don't ingest notes that include passwords or any other sensitive information when creating the index, to prevent a potential prompt injection attack2

Creating an Index (line:10)

index = VectorstoreIndexCreator().from_loaders([loader])

There are three main steps going on inside VectorstoreIndexCreator:

  1. Splitting documents into chunks
  2. Creating embeddings for each document
  3. Storing documents and embeddings in a vectorstore

Internally, these steps correspond to a text splitter, an embedding model, and a Chroma vector store.
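The three steps can also be illustrated without LangChain at all. Below, `fake_embed` stands in for OpenAI's embedding API and a plain list serves as the vector store; all names here are illustrative, not library APIs:

```python
import math

def split_into_chunks(text: str, chunk_size: int = 40) -> list[str]:
    """Step 1: split a document into fixed-size character chunks."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

def fake_embed(text: str) -> list[float]:
    """Step 2 stand-in: letter-frequency vector; a real pipeline calls an embedding API."""
    vec = [0.0] * 26
    for ch in text.lower():
        if ch.isalpha():
            vec[ord(ch) - ord("a")] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def build_index(docs: list[str]) -> list[tuple[str, list[float]]]:
    """Step 3: store each chunk alongside its embedding."""
    index = []
    for doc in docs:
        for chunk in split_into_chunks(doc):
            index.append((chunk, fake_embed(chunk)))
    return index

index = build_index(["WER evaluates speech recognition.", "BLEU evaluates translation."])
```

Querying then reduces to embedding the question the same way and returning the chunks with the highest cosine similarity.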

Querying (line:18)

index.query_with_sources("What's the difference between word error rate (WER) and BLEU score?")
  • LangChain's retriever interface connects documents (e.g. notes, webpages, wikis) with language models. Under the hood, the code above creates a RetrievalQAWithSourcesChain from the index and runs the query (i.e. the question) through the chain.
    • By default, RetrievalQAWithSourcesChain uses OpenAI's GPT-3 model as its LLM.
    • You can find actual prompts used by LangChain in their source code.
  • The chain_type argument is mainly for tuning answer quality. The LangChain library offers four chain types for question answering and summarization: stuff, map_reduce, refine, and map_rerank. The stuff chain is the simplest and quickest: it stuffs all retrieved documents into a single prompt and makes one LLM call. The map_reduce chain runs the LLM over each document independently (map) and then combines the intermediate answers into a final one (reduce), which scales to more documents at the cost of extra calls. The refine chain iterates over the documents one at a time, updating its answer as each new document is seen, and the map_rerank chain answers against each document independently and returns the highest-scoring answer.
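The difference between the two most common chain types can be made concrete in plain Python, with `call_llm` as a counting stand-in for a real model call (all function names here are illustrative, not LangChain's):

```python
calls = {"n": 0}

def call_llm(prompt: str) -> str:
    """Stand-in for a real LLM API call; counts invocations."""
    calls["n"] += 1
    return f"answer to prompt of {len(prompt)} chars"

def stuff_chain(question: str, docs: list[str]) -> str:
    """stuff: pack every document into one prompt -- one LLM call total."""
    context = "\n\n".join(docs)
    return call_llm(f"{context}\n\nQuestion: {question}")

def map_reduce_chain(question: str, docs: list[str]) -> str:
    """map_reduce: one call per document, plus one call to combine the answers."""
    partials = [call_llm(f"{doc}\n\nQuestion: {question}") for doc in docs]
    return call_llm("Combine these partial answers:\n" + "\n".join(partials))

docs = ["note one", "note two", "note three"]
calls["n"] = 0
stuff_chain("What is WER?", docs)
stuff_calls = calls["n"]        # 1 call
calls["n"] = 0
map_reduce_chain("What is WER?", docs)
map_reduce_calls = calls["n"]   # len(docs) + 1 = 4 calls
```

This is why stuff is fast but limited by the context window, while map_reduce handles many documents at the price of extra API calls (and latency).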


Thank you to Julien Zhu for making this post better. In addition, this post is inspired by LangChain's detailed tutorial on question answering over documents.