We recently open-sourced an auto-evaluator tool for grading LLM question-answer chains. We are now releasing an open-source, free-to-use hosted app and API to expand usability. Below, we discuss a few opportunities to further improve it.
Document question-answering is a popular LLM use case. LangChain makes it easy to assemble LLM components (e.g., models and retrievers) into chains that support question-answering: input documents are split into chunks and stored in a retriever; relevant chunks are retrieved given a user question and passed to an LLM for synthesis into an answer.
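As a concrete illustration, here is a minimal sketch of such a chain using LangChain's Python API; the file name, chunk size, and FAISS retriever are illustrative assumptions, not settings prescribed by the app.

```python
# A minimal QA chain sketch: split -> store in retriever -> retrieve -> synthesize.
# The file name, chunk size, and FAISS vector store are illustrative choices.
from langchain.document_loaders import TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import FAISS
from langchain.chat_models import ChatOpenAI
from langchain.chains import RetrievalQA

# 1. Split the input document into chunks.
docs = TextLoader("my_doc.txt").load()
splits = RecursiveCharacterTextSplitter(
    chunk_size=1000, chunk_overlap=100
).split_documents(docs)

# 2. Store the chunks in a retriever (here, a FAISS store over OpenAI embeddings).
retriever = FAISS.from_documents(splits, OpenAIEmbeddings()).as_retriever(
    search_kwargs={"k": 4}
)

# 3. Retrieve relevant chunks for a question and synthesize an answer with an LLM.
qa_chain = RetrievalQA.from_chain_type(llm=ChatOpenAI(temperature=0), retriever=retriever)
print(qa_chain.run("What topics does the document cover?"))
```

Each of these choices (splitter, chunk size, embeddings, retriever, model) is exactly the kind of knob the evaluation discussed below is meant to help tune.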
The quality of QA systems can vary considerably; we have seen cases of hallucination and poor answer quality due to specific parameter settings. But it is not always obvious how to (1) evaluate answer quality and (2) use this evaluation to guide improved QA chain settings (e.g., chunk size, number of retrieved documents) or components (e.g., choice of model or retriever).
The auto-evaluator aims to address these limitations. It is inspired by work in two areas: (1) Anthropic has used model-written evaluation sets, and (2) OpenAI has shown model-graded evaluation. This app combines both ideas into a single workspace, auto-generating a QA test set for a given input document and auto-grading the result of the user-specified QA chain. LangChain's abstractions make it easy to configure a QA chain from modular components.
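A rough sketch of how those two ideas compose, using LangChain's QAGenerationChain (model-written test set) and QAEvalChain (model-graded evaluation). This assumes `doc_text` and `qa_chain` from the chain sketch above; the prompts and settings are LangChain defaults, not necessarily the app's exact configuration.

```python
# Model-written test set + model-graded evaluation, combined.
# Assumes `doc_text` (the raw input document) and `qa_chain` (the
# user-specified QA chain) are defined as in the earlier sketch.
from langchain.chat_models import ChatOpenAI
from langchain.chains import QAGenerationChain
from langchain.evaluation.qa import QAEvalChain

llm = ChatOpenAI(temperature=0)

# 1. Auto-generate a QA test set from the input document.
gen_chain = QAGenerationChain.from_llm(llm)
examples = gen_chain.run(doc_text)  # [{"question": ..., "answer": ...}, ...]

# 2. Run the user-specified QA chain on each generated question.
predictions = [{"result": qa_chain.run(ex["question"])} for ex in examples]

# 3. Auto-grade the predicted answers against the generated ground truth.
eval_chain = QAEvalChain.from_llm(llm)
grades = eval_chain.evaluate(
    examples,
    predictions,
    question_key="question",
    answer_key="answer",
    prediction_key="result",
)
for ex, grade in zip(examples, grades):
    # Each grade is a dict holding the model's verdict (e.g., CORRECT/INCORRECT);
    # the exact output key depends on the LangChain version.
    print(ex["question"], grade)
```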
We are now releasing an open-source, free-to-use hosted app and API for evaluating QA chains. The app can be used in two ways (see the README for more details):
- **Demo**: We pre-loaded a document (a transcript of the Lex Fridman podcast with Andrej Karpathy) and a set of 5 question-answer pairs from the podcast. You can configure QA chain(s) and run experiments to evaluate their relative performance.
- **Playground**: Inspired by the nat.dev playground, a user can input a document to evaluate various QA chain(s) on. Optionally, a user can include a test set of question-answer pairs related to the document; see examples here and here.

**File handling**
File transfer from the client to the back-end is slow. For 2 files (39 MB), the transfer takes ~40 sec:
| Stage | Prod elapsed time (OAI embedding) | Local elapsed time (OAI embedding) |
|---|---|---|
| Transfer file | 37 sec | 0 sec |
| Reading file | 5 sec | 1 sec |
| Splitting docs | 3 sec | 3 sec |
| Making LLM | 1 sec | 1 sec |
| Make retriever | 6 sec | 2 sec |
| Success | ✅ | ✅ |
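For reference, a minimal sketch of how per-stage timings like these could be collected: wrap each back-end stage in a simple timer. The stage bodies below are hypothetical stubs, not functions from the repo; the real stages would read the uploaded files, split them, build the LLM, and build the retriever as in the earlier sketch.

```python
# Time each back-end stage with a small context manager.
import time
from contextlib import contextmanager

@contextmanager
def timed(stage: str):
    start = time.perf_counter()
    yield
    print(f"{stage}: {time.perf_counter() - start:.1f} sec")

# Hypothetical stubs standing in for the real stage implementations.
def read_files():
    return "..." * 100_000

def split_docs(text, chunk_size=1000):
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

with timed("Reading file"):
    text = read_files()
with timed("Splitting docs"):
    splits = split_docs(text)
```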