Lance Martin

TL;DR

We recently open-sourced an auto-evaluator tool for grading LLM question-answering (QA) chains. We are now releasing an open-source, free-to-use hosted app and API to expand usability. Below, we discuss a few opportunities to improve it further.

Context

Document question-answering is a popular LLM use case. LangChain makes it easy to assemble LLM components (e.g., models and retrievers) into chains that support question-answering: input documents are split into chunks and stored in a retriever; given a user question, the relevant chunks are retrieved and passed to an LLM, which synthesizes them into an answer.
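For concreteness, here is a minimal sketch of that pattern using 2023-era LangChain components. The file name, chunk size, retriever, and model below are illustrative choices, not the app's actual settings.

```python
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import FAISS
from langchain.chat_models import ChatOpenAI
from langchain.chains import RetrievalQA

# 1. Load the input document and split it into chunks
docs = PyPDFLoader("my_doc.pdf").load()
splits = RecursiveCharacterTextSplitter(
    chunk_size=1000, chunk_overlap=100
).split_documents(docs)

# 2. Embed the chunks and store them in a retriever (here, a FAISS vector store)
retriever = FAISS.from_documents(splits, OpenAIEmbeddings()).as_retriever(
    search_kwargs={"k": 3}
)

# 3. Retrieve relevant chunks for a question and synthesize an answer with an LLM
qa_chain = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0),
    chain_type="stuff",
    retriever=retriever,
)
print(qa_chain.run("What is the main topic of this document?"))
```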

Problem

The quality of QA systems can vary considerably; we have seen cases of hallucination and poor answer quality due to specific parameter settings. But it is not always obvious how to (1) evaluate answer quality and (2) use that evaluation to guide improved QA chain settings (e.g., chunk size, number of retrieved documents) or components (e.g., model or retriever choice).

App

The auto-evaluator aims to address these limitations. It is inspired by work in two areas: (1) Anthropic has used model-written evaluation sets, and (2) OpenAI has shown model-graded evaluation. This app combines both ideas into a single workspace: it auto-generates a QA test set for a given input document and auto-grades the result of the user-specified QA chain. LangChain's abstractions make it easy to configure a QA chain from modular components (shown in colors below).

[Diagram: QA chain configured from modular components]
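To make those two ideas concrete, here is a rough sketch using LangChain's built-in QAGenerationChain (model-written test set) and QAEvalChain (model-graded answers). This is an illustration rather than the app's actual implementation, and it reuses the docs and qa_chain variables from the sketch above.

```python
from langchain.chat_models import ChatOpenAI
from langchain.chains import QAGenerationChain
from langchain.evaluation.qa import QAEvalChain

llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)

# 1. Model-written eval set: generate question-answer pairs from the input doc
doc_text = " ".join(d.page_content for d in docs)
eval_set = QAGenerationChain.from_llm(llm).run(doc_text[:3000])

# 2. Run the user-specified QA chain on each generated question
predictions = [{"result": qa_chain.run(pair["question"])} for pair in eval_set]

# 3. Model-graded eval: an LLM grades each answer against the generated reference
grades = QAEvalChain.from_llm(llm).evaluate(
    eval_set,
    predictions,
    question_key="question",
    answer_key="answer",
    prediction_key="result",
)
for pair, grade in zip(eval_set, grades):
    print(pair["question"], "->", grade)
```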

Usage

We are now releasing an open-source, free-to-use hosted app and API for evaluating QA chains. The app can be used in two ways (see the README for more details):

[Screenshot: the two ways to use the app]

Opportunities for improvement

File handling

File transfer from the client to the back-end is slow. For 2 files (39 MB), the transfer takes ~40 sec:

| Stage          | Prod elapsed time (OAI embeddings) | Local elapsed time (OAI embeddings) |
| -------------- | ---------------------------------- | ----------------------------------- |
| Transfer file  | 37 sec                             | 0 sec                               |
| Reading file   | 5 sec                              | 1 sec                               |
| Splitting docs | 3 sec                              | 3 sec                               |
| Making LLM     | 1 sec                              | 1 sec                               |
| Make retriever | 6 sec                              | 2 sec                               |
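For reference, a small self-contained sketch (not the app's actual code) of how per-stage timings like those above can be captured; the sleep calls are placeholders for the real stages.

```python
import time
from contextlib import contextmanager

timings = {}

@contextmanager
def timed(stage):
    """Record wall-clock time for one pipeline stage."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[stage] = time.perf_counter() - start

# Placeholders stand in for the real work (file read, splitting, retriever build)
with timed("Reading file"):
    time.sleep(0.1)
with timed("Splitting docs"):
    time.sleep(0.1)
with timed("Make retriever"):
    time.sleep(0.1)

for stage, sec in timings.items():
    print(f"{stage}: {sec:.1f} sec")
```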