Lance Martin

TL;DR

We recently open-sourced an auto-evaluator tool for grading LLM question-answering (QA) chains. We are now releasing an open-source, free-to-use hosted app and API to expand usability. Below, we discuss a few opportunities to improve it further.

Context

Document question-answering is a popular LLM use case. LangChain makes it easy to assemble LLM components (e.g., models and retrievers) into chains that support question-answering: input documents are split into chunks and stored in a retriever; given a user question, the relevant chunks are retrieved and passed to an LLM, which synthesizes them into an answer.
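For concreteness, here is a minimal sketch of that pattern using 2023-era LangChain components. The file name, chunk size, retriever, and model below are illustrative choices, not the app's actual settings.

```python
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import FAISS
from langchain.chat_models import ChatOpenAI
from langchain.chains import RetrievalQA

# 1. Load the input document and split it into chunks
docs = PyPDFLoader("my_doc.pdf").load()
splits = RecursiveCharacterTextSplitter(
    chunk_size=1000, chunk_overlap=100
).split_documents(docs)

# 2. Embed the chunks and store them in a retriever (here, a FAISS vector store)
retriever = FAISS.from_documents(splits, OpenAIEmbeddings()).as_retriever(
    search_kwargs={"k": 3}
)

# 3. Retrieve relevant chunks for a question and synthesize an answer with an LLM
qa_chain = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0),
    chain_type="stuff",
    retriever=retriever,
)
print(qa_chain.run("What is the main topic of this document?"))
```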

Problem

The quality of QA systems can vary considerably; we have seen cases of hallucination and poor answer quality due to specific parameter settings. But it is not always obvious how to (1) evaluate answer quality and (2) use that evaluation to guide improved QA chain settings (e.g., chunk size, number of retrieved documents) or components (e.g., model or retriever choice).

App

The auto-evaluator aims to address these limitations. It is inspired by work in two areas: (1) Anthropic has used model-written evaluation sets, and (2) OpenAI has shown model-graded evaluation. This app combines both ideas into a single workspace: it auto-generates a QA test set for a given input document and auto-grades the result of the user-specified QA chain. LangChain's abstractions make it easy to configure a QA chain from modular components (shown in colors below).

[Diagram: QA chain configured from modular components]
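To make those two ideas concrete, here is a rough sketch using LangChain's built-in QAGenerationChain (model-written test set) and QAEvalChain (model-graded answers). This is an illustration rather than the app's actual implementation, and it reuses the docs and qa_chain variables from the sketch above.

```python
from langchain.chat_models import ChatOpenAI
from langchain.chains import QAGenerationChain
from langchain.evaluation.qa import QAEvalChain

llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)

# 1. Model-written eval set: generate question-answer pairs from the input doc
doc_text = " ".join(d.page_content for d in docs)
eval_set = QAGenerationChain.from_llm(llm).run(doc_text[:3000])

# 2. Run the user-specified QA chain on each generated question
predictions = [{"result": qa_chain.run(pair["question"])} for pair in eval_set]

# 3. Model-graded eval: an LLM grades each answer against the generated reference
grades = QAEvalChain.from_llm(llm).evaluate(
    eval_set,
    predictions,
    question_key="question",
    answer_key="answer",
    prediction_key="result",
)
for pair, grade in zip(eval_set, grades):
    print(pair["question"], "->", grade)
```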

Usage

We are now releasing an open-source, free-to-use hosted app and API for evaluating QA chains. The app can be used in two ways (see the README for more details):

[Screenshot: the two ways to use the app]

Opportunities for improvement

File handling

File transfer from the client to the back-end is slow. For 2 files (39 MB), the transfer takes ~40 sec:

| Stage          | Prod elapsed time (OAI embeddings) | Local elapsed time (OAI embeddings) |
| -------------- | ---------------------------------- | ----------------------------------- |
| Transfer file  | 37 sec                             | 0 sec                               |
| Reading file   | 5 sec                              | 1 sec                               |
| Splitting docs | 3 sec                              | 3 sec                               |
| Making LLM     | 1 sec                              | 1 sec                               |
| Make retriever | 6 sec                              | 2 sec                               |
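For reference, a small self-contained sketch (not the app's actual code) of how per-stage timings like those above can be captured; the sleep calls are placeholders for the real stages.

```python
import time
from contextlib import contextmanager

timings = {}

@contextmanager
def timed(stage):
    """Record wall-clock time for one pipeline stage."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[stage] = time.perf_counter() - start

# Placeholders stand in for the real work (file read, splitting, retriever build)
with timed("Reading file"):
    time.sleep(0.1)
with timed("Splitting docs"):
    time.sleep(0.1)
with timed("Make retriever"):
    time.sleep(0.1)

for stage, sec in timings.items():
    print(f"{stage}: {sec:.1f} sec")
```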