

Comparing GPT-4 and Claude 3 on long-context tasks

Why you should take benchmarks with a grain of salt

By Aman Dhesi

Anthropic released the Claude 3 family of models last month, claiming they beat GPT-4 on all benchmarks. Others, however, disputed these claims.

So which one is it - is Claude 3 better than GPT-4 or not? And isn't the whole point of benchmarks to evaluate the models objectively and remove the guesswork?

The answer is... it depends. Which model is better depends on what task you use the model for, and what data you use it on.

Instead of relying on third-party benchmarks, as Hamel Husain suggests, you should evaluate models on your own domain-specific data, with all its nuances and intricacies.

In this short blog post, we'll evaluate Claude 3 and GPT-4 on a specific long-context extraction task. We'll do this comparison using Superpipe, which makes it easy to swap in different models and compare them on accuracy, cost, and speed.

All the code and data is available in this Colab notebook.

The task - long context extraction

For some types of tasks, we need LLMs with very long context windows. Currently, the only LLMs with context windows longer than 100K tokens are GPT-4 and the Claude 3 family of models.

Conventional wisdom suggests that the bigger and more expensive a model is, the more accurate it is on all tasks. Let's evaluate whether this is true on a specific task - extracting information from Wikipedia pages of famous people.

Given the Wikipedia page of a (real) person, we'll use an LLM to extract their date of birth, whether or not they're still alive, and if not, their cause of death.

(Image: long context comparison)

We'll perform a single LLM call and pass in the entire contents of the Wikipedia page. Wikipedia pages of famous people can easily be more than 50k tokens in length, which is why only models with context windows longer than 100k are eligible for this task.
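Concretely, the single-call extraction can be sketched as follows. The helper names below (`build_prompt`, `parse_extraction`) are illustrative, not the notebook's actual code, and the LLM call itself is omitted:

```python
import json

# The prompt asks for bare JSON so the response is machine-parseable.
EXTRACTION_PROMPT = """Given the Wikipedia page below, reply with JSON only, with keys:
"date_of_birth" (YYYY-MM-DD), "is_alive" (true/false),
"cause_of_death" (string, or null if alive).

Page:
{page}
"""

def build_prompt(page_text: str) -> str:
    # The entire page goes into one prompt, which is why the model's
    # context window must exceed the page length (50k+ tokens).
    return EXTRACTION_PROMPT.format(page=page_text)

def parse_extraction(raw_response: str) -> dict:
    # Parse the model's JSON reply and check the expected keys are present.
    data = json.loads(raw_response)
    assert {"date_of_birth", "is_alive", "cause_of_death"} <= data.keys()
    return data
```

The actual model call (GPT-4 or a Claude 3 model) would sit between these two helpers.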

The data

Our dataset contains 49 data points, each with four fields:

  • a Wikipedia URL
  • the person's true date of birth
  • whether they're still alive
  • if not alive, their cause of death

(Image: long context comparison)

The latter three fields are the labels; they're used only to evaluate the results of the LLM extraction. All the data can be found along with the code here.
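For illustration, a row and a simple exact-match scoring function might look like this (the values below are made up, not taken from the actual dataset):

```python
# One illustrative row: the URL is the input, the other fields are labels.
row = {
    "url": "https://en.wikipedia.org/wiki/Alan_Turing",
    "date_of_birth": "1912-06-23",
    "is_alive": False,
    "cause_of_death": "cyanide poisoning",
}

def score(prediction: dict, label: dict) -> float:
    # Accuracy for one row = fraction of the three labeled fields
    # the model extracted correctly.
    fields = ["date_of_birth", "is_alive", "cause_of_death"]
    return sum(prediction.get(f) == label[f] for f in fields) / len(fields)
```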


The results are in:

  • The entire Claude 3 family outperforms GPT-4 on accuracy
  • Haiku and Sonnet are both significantly cheaper than GPT-4 (34x and 3x, respectively)
  • There's no accuracy benefit in using Opus over Sonnet.

Based on these results, I would deploy Sonnet if I mainly cared about accuracy and Haiku if I was cost-sensitive.

(Image: long context comparison)

Using superpipe to compare models

Superpipe takes care of all the boilerplate when comparing models on tasks, including:

  • Defining an eval function, including using LLMs to perform evaluation
  • Keeping track of token usage and latency
  • Error handling to make sure a single error doesn't tank your whole experiment

To learn more about how to use Superpipe, check out the docs
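In plain Python, the bookkeeping that this boilerplate involves might look roughly like the loop below. This is an illustrative sketch of what gets automated, not Superpipe's actual API; `run_model` and `score` are user-supplied stand-ins:

```python
import time

def compare_models(models, dataset, run_model, score):
    # run_model(model, row) performs the LLM call; score(pred, row)
    # compares the prediction against the row's labels.
    results = {}
    for model in models:
        stats = {"correct": 0, "errors": 0, "latency": 0.0}
        for row in dataset:
            start = time.perf_counter()
            try:
                pred = run_model(model, row)
            except Exception:
                # A single failed call shouldn't tank the whole experiment.
                stats["errors"] += 1
                continue
            stats["latency"] += time.perf_counter() - start
            stats["correct"] += score(pred, row)
        results[model] = stats
    return results
```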

Introducing Superpipe

By Aman Dhesi and Ben Scharfstein

Superpipe is a lightweight framework to build, evaluate, and optimize LLM pipelines for structured outputs: data labeling, extraction, classification, and tagging. Evaluate pipelines on your own data and optimize models, prompts, and other parameters for the best accuracy, cost, and speed.

Why we built Superpipe

For the past few months we've been helping companies with structured data problems like product classification and document extraction.

Through working on these projects, we noticed a few problems.

  1. Many companies were doing “vibe-check” engineering with LLMs and had no idea how accurate their prompts and pipelines were.
  2. Usually this was because they didn't have labeled data to evaluate on. Even when they did, their labels were often inaccurate.
  3. And they lacked tools to rigorously compare different approaches, prompts, models and parameters. This resulted in a mess of experiments spread across notebooks.

Despite these limitations, we were able to get very high quality results. First, we learned that LLMs are actually very good at classification and extraction when used correctly. Second, we came to the conclusion that multi-step techniques worked quite well on a wide variety of use cases and outperformed zero-shot prompting much of the time. And finally, we observed that there’s usually a lot of cost and speed headroom without a loss in quality if you use cheaper models for “easy” steps, and expensive models for “hard” steps.
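That last observation reduces to a simple routing rule: send "easy" steps to a cheap model and "hard" steps to an expensive one. The model names and step names below are illustrative assumptions, not a prescription:

```python
# Hypothetical model choices for cheap vs. expensive steps.
CHEAP, EXPENSIVE = "claude-3-haiku", "gpt-4"

def pick_model(step_difficulty: str) -> str:
    # Route hard steps to the expensive model, everything else to the cheap one.
    return EXPENSIVE if step_difficulty == "hard" else CHEAP

# A two-step pipeline annotated with each step's difficulty.
pipeline = [
    ("extract_candidate_fields", "easy"),
    ("resolve_ambiguous_cases", "hard"),
]
plan = [(name, pick_model(difficulty)) for name, difficulty in pipeline]
```

In practice you'd estimate difficulty empirically, by checking where the cheap model's accuracy drops off on labeled data.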

After a few experiments and projects, our process became:

  1. Build and run a v1 pipeline with a powerful model like GPT-4 for each step.
  2. Evaluate our pipeline by manually labeling ground truth data.
  3. Optimize the pipeline over prompts, models, and other parameters.

We built Superpipe to productize the process of building, evaluating, and optimizing these multi-step LLM pipelines. With Superpipe, we’re able to build 10x cheaper pipelines, 10x faster.

Today we’re open-sourcing Superpipe under the MIT license so that you can build faster, better, and cheaper pipelines as well. You can view the source code here.

Superpipe helps engineers think like scientists

(Image: Venn diagram of Superpipe)

As ML engineers (even before LLMs) we knew the importance of high quality ground truth data and proper evaluation metrics. However, what we learned is that those without experience building probabilistic systems hadn’t necessarily learned those lessons yet. Now that every engineer can use AI with a few lines of code, it’s important that engineers start thinking more like scientists.

To put it in traditional software engineering terms, Superpipe brings test-driven development to LLM pipelines. You can think of each labeled data point as a unit test. You wouldn't ship traditional software without unit tests, and you shouldn't ship LLM software without evals.
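In that spirit, an eval run is just a test suite over labeled rows. A minimal sketch, assuming a `pipeline` callable and rows with `input`/`label` keys:

```python
def run_evals(pipeline, labeled_rows):
    # Each labeled row acts like a unit test: the pipeline "passes"
    # the row when its output matches the label exactly.
    failures = []
    for row in labeled_rows:
        if pipeline(row["input"]) != row["label"]:
            failures.append(row)
    return failures  # an empty list means every test passed
```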

Tests will help you evaluate accuracy, but that’s only half the equation. When building with LLMs, there’s generally a tradeoff between cost/speed and accuracy. On the same pipeline and prompts, cheaper models are generally faster but less accurate.

However, pipelines aren’t static. You can vary prompts, augment with retrieval, chain LLMs, enrich with public information, and more. Superpipe will help you iterate faster and build cheaper, more accurate classification and extraction pipelines. In many cases, you can skip straight from v1 to v6.

(Image: iterating with Superpipe)

How it works

There are three steps to using Superpipe:

  1. Build - create a multi-step Pipeline using Steps
  2. Evaluate - generate and label ground truth data. Evaluate your results on speed, cost and accuracy.
  3. Optimize - run a Grid Search over the parameters of your pipeline to understand the tradeoffs between models, prompts, and other parameters.

The result of this process is a rigorous understanding of the cost, speed, and accuracy tradeoffs between different approaches, conveniently presented to you right in your notebook.
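A grid search of this kind can be sketched in a few lines of plain Python. This is illustrative, not Superpipe's actual API; `evaluate_fn` stands in for running the pipeline on labeled data and returning its metrics:

```python
from itertools import product

def grid_search(param_grid, evaluate_fn):
    # Try every combination of parameter values and record its metrics.
    results = []
    for combo in product(*param_grid.values()):
        params = dict(zip(param_grid.keys(), combo))
        results.append((params, evaluate_fn(params)))
    return results

# Example grid: 2 models x 2 prompt variants = 4 pipeline configurations.
grid = {"model": ["gpt-4", "claude-3-sonnet"], "prompt": ["v1", "v2"]}
```

Each entry in the result pairs a parameter combination with its metrics, which is what lets you read off the accuracy/cost/speed tradeoffs.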

(Image: grid search)

Learn more

The best way to learn more about Superpipe is by reading our docs, checking out our GitHub, or asking questions on our Discord.