Custom Evaluation functions¶
Superpipe can also be used to evaluate generative output by defining custom eval functions. There are two primary ways to evaluate the quality of generative output:
- Assertions
- Side-by-side comparisons
In this example, we'll ask GPT-4 to help us with both approaches.
```python
from superpipe import *
import pandas as pd
from pydantic import BaseModel, Field
```
Defining our pipeline¶
The LLMStep takes in a prompt, model, and name, and outputs a column with that name. In this case, we just want the LLM to tell us a joke.
```python
joke_prompt = lambda row: f"""
Tell me a joke about {row['topic']}
"""

JokesStep = steps.LLMStep(
    prompt=joke_prompt,
    model=models.gpt35,
    name="joke"
)
```
```python
topics = ['Beets', 'Bears', 'Battlestar Gallactica']
jokes_df = pd.DataFrame(topics)
jokes_df.columns = ["topic"]
jokes_df
```
|  | topic |
|---|---|
| 0 | Beets |
| 1 | Bears |
| 2 | Battlestar Gallactica |
Before we define our pipeline, we need to create our evaluation function. We'll reuse Superpipe's get_structured_llm_response helper to make our lives easier. An evaluation function just needs to take a row and return a boolean.
```python
def evaluate_prompt(row):
    return f"""
Is the following joke pretty funny? Your bar should be making a friend laugh out loud
{row['joke']}
Return a json object with a single boolean key called 'evaluation'
"""

def evaluate_joke(row):
    return llm.get_structured_llm_response(evaluate_prompt(row), models.gpt4).content['evaluation']
```
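Note that an evaluation function doesn't have to call an LLM at all: any function that takes a row and returns a boolean works, so the assertion approach mentioned at the top can be as simple as a few deterministic checks. A minimal sketch, with hypothetical criteria that aren't part of the Superpipe docs:

```python
def evaluate_joke_assertions(row):
    # Hypothetical assertion-style checks: deterministic, no LLM call needed
    joke = row['joke']
    is_nonempty = len(joke.strip()) > 0
    is_reasonable_length = len(joke) < 500  # arbitrary cap to catch rambling outputs
    return is_nonempty and is_reasonable_length
```

You'd plug it into the pipeline the same way, via evaluation_fn=evaluate_joke_assertions. Assertions are cheap and repeatable, but for a subjective quality like humor we'll lean on GPT-4 below.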
Now let's run our simple pipeline.
```python
comedian = pipeline.Pipeline([
    JokesStep,
], evaluation_fn=evaluate_joke)

comedian.run(jokes_df)
```
```
Running step joke...
100%|██████████| 3/3 [00:01<00:00, 1.91it/s]

0     True
1     True
2    False
dtype: bool
```
|  | topic | __joke__ | joke | __joke4__ | joke4 |
|---|---|---|---|---|---|
| 0 | Beets | {'input_tokens': 16, 'output_tokens': 17, 'inp... | I tried to make a beet pun, but it just didn't... | {'input_tokens': 16, 'output_tokens': 14, 'inp... | Why did the beet turn red?\n\nBecause it saw t... |
| 1 | Bears | {'input_tokens': 15, 'output_tokens': 13, 'inp... | Why did the bear dissolve in water?\n\nBecause... | {'input_tokens': 15, 'output_tokens': 15, 'inp... | Why don't bears wear socks?\n\nBecause they li... |
| 2 | Battlestar Gallactica | {'input_tokens': 19, 'output_tokens': 29, 'inp... | Why did the Cylon break up with his girlfriend... | {'input_tokens': 19, 'output_tokens': 22, 'inp... | Why did the Cylon buy an iPhone?\n\nBecause he... |
```python
comedian.statistics
```
```
PipelineStatistics(score=0.6666666666666666, input_tokens=defaultdict(<class 'int'>, {}), output_tokens=defaultdict(<class 'int'>, {}), input_cost=0.0, output_cost=0.0, num_success=0, num_failure=0, total_latency=0.0)
```
It looks like our eval is working: GPT-4 thought 2 of the 3 jokes were funny. Keep in mind that this eval is extremely subjective, and slight changes to the evaluation prompt can move the score dramatically.
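Since the score hinges on a one-line rubric, a useful sanity check is to tighten the prompt and re-run the same pipeline to see how much the score moves. A quick sketch, using an illustrative stricter rubric and hypothetical names (strict_evaluate_joke, strict_comedian):

```python
def strict_evaluate_prompt(row):
    # Same structure as evaluate_prompt, but with a (hypothetically) higher bar
    return f"""
Is the following joke genuinely hilarious? Your bar should be a professional comedian
being impressed, not just a polite chuckle.
{row['joke']}
Return a json object with a single boolean key called 'evaluation'
"""

def strict_evaluate_joke(row):
    return llm.get_structured_llm_response(strict_evaluate_prompt(row), models.gpt4).content['evaluation']

# Re-run the same single-step pipeline with the stricter eval; the score will likely drop
strict_comedian = pipeline.Pipeline([
    JokesStep,
], evaluation_fn=strict_evaluate_joke)
strict_comedian.run(jokes_df)
strict_comedian.statistics
```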
Side-by-side evals¶
Side-by-side comparisons are still subjective, but they're often better aligned with the decision an AI engineer actually faces: the question is usually not "is this funny?" but "which model is funnier?"
Let's create a GPT-4 joke step to compare to.
```python
JokesStep4 = steps.LLMStep(
    prompt=joke_prompt,
    model=models.gpt4,
    name="joke4"
)
```
Now we create an evaluation function that compares the jokes side-by-side:
```python
def evaluate_side_by_side_prompt(row):
    return f"""
You are given two jokes. Rate which is funnier:
joke 1: {row['joke']}
joke 2: {row['joke4']}
If joke 1 is funnier or they are similar, return false. If joke 2 is funnier return true.
Return a json object with a single boolean key called 'evaluation'
"""

def evaluate_side_by_side(row):
    return llm.get_structured_llm_response(evaluate_side_by_side_prompt(row), models.gpt4).content['evaluation']
```
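One caveat with LLM judges in a side-by-side setup is that they can be biased toward whichever answer is shown first. The Superpipe example doesn't address this, but if you want to control for it, a minimal sketch (hypothetical helper, not part of the library) is to randomize the order and map the verdict back:

```python
import random

def evaluate_side_by_side_shuffled(row):
    # Randomize which joke the judge sees first to reduce position bias,
    # then map the verdict back to "is joke4 funnier?"
    flipped = random.random() < 0.5
    first, second = (row['joke4'], row['joke']) if flipped else (row['joke'], row['joke4'])
    prompt = f"""
You are given two jokes. Rate which is funnier:
joke 1: {first}
joke 2: {second}
If joke 1 is funnier or they are similar, return false. If joke 2 is funnier return true.
Return a json object with a single boolean key called 'evaluation'
"""
    verdict = llm.get_structured_llm_response(prompt, models.gpt4).content['evaluation']
    # Ties go to whichever joke happened to be shown first, which the random
    # ordering also spreads evenly across the two models
    return (not verdict) if flipped else verdict
```

It drops into the pipeline exactly like evaluate_side_by_side, via evaluation_fn=evaluate_side_by_side_shuffled.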
```python
jokes_df[['joke', 'joke4']].values
```
```
array([["I tried to make a beet pun, but it just didn't turnip right.",
        'Why did the beet turn red?\n\nBecause it saw the salad dressing!'],
       ['Why did the bear dissolve in water?\n\nBecause it was polar!',
        "Why don't bears wear socks?\n\nBecause they like to walk bear-foot!"],
       ['Why did the Cylon break up with his girlfriend? She kept telling him, "You can\'t Frak your way out of every problem!"',
        'Why did the Cylon buy an iPhone?\n\nBecause he heard it comes with Siri-usly good voice recognition!']],
      dtype=object)
```
```python
comedian = pipeline.Pipeline([
    JokesStep,
    JokesStep4,
], evaluation_fn=evaluate_side_by_side)

comedian.run(jokes_df)
```
```
Running step joke...
100%|██████████| 3/3 [00:01<00:00, 1.82it/s]
Running step joke4...
100%|██████████| 3/3 [00:03<00:00, 1.07s/it]

0    True
1    True
2    True
dtype: bool
```
|  | topic | __joke__ | joke | __joke4__ | joke4 |
|---|---|---|---|---|---|
| 0 | Beets | {'input_tokens': 16, 'output_tokens': 22, 'inp... | Why did the beet break up with the turnip? Bec... | {'input_tokens': 16, 'output_tokens': 14, 'inp... | Why did the beet turn red?\n\nBecause it saw t... |
| 1 | Bears | {'input_tokens': 15, 'output_tokens': 11, 'inp... | Why do bears have hairy coats?\n\nFur protection! | {'input_tokens': 15, 'output_tokens': 15, 'inp... | Why don't bears wear socks?\n\nBecause they li... |
| 2 | Battlestar Gallactica | {'input_tokens': 19, 'output_tokens': 18, 'inp... | Why did the Cylon break up with the toaster?\n... | {'input_tokens': 19, 'output_tokens': 26, 'inp... | Why did the Cylon go to Starbucks?\n\nBecause ... |
```python
comedian.statistics
```
```
PipelineStatistics(score=1.0, input_tokens=defaultdict(<class 'int'>, {}), output_tokens=defaultdict(<class 'int'>, {}), input_cost=0.0, output_cost=0.0, num_success=0, num_failure=0, total_latency=0.0)
```
It seems like GPT-4 prefers its own jokes! Of course, this isn't a rigorous evaluation, but it should be enough to get you started evaluating generative outputs.