Custom Evaluation functions¶
Superpipe can also be used to evaluate generative output by defining custom eval functions. There are two primary ways to evaluate the quality of generative output:
- Assertions
- Side-by-side comparisons
In this example, we'll ask GPT-4 to help us with both approaches.
```python
from superpipe import *
import pandas as pd
from pydantic import BaseModel, Field
```
Defining our pipeline¶
The LLMStep takes in a prompt, model, and name, and outputs a column with that name. In this case, we just want the LLM to tell us a joke.
```python
joke_prompt = lambda row: f"""
Tell me a joke about {row['topic']}
"""

JokesStep = steps.LLMStep(
    prompt=joke_prompt,
    model=models.gpt35,
    name="joke"
)
```
```python
topics = ['Beets', 'Bears', 'Battlestar Gallactica']
jokes_df = pd.DataFrame(topics)
jokes_df.columns = ["topic"]
jokes_df
```
|  | topic |
|---|---|
| 0 | Beets |
| 1 | Bears |
| 2 | Battlestar Gallactica |
Before we define our pipeline, we need to create our evaluation function. We'll reuse Superpipe's get_structured_llm_response helper to make our lives easier. An evaluation function just needs to take a row and return a boolean.
```python
def evaluate_prompt(row):
    return f"""
Is the following joke pretty funny? Your bar should be making a friend laugh out loud
{row['joke']}
Return a json object with a single boolean key called 'evaluation'
"""

def evaluate_joke(row):
    return llm.get_structured_llm_response(evaluate_prompt(row), models.gpt4).content['evaluation']
```
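Note that an evaluation function doesn't have to call an LLM at all: any function that takes a row and returns a boolean works, so the assertion approach mentioned at the top can be as simple as a few deterministic checks. A minimal sketch, with hypothetical criteria that aren't part of the Superpipe docs:

```python
def evaluate_joke_assertions(row):
    # Hypothetical assertion-style checks: deterministic, no LLM call needed
    joke = row['joke']
    is_nonempty = len(joke.strip()) > 0
    is_reasonable_length = len(joke) < 500  # arbitrary cap to catch rambling outputs
    return is_nonempty and is_reasonable_length
```

You'd plug it into the pipeline the same way, via evaluation_fn=evaluate_joke_assertions. Assertions are cheap and repeatable, but for a subjective quality like humor we'll lean on GPT-4 below.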
Now let's run our simple pipeline.
```python
comedian = pipeline.Pipeline([
    JokesStep,
], evaluation_fn=evaluate_joke)

comedian.run(jokes_df)
```
```
Running step joke...
100%|██████████| 3/3 [00:01<00:00, 1.91it/s]

0     True
1     True
2    False
dtype: bool
```
|  | topic | __joke__ | joke | __joke4__ | joke4 |
|---|---|---|---|---|---|
| 0 | Beets | {'input_tokens': 16, 'output_tokens': 17, 'inp... | I tried to make a beet pun, but it just didn't... | {'input_tokens': 16, 'output_tokens': 14, 'inp... | Why did the beet turn red?\n\nBecause it saw t... |
| 1 | Bears | {'input_tokens': 15, 'output_tokens': 13, 'inp... | Why did the bear dissolve in water?\n\nBecause... | {'input_tokens': 15, 'output_tokens': 15, 'inp... | Why don't bears wear socks?\n\nBecause they li... |
| 2 | Battlestar Gallactica | {'input_tokens': 19, 'output_tokens': 29, 'inp... | Why did the Cylon break up with his girlfriend... | {'input_tokens': 19, 'output_tokens': 22, 'inp... | Why did the Cylon buy an iPhone?\n\nBecause he... |
```python
comedian.statistics
```
```
PipelineStatistics(score=0.6666666666666666, input_tokens=defaultdict(<class 'int'>, {}), output_tokens=defaultdict(<class 'int'>, {}), input_cost=0.0, output_cost=0.0, num_success=0, num_failure=0, total_latency=0.0)
```
It looks like our eval is working: GPT-4 thought 2 of the 3 jokes were funny. Keep in mind that this eval is extremely subjective, and slight changes to the evaluation prompt can move the score dramatically.
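Since the score hinges on a one-line rubric, a useful sanity check is to tighten the prompt and re-run the same pipeline to see how much the score moves. A quick sketch, using an illustrative stricter rubric and hypothetical names (strict_evaluate_joke, strict_comedian):

```python
def strict_evaluate_prompt(row):
    # Same structure as evaluate_prompt, but with a (hypothetically) higher bar
    return f"""
Is the following joke genuinely hilarious? Your bar should be a professional comedian
being impressed, not just a polite chuckle.
{row['joke']}
Return a json object with a single boolean key called 'evaluation'
"""

def strict_evaluate_joke(row):
    return llm.get_structured_llm_response(strict_evaluate_prompt(row), models.gpt4).content['evaluation']

# Re-run the same single-step pipeline with the stricter eval; the score will likely drop
strict_comedian = pipeline.Pipeline([
    JokesStep,
], evaluation_fn=strict_evaluate_joke)
strict_comedian.run(jokes_df)
strict_comedian.statistics
```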
Side-by-side evals¶
Side-by-side comparisons are still subjective, but they're often better aligned with the decision an AI engineer actually faces: the question is usually not "is this funny?" but "which model is funnier?"
Let's create a GPT-4 joke step to compare to.
```python
JokesStep4 = steps.LLMStep(
    prompt=joke_prompt,
    model=models.gpt4,
    name="joke4"
)
```
Now we create an evaluation function that compares the jokes side-by-side:
```python
def evaluate_side_by_side_prompt(row):
    return f"""
You are given two jokes. Rate which is funnier:
joke 1: {row['joke']}
joke 2: {row['joke4']}
If joke 1 is funnier or they are similar, return false. If joke 2 is funnier return true.
Return a json object with a single boolean key called 'evaluation'
"""

def evaluate_side_by_side(row):
    return llm.get_structured_llm_response(evaluate_side_by_side_prompt(row), models.gpt4).content['evaluation']
```
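One caveat with LLM judges in a side-by-side setup is that they can be biased toward whichever answer is shown first. The Superpipe example doesn't address this, but if you want to control for it, a minimal sketch (hypothetical helper, not part of the library) is to randomize the order and map the verdict back:

```python
import random

def evaluate_side_by_side_shuffled(row):
    # Randomize which joke the judge sees first to reduce position bias,
    # then map the verdict back to "is joke4 funnier?"
    flipped = random.random() < 0.5
    first, second = (row['joke4'], row['joke']) if flipped else (row['joke'], row['joke4'])
    prompt = f"""
You are given two jokes. Rate which is funnier:
joke 1: {first}
joke 2: {second}
If joke 1 is funnier or they are similar, return false. If joke 2 is funnier return true.
Return a json object with a single boolean key called 'evaluation'
"""
    verdict = llm.get_structured_llm_response(prompt, models.gpt4).content['evaluation']
    # Ties go to whichever joke happened to be shown first, which the random
    # ordering also spreads evenly across the two models
    return (not verdict) if flipped else verdict
```

It drops into the pipeline exactly like evaluate_side_by_side, via evaluation_fn=evaluate_side_by_side_shuffled.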
```python
jokes_df[['joke', 'joke4']].values
```
```
array([["I tried to make a beet pun, but it just didn't turnip right.",
        'Why did the beet turn red?\n\nBecause it saw the salad dressing!'],
       ['Why did the bear dissolve in water?\n\nBecause it was polar!',
        "Why don't bears wear socks?\n\nBecause they like to walk bear-foot!"],
       ['Why did the Cylon break up with his girlfriend? She kept telling him, "You can\'t Frak your way out of every problem!"',
        'Why did the Cylon buy an iPhone?\n\nBecause he heard it comes with Siri-usly good voice recognition!']],
      dtype=object)
```
```python
comedian = pipeline.Pipeline([
    JokesStep,
    JokesStep4,
], evaluation_fn=evaluate_side_by_side)

comedian.run(jokes_df)
```
```
Running step joke...
100%|██████████| 3/3 [00:01<00:00, 1.82it/s]
Running step joke4...
100%|██████████| 3/3 [00:03<00:00, 1.07s/it]

0    True
1    True
2    True
dtype: bool
```
|  | topic | __joke__ | joke | __joke4__ | joke4 |
|---|---|---|---|---|---|
| 0 | Beets | {'input_tokens': 16, 'output_tokens': 22, 'inp... | Why did the beet break up with the turnip? Bec... | {'input_tokens': 16, 'output_tokens': 14, 'inp... | Why did the beet turn red?\n\nBecause it saw t... |
| 1 | Bears | {'input_tokens': 15, 'output_tokens': 11, 'inp... | Why do bears have hairy coats?\n\nFur protection! | {'input_tokens': 15, 'output_tokens': 15, 'inp... | Why don't bears wear socks?\n\nBecause they li... |
| 2 | Battlestar Gallactica | {'input_tokens': 19, 'output_tokens': 18, 'inp... | Why did the Cylon break up with the toaster?\n... | {'input_tokens': 19, 'output_tokens': 26, 'inp... | Why did the Cylon go to Starbucks?\n\nBecause ... |
```python
comedian.statistics
```
```
PipelineStatistics(score=1.0, input_tokens=defaultdict(<class 'int'>, {}), output_tokens=defaultdict(<class 'int'>, {}), input_cost=0.0, output_cost=0.0, num_success=0, num_failure=0, total_latency=0.0)
```
It seems like GPT-4 prefers its own jokes! Of course, this isn't a rigorous evaluation, but it should be enough to get you started evaluating generative outputs.