Comparing different approaches¶
There are many ways to build a labeling pipeline that will all accomplish the same result. The goal of Superpipe
is to empower rapid and robust experimentation so that you can understand the performance, accuracy, and cost tradeoffs between approaches.
In this example, we'll experiment with a few different approaches to a categorization pipeline. Superpipe
makes this experimentation quick, and at the end we'll have a solid understanding of how the different approaches perform.
Task¶
The task at hand is to categorize furniture items into a multi-level taxonomy based on their name and description.
For example:
Name: Blair Table by homestyles
Description: This Blair Table by homestyles is perfect for Sunday brunches or game night. The round pedestal table is available as shown, or as part of a five-piece set. Features solid hardwood construction in a black finish that can easily match a traditional or contemporary aesthetic. Measures: 30"H x 42" Diameter
Correct classification: Tables & Desks > Dining Tables
Approaches¶
There are two different approaches we want to try.
- LLMs + embeddings
- Hierarchical prompting
from dotenv import load_dotenv
load_dotenv()
import os
OPENAI_API_KEY = os.getenv('OPENAI_API_KEY')
COHERE_API_KEY = os.getenv('COHERE_API_KEY')
# %pip install cohere
import pandas as pd
from superpipe import *
from pydantic import BaseModel, Field
import cohere
import os
import numpy as np
from typing import List
Data processing¶
We'll start by reading in our data and building our taxonomy. Building a taxonomy is a project in and of itself, and there are many taxonomies available online that you can use. In our case, we build our taxonomy from our ground truth dataset. Since the dataset is large, we can be reasonably confident that all values are represented. As you'll see, our approach does not use the ground truth data as training data, so it will be easy to expand the taxonomy later without needing additional data.
df = pd.read_csv('./furniture_clean.csv')
# Remove the 'Furniture > ' from each string in the 'category' column since they all start with Furniture.
df['category_new'] = df['category'].str.replace('Furniture > ', '')
For our embeddings approach we want each taxonomy entry to be a single string containing the full category path. We'll create the taxonomy from the ground truth data.
taxonomy = list(set(df['category_new']))
taxonomy[0:5]
['Outdoor Tables > Outdoor Coffee Tables', 'Chairs > Dining Chairs', 'Tables & Desks > Bar Carts', 'Chairs > Accent Chairs', 'Chairs > Desk Chairs']
However, for our hierarchical approach we need to understand the taxonomy's structure a little more, so we'll create a lookup table from first-level to second-level categories.
# Create a lookup table with first level taxonomy as keys and second level as values
lookup_table = df['category_new'].str.split(' > ', expand=True).groupby(0)[1].apply(list).apply(set)
lookup_table['Chairs']
{'Accent Chairs', 'Desk Chairs', 'Dining Chairs', 'Recliners'}
Building our pipeline using Superpipe¶
Approach 1: Embeddings¶
The first approach is similar to the one we took in the Product Categorization
example in the project repo. We omit the Google Search step because we already have item descriptions.
- Write a simple description of the product given name and description
- Vector embedding search for top N categories
- LLM: pick the best category
short_description_prompt = lambda row: f"""
You are given a product name and description for a piece of furniture.
Return a single sentence describing the product.
Product name: {row['name']}
Product description: {row['description']}
"""
class ShortDescription(BaseModel):
short_description: str = Field(description="A single sentence describing the product")
short_description_step = steps.LLMStructuredStep(
prompt=short_description_prompt,
model=models.gpt35,
out_schema=ShortDescription,
name="short_description"
)
We are using Cohere to embed both our descriptions and the taxonomy, but you can substitute any embeddings provider into the EmbeddingSearchStep
(a sketch of an OpenAI-based alternative appears after the Cohere setup below). Unlike LLMs, which are good at ignoring irrelevant information, embedding search works better in our experience with short, simple descriptions than with everything stuffed in. This is something you can and should experiment with.
# set your cohere api key as an env var or set it directly here
COHERE_API_KEY = os.environ.get('COHERE_API_KEY')
co = cohere.Client(COHERE_API_KEY)
def embed_fn(texts: List[str]):
embeddings = co.embed(
model="embed-english-v3.0",
texts=texts,
input_type='classification'
).embeddings
return np.array(embeddings).astype('float32')
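If you want to swap providers, here is a hedged sketch (not used in this example) of an equivalent embed_fn backed by OpenAI embeddings. It assumes the openai>=1.0 client and the text-embedding-3-small model, and returns the same float32 matrix shape as the Cohere version above.
from openai import OpenAI

openai_client = OpenAI()  # reads OPENAI_API_KEY from the environment

def openai_embed_fn(texts: List[str]):
    # Embed a batch of texts and return a float32 matrix, mirroring embed_fn above.
    response = openai_client.embeddings.create(
        model="text-embedding-3-small",
        input=texts,
    )
    return np.array([d.embedding for d in response.data]).astype('float32')
Either function can be passed as embed_fn to the EmbeddingSearchStep below.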
embedding_search_prompt = lambda row: row["short_description"]
embedding_search_step = steps.EmbeddingSearchStep(
search_prompt= embedding_search_prompt,
embed_fn=embed_fn,
k=5,
candidates=taxonomy,
name="embedding_search"
)
We now take the embedding search candidates and ask the LLM to pick the best one. It's important that our embedding search is optimized for recall: if the correct category isn't among the candidates, the categorize step has no chance of succeeding. A quick way to sanity-check recall is sketched below.
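This is a hedged sketch, not part of the pipeline: it assumes the embedding search step has already been run on a labeled sample and that each row exposes its candidates as candidate1 through candidateN under the step's output, the same layout the predicted_category step below reads from.
def embedding_recall(sample_df, k=5):
    # Fraction of rows whose ground-truth category appears among the top-k candidates.
    hits = 0
    for _, row in sample_df.iterrows():
        candidates = [row["embedding_search"].get(f"candidate{i}") for i in range(1, k + 1)]
        hits += int(row["category_new"] in candidates)
    return hits / len(sample_df)
With recall in a good place, the categorize step below asks the LLM to pick the best candidate.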
def categorize_prompt(row):
categories = ""
i = 1
while f"candidate{i}" in row:
categories += f'{i}. {row["embedding_search"][f"candidate{i}"]}\n'
i += 1
return f"""
You are given a product description and {i-1} options for the product's category.
Pick the index of the most accurate category.
The index must be between 1 and {i-1}.
Product description: {row['short_description']}
Categories:
{categories}
"""
class CategoryIndex(BaseModel):
category_index: int = Field(description="The index of the most accurate category")
categorize_step = steps.LLMStructuredStep(
prompt=categorize_prompt,
model=models.gpt35,
out_schema=CategoryIndex,
name="categorize"
)
By returning just the index we can ensure that the actual string we use is in the taxonomy since LLMs sometimes hallucinate characters. Additionally, we don't need to waste response tokens on printing the entire string.
predicted_category_step = steps.CustomStep(
transform=lambda row: row["embedding_search"][f'candidate{row["category_index"]}'],
name="predicted_category"
)
We'd like to test our end-to-end pipeline before we go any further. We'll make a copy of the first five rows of the dataframe and run the pipeline on it to make sure everything works.
test_df = df.head(5).copy()
evaluate = lambda row: row['predicted_category'].lower() == row['category_new'].lower()
categorizer = pipeline.Pipeline([
short_description_step,
embedding_search_step,
categorize_step,
predicted_category_step
], evaluation_fn=evaluate)
categorizer.run(test_df)
Applying step short_description: 100%|██████████| 5/5 [00:08<00:00, 1.60s/it] Applying step embedding_search: 100%|██████████| 5/5 [00:00<00:00, 7.53it/s] Applying step categorize: 100%|██████████| 5/5 [00:02<00:00, 1.72it/s] Applying step predicted_category: 100%|██████████| 5/5 [00:00<00:00, 9271.23it/s]
name | description | category | brand.name | category_new | __short_description__ | short_description | category1 | category2 | category3 | category4 | category5 | __categorize__ | category_index | predicted_category | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | EnGauge Deluxe Bedframe | Introducing the Engauge Deluxe Bedframe - the ... | Furniture > Beds & Headboards > Bedframes | NaN | Beds & Headboards > Bedframes | {'input_tokens': 313, 'output_tokens': 62, 'in... | Introducing the EnGauge Deluxe Bedframe - a st... | Beds & Headboards > Bedframes | Beds & Headboards > Beds | Beds & Headboards > Headboards | Mattresses & Box Springs > Mattresses | Mattresses & Box Springs > Box Springs & Found... | {'input_tokens': 208, 'output_tokens': 10, 'in... | 1 | Beds & Headboards > Bedframes |
1 | Sparrow & Wren Sullivan King Channel-Stitched ... | 85"L x 83"W x 56"H | Total weight: 150 lbs. | ... | Furniture > Beds & Headboards > Beds | Sparrow & Wren | Beds & Headboards > Beds | {'input_tokens': 169, 'output_tokens': 68, 'in... | Handcrafted Sparrow & Wren Sullivan King Chann... | Beds & Headboards > Headboards | Beds & Headboards > Beds | Beds & Headboards > Bedframes | Kids Beds & Headboards > Kid's Beds | Mattresses & Box Springs > Mattresses | {'input_tokens': 213, 'output_tokens': 10, 'in... | 2 | Beds & Headboards > Beds |
2 | Queen Bed With Frame | Dimensions:Head Board -49H x 63.75W x 1.5DFoot... | Furniture > Beds & Headboards > Beds | Hillsdale | Beds & Headboards > Beds | {'input_tokens': 124, 'output_tokens': 60, 'in... | The Queen Bed With Frame features a stylish de... | Beds & Headboards > Bedframes | Beds & Headboards > Headboards | Beds & Headboards > Beds | Kids Beds & Headboards > Kid's Beds | Sets > Bedroom Furniture Sets | {'input_tokens': 202, 'output_tokens': 10, 'in... | 1 | Beds & Headboards > Bedframes |
3 | Dylan Queen Bed | Add a touch of a modern farmhouse to your bedr... | Furniture > Beds & Headboards > Beds | NaN | Beds & Headboards > Beds | {'input_tokens': 140, 'output_tokens': 49, 'in... | Add a touch of modern farmhouse to your bedroo... | Beds & Headboards > Headboards | Beds & Headboards > Beds | Beds & Headboards > Bedframes | Sets > Bedroom Furniture Sets | Kids Beds & Headboards > Kid's Beds | {'input_tokens': 191, 'output_tokens': 10, 'in... | 2 | Beds & Headboards > Beds |
4 | Sparrow & Wren Mara Full Diamond-Tufted Bed | 78"L x 56"W x 51"H | Total weight: 130 lbs. | ... | Furniture > Beds & Headboards > Beds | Sparrow & Wren | Beds & Headboards > Beds | {'input_tokens': 168, 'output_tokens': 97, 'in... | The Sparrow & Wren Mara Full Diamond-Tufted Be... | Beds & Headboards > Headboards | Beds & Headboards > Beds | Beds & Headboards > Bedframes | Kids Beds & Headboards > Kid's Beds | Sets > Bedroom Furniture Sets | {'input_tokens': 236, 'output_tokens': 10, 'in... | 2 | Beds & Headboards > Beds |
Let's print our pipeline statistics and see how it's doing.
print(categorizer.statistics)
+---------------+------------------------------+
| score         | 0.8                          |
+---------------+------------------------------+
| input_tokens  | {'gpt-3.5-turbo-0125': 1964} |
+---------------+------------------------------+
| output_tokens | {'gpt-3.5-turbo-0125': 386}  |
+---------------+------------------------------+
| input_cost    | $0.0009819999999999998       |
+---------------+------------------------------+
| output_cost   | $0.000579                    |
+---------------+------------------------------+
| num_success   | 5                            |
+---------------+------------------------------+
| num_failure   | 0                            |
+---------------+------------------------------+
| total_latency | 10.87322429305641            |
+---------------+------------------------------+
Our pipeline is doing well but that's only on 5 data points. Let's try it on a few more.
test_df100 = df.head(100).copy()
categorizer.run(test_df100)
Applying step short_description: 100%|██████████| 100/100 [02:10<00:00, 1.31s/it] Applying step embedding_search: 100%|██████████| 100/100 [00:14<00:00, 7.09it/s] Applying step categorize: 100%|██████████| 100/100 [00:50<00:00, 2.00it/s] Applying step predicted_category: 100%|██████████| 100/100 [00:00<00:00, 25426.19it/s]
name | description | category | brand.name | category_new | __short_description__ | short_description | category1 | category2 | category3 | category4 | category5 | __categorize__ | category_index | predicted_category | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | EnGauge Deluxe Bedframe | Introducing the Engauge Deluxe Bedframe - the ... | Furniture > Beds & Headboards > Bedframes | NaN | Beds & Headboards > Bedframes | {'input_tokens': 313, 'output_tokens': 52, 'in... | Introducing the EnGauge Deluxe Bedframe, a stu... | Beds & Headboards > Bedframes | Beds & Headboards > Beds | Beds & Headboards > Headboards | Mattresses & Box Springs > Mattresses | Sets > Bedroom Furniture Sets | {'input_tokens': 193, 'output_tokens': 10, 'in... | 1 | Beds & Headboards > Bedframes |
1 | Sparrow & Wren Sullivan King Channel-Stitched ... | 85"L x 83"W x 56"H | Total weight: 150 lbs. | ... | Furniture > Beds & Headboards > Beds | Sparrow & Wren | Beds & Headboards > Beds | {'input_tokens': 169, 'output_tokens': 63, 'in... | The Sparrow & Wren Sullivan King Channel-Stitc... | Beds & Headboards > Headboards | Beds & Headboards > Beds | Beds & Headboards > Bedframes | Kids Beds & Headboards > Kid's Beds | Mattresses & Box Springs > Mattresses | {'input_tokens': 207, 'output_tokens': 10, 'in... | 2 | Beds & Headboards > Beds |
2 | Queen Bed With Frame | Dimensions:Head Board -49H x 63.75W x 1.5DFoot... | Furniture > Beds & Headboards > Beds | Hillsdale | Beds & Headboards > Beds | {'input_tokens': 124, 'output_tokens': 55, 'in... | Queen Bed With Frame featuring a Head Board me... | Beds & Headboards > Bedframes | Beds & Headboards > Headboards | Beds & Headboards > Beds | Kids Beds & Headboards > Kid's Beds | Sets > Bedroom Furniture Sets | {'input_tokens': 197, 'output_tokens': 10, 'in... | 3 | Beds & Headboards > Beds |
3 | Dylan Queen Bed | Add a touch of a modern farmhouse to your bedr... | Furniture > Beds & Headboards > Beds | NaN | Beds & Headboards > Beds | {'input_tokens': 140, 'output_tokens': 40, 'in... | Add a touch of modern farmhouse charm to your ... | Beds & Headboards > Headboards | Beds & Headboards > Beds | Beds & Headboards > Bedframes | Sets > Bedroom Furniture Sets | Kids Beds & Headboards > Kid's Beds | {'input_tokens': 182, 'output_tokens': 10, 'in... | 2 | Beds & Headboards > Beds |
4 | Sparrow & Wren Mara Full Diamond-Tufted Bed | 78"L x 56"W x 51"H | Total weight: 130 lbs. | ... | Furniture > Beds & Headboards > Beds | Sparrow & Wren | Beds & Headboards > Beds | {'input_tokens': 168, 'output_tokens': 54, 'in... | The Sparrow & Wren Mara Full Diamond-Tufted Be... | Beds & Headboards > Headboards | Beds & Headboards > Beds | Beds & Headboards > Bedframes | Mattresses & Box Springs > Mattresses | Sets > Bedroom Furniture Sets | {'input_tokens': 194, 'output_tokens': 10, 'in... | 2 | Beds & Headboards > Beds |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
95 | Modway Melanie Tufted Button Upholstered Fabri... | Twin | Clean lines, a straightforward profile,... | Furniture > Beds & Headboards > Beds | Modway | Beds & Headboards > Beds | {'input_tokens': 225, 'output_tokens': 77, 'in... | The Modway Melanie Tufted Button Upholstered F... | Beds & Headboards > Headboards | Beds & Headboards > Beds | Mattresses & Box Springs > Mattresses | Beds & Headboards > Bedframes | Sets > Bedroom Furniture Sets | {'input_tokens': 218, 'output_tokens': 10, 'in... | 2 | Beds & Headboards > Beds |
96 | Concord Queen Panel Bed | Looking for a new bed that has it all? Check o... | Furniture > Beds & Headboards > Beds | Daniel's Amish | Beds & Headboards > Beds | {'input_tokens': 205, 'output_tokens': 55, 'in... | The Concord Queen Panel Bed is a contemporary ... | Beds & Headboards > Headboards | Beds & Headboards > Beds | Beds & Headboards > Bedframes | Kids Beds & Headboards > Kid's Beds | Sets > Bedroom Furniture Sets | {'input_tokens': 197, 'output_tokens': 11, 'in... | 2 | Beds & Headboards > Beds |
97 | Sparrow & Wren Myers King Bed | Dimensions: 85"L x 82"W x 56"H | Headboard hei... | Furniture > Beds & Headboards > Beds | Sparrow & Wren | Beds & Headboards > Beds | {'input_tokens': 271, 'output_tokens': 64, 'in... | The Sparrow & Wren Myers King Bed is a luxurio... | Beds & Headboards > Headboards | Beds & Headboards > Beds | Beds & Headboards > Bedframes | Kids Beds & Headboards > Kid's Beds | Mattresses & Box Springs > Mattresses | {'input_tokens': 209, 'output_tokens': 10, 'in... | 2 | Beds & Headboards > Beds |
98 | Loden Beige 3 Pc Queen Upholstered Bed with 2 ... | A classic design and sophisticated silhouette ... | Furniture > Beds & Headboards > Beds | Rooms To Go | Beds & Headboards > Beds | {'input_tokens': 181, 'output_tokens': 56, 'in... | The Loden Beige 3 Pc Queen Upholstered Bed wit... | Beds & Headboards > Headboards | Beds & Headboards > Beds | Storage > Dressers | Storage > Nightstands | Beds & Headboards > Bedframes | {'input_tokens': 192, 'output_tokens': 10, 'in... | 2 | Beds & Headboards > Beds |
99 | Hempstead Captain Bed in Graystone by A-America | Hempstead Captain Bed | Furniture > Beds & Headboards > Beds | A-America | Beds & Headboards > Beds | {'input_tokens': 97, 'output_tokens': 33, 'inp... | The Hempstead Captain Bed in Graystone by A-Am... | Beds & Headboards > Headboards | Beds & Headboards > Beds | Beds & Headboards > Bedframes | Sets > Bedroom Furniture Sets | Kids Beds & Headboards > Kid's Beds | {'input_tokens': 175, 'output_tokens': 10, 'in... | 2 | Beds & Headboards > Beds |
100 rows × 15 columns
print(categorizer.statistics)
+---------------+-------------------------------+
| score         | 0.91                          |
+---------------+-------------------------------+
| input_tokens  | {'gpt-3.5-turbo-0125': 39747} |
+---------------+-------------------------------+
| output_tokens | {'gpt-3.5-turbo-0125': 6728}  |
+---------------+-------------------------------+
| input_cost    | $0.019873500000000002         |
+---------------+-------------------------------+
| output_cost   | $0.010092000000000002         |
+---------------+-------------------------------+
| num_success   | 100                           |
+---------------+-------------------------------+
| num_failure   | 0                             |
+---------------+-------------------------------+
| total_latency | 191.24415755318478            |
+---------------+-------------------------------+
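As a rough back-of-the-envelope check, here's a hedged sketch that extrapolates per-item cost from this run, using the same statistics attributes (input_cost, output_cost) accessed in the comparison further down.
stats = categorizer.statistics
cost_per_item = (stats.input_cost + stats.output_cost) / len(test_df100)
print(f"Cost per item: ${cost_per_item:.6f}")
print(f"Estimated cost per 1,000 items: ${cost_per_item * 1000:.2f}")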
At current gpt-3.5-turbo pricing, this batch of 100 requests cost about $0.03 and took a little over three minutes to run, for 91% accuracy. Let's see how hierarchical prompting does.
Approach 2: hierarchical prompting¶
Next we want to try forgoing embeddings altogether and simply stuffing the categories into the prompt. There are too many categories to do this in one go, but we can use the fact that our categories are hierarchical and take a step-by-step approach.
- LLM: given product name, description, and first level categories, pick the best one.
- LLM: given product name, description, and second level categories, pick the best one.
We may want to iterate a bit on this process. For example, we may want to use one model in step 1 and a different model in step 2. Superpipe
makes this type of hyperparameter tuning easy and robust.
In our first step we're just asking the model to pick the right top-level category. This is a relatively easy task if the categories are non-overlapping, but it can be very difficult if multiple categories could plausibly be correct. We'll only know by trying it and inspecting the misses.
first_level_categories = list(lookup_table.keys())
def first_level_category_prompt(row):
i = len(first_level_categories)
return f"""
You are given a product name, description and {i} options for the product's top level category.
Pick the index of the most accurate category.
The index must be between 1 and {i}.
Product description: {row['description']}
Product name: {row['name']}
Categories:
{first_level_categories}
"""
class FirstLevelCategoryIndex(BaseModel):
first_category_index: int = Field(description="The index of the most accurate first level category")
first_level_category_step = steps.LLMStructuredStep(
prompt=first_level_category_prompt,
model=models.gpt35,
out_schema=FirstLevelCategoryIndex,
name="first_categorize"
)
select_first_category_step = steps.CustomStep(
transform=lambda row: first_level_categories[row["first_category_index"] - 1],
name="predicted_first_category"
)
Next we'll give the second level of the taxonomy to the model to classify. Just as before, we're predicting the index to make sure our final output is a valid category.
def second_level_category_prompt(row):
second_level_categories = list(lookup_table[row['predicted_first_category']])
i = len(second_level_categories)
return f"""
You are given a product name, description, first level category
and {i} options for the product's second level category.
Pick the index of the most accurate category.
The index must be between 1 and {i}.
Product description: {row['description']}
Product name: {row['name']}
First level category: {row['predicted_first_category']}
Categories:
{second_level_categories}
"""
class SecondLevelCategoryIndex(BaseModel):
second_category_index: int = Field(description="The index of the most accurate second level category")
second_level_category_step = steps.LLMStructuredStep(
prompt=second_level_category_prompt,
model=models.gpt35,
out_schema=SecondLevelCategoryIndex,
name="second_categorize"
)
select_second_category_step = steps.CustomStep(
transform=lambda row: list(lookup_table[row['predicted_first_category']])[row["second_category_index"] - 1],
name="predicted_second_category"
)
Let's combine our results so we can properly compare to our ground truth column.
combine_taxonomy_step = steps.CustomStep(
transform=lambda row: f"{row['predicted_first_category']} > {row['predicted_second_category']}",
name='combine_taxonomy'
)
test_df2 = df.head(5).copy()
evaluate2 = lambda row: row['combine_taxonomy'].lower() == row['category_new'].lower()
categorizer_llm = pipeline.Pipeline([
first_level_category_step,
select_first_category_step,
second_level_category_step,
select_second_category_step,
combine_taxonomy_step
], evaluation_fn=evaluate2)
categorizer_llm.run(test_df2)
Applying step first_categorize: 100%|██████████| 5/5 [00:02<00:00, 2.11it/s] Applying step predicted_first_category: 100%|██████████| 5/5 [00:00<00:00, 6659.74it/s] Applying step second_categorize: 100%|██████████| 5/5 [00:02<00:00, 2.03it/s] Applying step predicted_second_category: 100%|██████████| 5/5 [00:00<00:00, 6288.31it/s] Applying step combine_taxonomy: 100%|██████████| 5/5 [00:00<00:00, 5734.62it/s]
name | description | category | brand.name | category_new | __first_categorize__ | first_category_index | predicted_first_category | __second_categorize__ | second_category_index | predicted_second_category | combine_taxonomy | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | EnGauge Deluxe Bedframe | Introducing the Engauge Deluxe Bedframe - the ... | Furniture > Beds & Headboards > Bedframes | NaN | Beds & Headboards > Bedframes | {'input_tokens': 419, 'output_tokens': 11, 'in... | 1 | Beds & Headboards | {'input_tokens': 372, 'output_tokens': 11, 'in... | 2 | Bedframes | Beds & Headboards > Bedframes |
1 | Sparrow & Wren Sullivan King Channel-Stitched ... | 85"L x 83"W x 56"H | Total weight: 150 lbs. | ... | Furniture > Beds & Headboards > Beds | Sparrow & Wren | Beds & Headboards > Beds | {'input_tokens': 275, 'output_tokens': 11, 'in... | 1 | Beds & Headboards | {'input_tokens': 228, 'output_tokens': 11, 'in... | 3 | Headboards | Beds & Headboards > Headboards |
2 | Queen Bed With Frame | Dimensions:Head Board -49H x 63.75W x 1.5DFoot... | Furniture > Beds & Headboards > Beds | Hillsdale | Beds & Headboards > Beds | {'input_tokens': 230, 'output_tokens': 11, 'in... | 1 | Beds & Headboards | {'input_tokens': 183, 'output_tokens': 11, 'in... | 1 | Beds | Beds & Headboards > Beds |
3 | Dylan Queen Bed | Add a touch of a modern farmhouse to your bedr... | Furniture > Beds & Headboards > Beds | NaN | Beds & Headboards > Beds | {'input_tokens': 246, 'output_tokens': 11, 'in... | 1 | Beds & Headboards | {'input_tokens': 199, 'output_tokens': 11, 'in... | 1 | Beds | Beds & Headboards > Beds |
4 | Sparrow & Wren Mara Full Diamond-Tufted Bed | 78"L x 56"W x 51"H | Total weight: 130 lbs. | ... | Furniture > Beds & Headboards > Beds | Sparrow & Wren | Beds & Headboards > Beds | {'input_tokens': 274, 'output_tokens': 11, 'in... | 1 | Beds & Headboards | {'input_tokens': 227, 'output_tokens': 11, 'in... | 1 | Beds | Beds & Headboards > Beds |
print(categorizer_llm.statistics)
+---------------+------------------------------+
| score         | 0.8                          |
+---------------+------------------------------+
| input_tokens  | {'gpt-3.5-turbo-0125': 5306} |
+---------------+------------------------------+
| output_tokens | {'gpt-3.5-turbo-0125': 221}  |
+---------------+------------------------------+
| input_cost    | $0.002653                    |
+---------------+------------------------------+
| output_cost   | $0.0003315                   |
+---------------+------------------------------+
| num_success   | 5                            |
+---------------+------------------------------+
| num_failure   | 0                            |
+---------------+------------------------------+
| total_latency | 10.390514832979534           |
+---------------+------------------------------+
It works. Let's run it on some more data, like we did before.
test_df2_100 = df.head(100).copy()
categorizer_llm.run(test_df2_100)
Applying step first_categorize: 100%|██████████| 100/100 [00:58<00:00, 1.71it/s] Applying step predicted_first_category: 100%|██████████| 100/100 [00:00<00:00, 22609.58it/s] Applying step second_categorize: 100%|██████████| 100/100 [03:06<00:00, 1.86s/it] Applying step predicted_second_category: 100%|██████████| 100/100 [00:00<00:00, 27186.31it/s] Applying step combine_taxonomy: 100%|██████████| 100/100 [00:00<00:00, 32531.64it/s]
name | description | category | brand.name | category_new | __first_categorize__ | first_category_index | predicted_first_category | __second_categorize__ | second_category_index | predicted_second_category | combine_taxonomy | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | EnGauge Deluxe Bedframe | Introducing the Engauge Deluxe Bedframe - the ... | Furniture > Beds & Headboards > Bedframes | NaN | Beds & Headboards > Bedframes | {'input_tokens': 419, 'output_tokens': 11, 'in... | 1 | Beds & Headboards | {'input_tokens': 372, 'output_tokens': 11, 'in... | 2 | Bedframes | Beds & Headboards > Bedframes |
1 | Sparrow & Wren Sullivan King Channel-Stitched ... | 85"L x 83"W x 56"H | Total weight: 150 lbs. | ... | Furniture > Beds & Headboards > Beds | Sparrow & Wren | Beds & Headboards > Beds | {'input_tokens': 275, 'output_tokens': 11, 'in... | 1 | Beds & Headboards | {'input_tokens': 228, 'output_tokens': 11, 'in... | 3 | Headboards | Beds & Headboards > Headboards |
2 | Queen Bed With Frame | Dimensions:Head Board -49H x 63.75W x 1.5DFoot... | Furniture > Beds & Headboards > Beds | Hillsdale | Beds & Headboards > Beds | {'input_tokens': 230, 'output_tokens': 11, 'in... | 1 | Beds & Headboards | {'input_tokens': 183, 'output_tokens': 11, 'in... | 3 | Headboards | Beds & Headboards > Headboards |
3 | Dylan Queen Bed | Add a touch of a modern farmhouse to your bedr... | Furniture > Beds & Headboards > Beds | NaN | Beds & Headboards > Beds | {'input_tokens': 246, 'output_tokens': 11, 'in... | 1 | Beds & Headboards | {'input_tokens': 199, 'output_tokens': 11, 'in... | 1 | Beds | Beds & Headboards > Beds |
4 | Sparrow & Wren Mara Full Diamond-Tufted Bed | 78"L x 56"W x 51"H | Total weight: 130 lbs. | ... | Furniture > Beds & Headboards > Beds | Sparrow & Wren | Beds & Headboards > Beds | {'input_tokens': 274, 'output_tokens': 11, 'in... | 1 | Beds & Headboards | {'input_tokens': 227, 'output_tokens': 11, 'in... | 1 | Beds | Beds & Headboards > Beds |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
95 | Modway Melanie Tufted Button Upholstered Fabri... | Twin | Clean lines, a straightforward profile,... | Furniture > Beds & Headboards > Beds | Modway | Beds & Headboards > Beds | {'input_tokens': 331, 'output_tokens': 11, 'in... | 1 | Beds & Headboards | {'input_tokens': 284, 'output_tokens': 11, 'in... | 1 | Beds | Beds & Headboards > Beds |
96 | Concord Queen Panel Bed | Looking for a new bed that has it all? Check o... | Furniture > Beds & Headboards > Beds | Daniel's Amish | Beds & Headboards > Beds | {'input_tokens': 311, 'output_tokens': 11, 'in... | 1 | Beds & Headboards | {'input_tokens': 264, 'output_tokens': 11, 'in... | 1 | Beds | Beds & Headboards > Beds |
97 | Sparrow & Wren Myers King Bed | Dimensions: 85"L x 82"W x 56"H | Headboard hei... | Furniture > Beds & Headboards > Beds | Sparrow & Wren | Beds & Headboards > Beds | {'input_tokens': 377, 'output_tokens': 11, 'in... | 1 | Beds & Headboards | {'input_tokens': 330, 'output_tokens': 11, 'in... | 3 | Headboards | Beds & Headboards > Headboards |
98 | Loden Beige 3 Pc Queen Upholstered Bed with 2 ... | A classic design and sophisticated silhouette ... | Furniture > Beds & Headboards > Beds | Rooms To Go | Beds & Headboards > Beds | {'input_tokens': 287, 'output_tokens': 11, 'in... | 1 | Beds & Headboards | {'input_tokens': 240, 'output_tokens': 11, 'in... | 1 | Beds | Beds & Headboards > Beds |
99 | Hempstead Captain Bed in Graystone by A-America | Hempstead Captain Bed | Furniture > Beds & Headboards > Beds | A-America | Beds & Headboards > Beds | {'input_tokens': 203, 'output_tokens': 11, 'in... | 1 | Beds & Headboards | {'input_tokens': 156, 'output_tokens': 11, 'in... | 1 | Beds | Beds & Headboards > Beds |
100 rows × 12 columns
Let's compare approach 1 to approach 2.
print(categorizer.statistics)
print(f"Total cost: ${categorizer.statistics.input_cost + categorizer.statistics.output_cost}")
print(categorizer_llm.statistics)
print(f"Total cost: ${categorizer_llm.statistics.input_cost + categorizer_llm.statistics.output_cost}")
+---------------+-------------------------------+
| score         | 0.91                          |
+---------------+-------------------------------+
| input_tokens  | {'gpt-3.5-turbo-0125': 39747} |
+---------------+-------------------------------+
| output_tokens | {'gpt-3.5-turbo-0125': 6728}  |
+---------------+-------------------------------+
| input_cost    | $0.019873500000000002         |
+---------------+-------------------------------+
| output_cost   | $0.010092000000000002         |
+---------------+-------------------------------+
| num_success   | 100                           |
+---------------+-------------------------------+
| num_failure   | 0                             |
+---------------+-------------------------------+
| total_latency | 191.24415755318478            |
+---------------+-------------------------------+
Total cost: $0.029965500000000006
+---------------+-------------------------------+
| score         | 0.76                          |
+---------------+-------------------------------+
| input_tokens  | {'gpt-3.5-turbo-0125': 58359} |
+---------------+-------------------------------+
| output_tokens | {'gpt-3.5-turbo-0125': 2651}  |
+---------------+-------------------------------+
| input_cost    | $0.029179500000000004         |
+---------------+-------------------------------+
| output_cost   | $0.003976500000000004         |
+---------------+-------------------------------+
| num_success   | 100                           |
+---------------+-------------------------------+
| num_failure   | 0                             |
+---------------+-------------------------------+
| total_latency | 254.63680869879317            |
+---------------+-------------------------------+
Total cost: $0.033156000000000005
Our hierarchical approach cost slightly more at about $0.033 per 100 rows, and on this run it was also slower and less accurate (76% vs. 91%). However, we're not done just yet. The power of Superpipe
is that we can easily try many different permutations of our pipeline using a grid search. There might be a better pipeline out there.
Grid search¶
Our first pipeline has three steps we want to search over.
- Short description: vary the model
- Embedding search: vary the number of results
- Categorize: vary the model
It's not clear which permutation will work the best so we'll try all of them.
from superpipe import grid_search
params_grid = {
short_description_step.name: {
'model': [models.gpt35, models.gpt4],
},
embedding_search_step.name: {
'k': [3, 5, 7],
},
categorize_step.name: {
'model': [models.gpt35, models.gpt4],
},
}
small_df = df.head(30).copy()
search_embeddings = grid_search.GridSearch(categorizer, params_grid)
search_embeddings.run(small_df)
Iteration 1 of 12 Params: {'short_description': {'model': 'gpt-3.5-turbo-0125'}, 'embedding_search': {'k': 3}, 'categorize': {'model': 'gpt-3.5-turbo-0125'}}
Applying step short_description: 100%|██████████| 30/30 [00:46<00:00, 1.55s/it] Applying step embedding_search: 100%|██████████| 30/30 [00:05<00:00, 5.70it/s] Applying step categorize: 100%|██████████| 30/30 [00:16<00:00, 1.87it/s] Applying step predicted_category: 100%|██████████| 30/30 [00:00<00:00, 27100.82it/s]
Result: {'short_description__model': 'gpt-3.5-turbo-0125', 'embedding_search__k': 3, 'categorize__model': 'gpt-3.5-turbo-0125', 'score': 0.8666666666666667, 'input_cost': 0.005675499999999998, 'output_cost': 0.0032055, 'total_latency': 62.25483133213129, 'input_tokens': defaultdict(<class 'int'>, {'gpt-3.5-turbo-0125': 11351}), 'output_tokens': defaultdict(<class 'int'>, {'gpt-3.5-turbo-0125': 2137}), 'num_success': 30, 'num_failure': 0, 'index': -6675265432874878197} Iteration 2 of 12 Params: {'short_description': {'model': 'gpt-3.5-turbo-0125'}, 'embedding_search': {'k': 3}, 'categorize': {'model': 'gpt-4-turbo-preview'}}
Applying step short_description: 100%|██████████| 30/30 [00:45<00:00, 1.53s/it] Applying step embedding_search: 100%|██████████| 30/30 [00:04<00:00, 6.68it/s] Applying step categorize: 100%|██████████| 30/30 [02:08<00:00, 4.29s/it] Applying step predicted_category: 100%|██████████| 30/30 [00:00<00:00, 28384.64it/s]
Result: {'short_description__model': 'gpt-3.5-turbo-0125', 'embedding_search__k': 3, 'categorize__model': 'gpt-4-turbo-preview', 'score': 0.9333333333333333, 'input_cost': 0.057895999999999996, 'output_cost': 0.0117495, 'total_latency': 174.37851832807064, 'input_tokens': defaultdict(<class 'int'>, {'gpt-3.5-turbo-0125': 5852, 'gpt-4-turbo-preview': 5497}), 'output_tokens': defaultdict(<class 'int'>, {'gpt-3.5-turbo-0125': 1833, 'gpt-4-turbo-preview': 300}), 'num_success': 30, 'num_failure': 0, 'index': 5054694111921705162} Iteration 3 of 12 Params: {'short_description': {'model': 'gpt-3.5-turbo-0125'}, 'embedding_search': {'k': 5}, 'categorize': {'model': 'gpt-3.5-turbo-0125'}}
Applying step short_description: 100%|██████████| 30/30 [00:46<00:00, 1.53s/it] Applying step embedding_search: 100%|██████████| 30/30 [00:04<00:00, 7.15it/s] Applying step categorize: 100%|██████████| 30/30 [00:15<00:00, 1.89it/s] Applying step predicted_category: 100%|██████████| 30/30 [00:00<00:00, 20226.51it/s]
Result: {'short_description__model': 'gpt-3.5-turbo-0125', 'embedding_search__k': 5, 'categorize__model': 'gpt-3.5-turbo-0125', 'score': 0.9, 'input_cost': 0.005970999999999999, 'output_cost': 0.003195, 'total_latency': 61.76637146304711, 'input_tokens': defaultdict(<class 'int'>, {'gpt-3.5-turbo-0125': 11942}), 'output_tokens': defaultdict(<class 'int'>, {'gpt-3.5-turbo-0125': 2130}), 'num_success': 30, 'num_failure': 0, 'index': -4607444568377834415} Iteration 4 of 12 Params: {'short_description': {'model': 'gpt-3.5-turbo-0125'}, 'embedding_search': {'k': 5}, 'categorize': {'model': 'gpt-4-turbo-preview'}}
Applying step short_description: 100%|██████████| 30/30 [00:48<00:00, 1.63s/it] Applying step embedding_search: 100%|██████████| 30/30 [00:04<00:00, 7.20it/s] Applying step categorize: 100%|██████████| 30/30 [00:47<00:00, 1.58s/it] Applying step predicted_category: 100%|██████████| 30/30 [00:00<00:00, 13654.81it/s]
Result: {'short_description__model': 'gpt-3.5-turbo-0125', 'embedding_search__k': 5, 'categorize__model': 'gpt-4-turbo-preview', 'score': 0.9666666666666667, 'input_cost': 0.062886, 'output_cost': 0.011592, 'total_latency': 96.02623478500755, 'input_tokens': defaultdict(<class 'int'>, {'gpt-3.5-turbo-0125': 5852, 'gpt-4-turbo-preview': 5996}), 'output_tokens': defaultdict(<class 'int'>, {'gpt-3.5-turbo-0125': 1728, 'gpt-4-turbo-preview': 300}), 'num_success': 30, 'num_failure': 0, 'index': -8503795277776717559} Iteration 5 of 12 Params: {'short_description': {'model': 'gpt-3.5-turbo-0125'}, 'embedding_search': {'k': 7}, 'categorize': {'model': 'gpt-3.5-turbo-0125'}}
Applying step short_description: 100%|██████████| 30/30 [00:43<00:00, 1.45s/it] Applying step embedding_search: 100%|██████████| 30/30 [00:04<00:00, 6.99it/s] Applying step categorize: 100%|██████████| 30/30 [00:18<00:00, 1.65it/s] Applying step predicted_category: 100%|██████████| 30/30 [00:00<00:00, 13476.40it/s]
Result: {'short_description__model': 'gpt-3.5-turbo-0125', 'embedding_search__k': 7, 'categorize__model': 'gpt-3.5-turbo-0125', 'score': 0.9, 'input_cost': 0.006242499999999998, 'output_cost': 0.003072, 'total_latency': 61.58326062496053, 'input_tokens': defaultdict(<class 'int'>, {'gpt-3.5-turbo-0125': 12485}), 'output_tokens': defaultdict(<class 'int'>, {'gpt-3.5-turbo-0125': 2048}), 'num_success': 30, 'num_failure': 0, 'index': 4015312520520374081} Iteration 6 of 12 Params: {'short_description': {'model': 'gpt-3.5-turbo-0125'}, 'embedding_search': {'k': 7}, 'categorize': {'model': 'gpt-4-turbo-preview'}}
Applying step short_description: 100%|██████████| 30/30 [00:41<00:00, 1.39s/it] Applying step embedding_search: 100%|██████████| 30/30 [00:04<00:00, 7.16it/s] Applying step categorize: 100%|██████████| 30/30 [00:51<00:00, 1.70s/it] Applying step predicted_category: 100%|██████████| 30/30 [00:00<00:00, 19828.10it/s]
Result: {'short_description__model': 'gpt-3.5-turbo-0125', 'embedding_search__k': 7, 'categorize__model': 'gpt-4-turbo-preview', 'score': 0.9333333333333333, 'input_cost': 0.06853599999999999, 'output_cost': 0.0115485, 'total_latency': 92.52336829315755, 'input_tokens': defaultdict(<class 'int'>, {'gpt-3.5-turbo-0125': 5852, 'gpt-4-turbo-preview': 6561}), 'output_tokens': defaultdict(<class 'int'>, {'gpt-3.5-turbo-0125': 1699, 'gpt-4-turbo-preview': 300}), 'num_success': 30, 'num_failure': 0, 'index': -6042391003316854449} Iteration 7 of 12 Params: {'short_description': {'model': 'gpt-4-turbo-preview'}, 'embedding_search': {'k': 3}, 'categorize': {'model': 'gpt-3.5-turbo-0125'}}
Applying step short_description: 100%|██████████| 30/30 [02:13<00:00, 4.45s/it] Applying step embedding_search: 100%|██████████| 30/30 [00:04<00:00, 6.62it/s] Applying step categorize: 100%|██████████| 30/30 [00:21<00:00, 1.38it/s] Applying step predicted_category: 100%|██████████| 30/30 [00:00<00:00, 14540.00it/s]
Result: {'short_description__model': 'gpt-4-turbo-preview', 'embedding_search__k': 3, 'categorize__model': 'gpt-3.5-turbo-0125', 'score': 0.6666666666666666, 'input_cost': 0.061239999999999996, 'output_cost': 0.05397149999999999, 'total_latency': 155.0610504578508, 'input_tokens': defaultdict(<class 'int'>, {'gpt-4-turbo-preview': 5852, 'gpt-3.5-turbo-0125': 5440}), 'output_tokens': defaultdict(<class 'int'>, {'gpt-4-turbo-preview': 1784, 'gpt-3.5-turbo-0125': 301}), 'num_success': 30, 'num_failure': 0, 'index': -3802806156793363307} Iteration 8 of 12 Params: {'short_description': {'model': 'gpt-4-turbo-preview'}, 'embedding_search': {'k': 3}, 'categorize': {'model': 'gpt-4-turbo-preview'}}
Applying step short_description: 100%|██████████| 30/30 [02:10<00:00, 4.35s/it] Applying step embedding_search: 100%|██████████| 30/30 [00:04<00:00, 7.09it/s] Applying step categorize: 100%|██████████| 30/30 [00:39<00:00, 1.31s/it] Applying step predicted_category: 100%|██████████| 30/30 [00:00<00:00, 13742.80it/s]
Result: {'short_description__model': 'gpt-4-turbo-preview', 'embedding_search__k': 3, 'categorize__model': 'gpt-4-turbo-preview', 'score': 0.9, 'input_cost': 0.11293, 'output_cost': 0.062310000000000004, 'total_latency': 169.59190933196805, 'input_tokens': defaultdict(<class 'int'>, {'gpt-4-turbo-preview': 11293}), 'output_tokens': defaultdict(<class 'int'>, {'gpt-4-turbo-preview': 2077}), 'num_success': 30, 'num_failure': 0, 'index': -3569261079577541644} Iteration 9 of 12 Params: {'short_description': {'model': 'gpt-4-turbo-preview'}, 'embedding_search': {'k': 5}, 'categorize': {'model': 'gpt-3.5-turbo-0125'}}
Applying step short_description: 100%|██████████| 30/30 [01:57<00:00, 3.91s/it] Applying step embedding_search: 100%|██████████| 30/30 [00:04<00:00, 7.20it/s] Applying step categorize: 100%|██████████| 30/30 [00:14<00:00, 2.04it/s] Applying step predicted_category: 100%|██████████| 30/30 [00:00<00:00, 18842.34it/s]
Result: {'short_description__model': 'gpt-4-turbo-preview', 'embedding_search__k': 5, 'categorize__model': 'gpt-3.5-turbo-0125', 'score': 0.8666666666666667, 'input_cost': 0.0615815, 'output_cost': 0.055741500000000006, 'total_latency': 131.90737874808838, 'input_tokens': defaultdict(<class 'int'>, {'gpt-4-turbo-preview': 5852, 'gpt-3.5-turbo-0125': 6123}), 'output_tokens': defaultdict(<class 'int'>, {'gpt-4-turbo-preview': 1843, 'gpt-3.5-turbo-0125': 301}), 'num_success': 30, 'num_failure': 0, 'index': 9106143806313371546} Iteration 10 of 12 Params: {'short_description': {'model': 'gpt-4-turbo-preview'}, 'embedding_search': {'k': 5}, 'categorize': {'model': 'gpt-4-turbo-preview'}}
Applying step short_description: 100%|██████████| 30/30 [02:04<00:00, 4.14s/it] Applying step embedding_search: 100%|██████████| 30/30 [00:04<00:00, 6.91it/s] Applying step categorize: 100%|██████████| 30/30 [00:41<00:00, 1.38s/it] Applying step predicted_category: 100%|██████████| 30/30 [00:00<00:00, 17329.45it/s]
Result: {'short_description__model': 'gpt-4-turbo-preview', 'embedding_search__k': 5, 'categorize__model': 'gpt-4-turbo-preview', 'score': 0.9333333333333333, 'input_cost': 0.11912999999999999, 'output_cost': 0.06251999999999999, 'total_latency': 165.1544967039954, 'input_tokens': defaultdict(<class 'int'>, {'gpt-4-turbo-preview': 11913}), 'output_tokens': defaultdict(<class 'int'>, {'gpt-4-turbo-preview': 2084}), 'num_success': 30, 'num_failure': 0, 'index': 2557837101084672621} Iteration 11 of 12 Params: {'short_description': {'model': 'gpt-4-turbo-preview'}, 'embedding_search': {'k': 7}, 'categorize': {'model': 'gpt-3.5-turbo-0125'}}
Applying step short_description: 100%|██████████| 30/30 [01:51<00:00, 3.71s/it] Applying step embedding_search: 100%|██████████| 30/30 [00:08<00:00, 3.53it/s] Applying step categorize: 100%|██████████| 30/30 [00:15<00:00, 1.99it/s] Applying step predicted_category: 100%|██████████| 30/30 [00:00<00:00, 19242.87it/s]
Result: {'short_description__model': 'gpt-4-turbo-preview', 'embedding_search__k': 7, 'categorize__model': 'gpt-3.5-turbo-0125', 'score': 0.8333333333333334, 'input_cost': 0.0618185, 'output_cost': 0.05157, 'total_latency': 126.11481383198407, 'input_tokens': defaultdict(<class 'int'>, {'gpt-4-turbo-preview': 5852, 'gpt-3.5-turbo-0125': 6597}), 'output_tokens': defaultdict(<class 'int'>, {'gpt-4-turbo-preview': 1704, 'gpt-3.5-turbo-0125': 300}), 'num_success': 30, 'num_failure': 0, 'index': -3503115138122502664} Iteration 12 of 12 Params: {'short_description': {'model': 'gpt-4-turbo-preview'}, 'embedding_search': {'k': 7}, 'categorize': {'model': 'gpt-4-turbo-preview'}}
Applying step short_description: 100%|██████████| 30/30 [01:53<00:00, 3.78s/it] Applying step embedding_search: 100%|██████████| 30/30 [00:03<00:00, 8.31it/s] Applying step categorize: 100%|██████████| 30/30 [00:35<00:00, 1.18s/it] Applying step predicted_category: 100%|██████████| 30/30 [00:00<00:00, 15283.51it/s]
Result: {'short_description__model': 'gpt-4-turbo-preview', 'embedding_search__k': 7, 'categorize__model': 'gpt-4-turbo-preview', 'score': 0.9333333333333333, 'input_cost': 0.12448999999999999, 'output_cost': 0.06029999999999999, 'total_latency': 148.47731783005293, 'input_tokens': defaultdict(<class 'int'>, {'gpt-4-turbo-preview': 12449}), 'output_tokens': defaultdict(<class 'int'>, {'gpt-4-turbo-preview': 2010}), 'num_success': 30, 'num_failure': 0, 'index': -3155834270503923271}
 | short_description__model | embedding_search__k | categorize__model | score | input_cost | output_cost | total_latency | num_success | num_failure | index |
---|---|---|---|---|---|---|---|---|---|---|
0 | gpt-3.5-turbo-0125 | 3 | gpt-3.5-turbo-0125 | 0.866667 | 0.005675 | 0.003206 | 62.254831 | 30 | 0 | -6675265432874878197 |
1 | gpt-3.5-turbo-0125 | 3 | gpt-4-turbo-preview | 0.933333 | 0.057896 | 0.011749 | 174.378518 | 30 | 0 | 5054694111921705162 |
2 | gpt-3.5-turbo-0125 | 5 | gpt-3.5-turbo-0125 | 0.900000 | 0.005971 | 0.003195 | 61.766371 | 30 | 0 | -4607444568377834415 |
3 | gpt-3.5-turbo-0125 | 5 | gpt-4-turbo-preview | 0.966667 | 0.062886 | 0.011592 | 96.026235 | 30 | 0 | -8503795277776717559 |
4 | gpt-3.5-turbo-0125 | 7 | gpt-3.5-turbo-0125 | 0.900000 | 0.006242 | 0.003072 | 61.583261 | 30 | 0 | 4015312520520374081 |
5 | gpt-3.5-turbo-0125 | 7 | gpt-4-turbo-preview | 0.933333 | 0.068536 | 0.011548 | 92.523368 | 30 | 0 | -6042391003316854449 |
6 | gpt-4-turbo-preview | 3 | gpt-3.5-turbo-0125 | 0.666667 | 0.061240 | 0.053971 | 155.061050 | 30 | 0 | -3802806156793363307 |
7 | gpt-4-turbo-preview | 3 | gpt-4-turbo-preview | 0.900000 | 0.112930 | 0.062310 | 169.591909 | 30 | 0 | -3569261079577541644 |
8 | gpt-4-turbo-preview | 5 | gpt-3.5-turbo-0125 | 0.866667 | 0.061581 | 0.055742 | 131.907379 | 30 | 0 | 9106143806313371546 |
9 | gpt-4-turbo-preview | 5 | gpt-4-turbo-preview | 0.933333 | 0.119130 | 0.062520 | 165.154497 | 30 | 0 | 2557837101084672621 |
10 | gpt-4-turbo-preview | 7 | gpt-3.5-turbo-0125 | 0.833333 | 0.061818 | 0.051570 | 126.114814 | 30 | 0 | -3503115138122502664 |
11 | gpt-4-turbo-preview | 7 | gpt-4-turbo-preview | 0.933333 | 0.124490 | 0.060300 | 148.477318 | 30 | 0 | -3155834270503923271 |
The results of our grid search are conveniently put into a dataframe for us to review.
It seems that GPT-3.5 is more than sufficient for the description step, and that 5 embedding search results are enough. For the last step, we face a cost/latency vs. accuracy tradeoff between the two models.
This search was only run on 30 rows, so we'd want to run it more extensively before making production decisions, but we can now narrow down our search space with reasonable confidence.
Let's do the same for our hierarchical prompting approach. This time we'll just vary the model selection for each step.
params_grid = {
first_level_category_step.name: {
'model': [models.gpt35, models.gpt4],
},
second_level_category_step.name: {
'model': [models.gpt35, models.gpt4],
},
}
small_df2 = df.head(30).copy()
search_llm = grid_search.GridSearch(categorizer_llm, params_grid)
search_llm.run(small_df2)
Iteration 1 of 4 Params: {'first_categorize': {'model': 'gpt-3.5-turbo-0125'}, 'second_categorize': {'model': 'gpt-3.5-turbo-0125'}}
Applying step first_categorize: 100%|██████████| 30/30 [00:20<00:00, 1.45it/s] Applying step predicted_first_category: 100%|██████████| 30/30 [00:00<00:00, 6613.54it/s] Applying step second_categorize: 100%|██████████| 30/30 [00:16<00:00, 1.80it/s] Applying step predicted_second_category: 100%|██████████| 30/30 [00:00<00:00, 6980.04it/s] Applying step combine_taxonomy: 100%|██████████| 30/30 [00:00<00:00, 20631.11it/s]
Result: {'first_categorize__model': 'gpt-3.5-turbo-0125', 'second_categorize__model': 'gpt-3.5-turbo-0125', 'score': 0.7666666666666667, 'input_cost': 0.008323999999999998, 'output_cost': 0.0009915000000000006, 'total_latency': 37.14652425216627, 'input_tokens': defaultdict(<class 'int'>, {'gpt-3.5-turbo-0125': 16648}), 'output_tokens': defaultdict(<class 'int'>, {'gpt-3.5-turbo-0125': 661}), 'num_success': 30, 'num_failure': 0, 'index': 8291905896722117770} Iteration 2 of 4 Params: {'first_categorize': {'model': 'gpt-3.5-turbo-0125'}, 'second_categorize': {'model': 'gpt-4-turbo-preview'}}
Applying step first_categorize: 100%|██████████| 30/30 [00:15<00:00, 1.96it/s] Applying step predicted_first_category: 100%|██████████| 30/30 [00:00<00:00, 10063.11it/s] Applying step second_categorize: 100%|██████████| 30/30 [00:45<00:00, 1.50s/it] Applying step predicted_second_category: 100%|██████████| 30/30 [00:00<00:00, 6388.56it/s] Applying step combine_taxonomy: 100%|██████████| 30/30 [00:00<00:00, 25450.87it/s]
Result: {'first_categorize__model': 'gpt-3.5-turbo-0125', 'second_categorize__model': 'gpt-4-turbo-preview', 'score': 0.9333333333333333, 'input_cost': 0.080676, 'output_cost': 0.010305000000000009, 'total_latency': 60.223217750986805, 'input_tokens': defaultdict(<class 'int'>, {'gpt-3.5-turbo-0125': 9032, 'gpt-4-turbo-preview': 7616}), 'output_tokens': defaultdict(<class 'int'>, {'gpt-3.5-turbo-0125': 330, 'gpt-4-turbo-preview': 327}), 'num_success': 30, 'num_failure': 0, 'index': 3072009371341041402} Iteration 3 of 4 Params: {'first_categorize': {'model': 'gpt-4-turbo-preview'}, 'second_categorize': {'model': 'gpt-3.5-turbo-0125'}}
Applying step first_categorize: 100%|██████████| 30/30 [00:59<00:00, 1.97s/it] Applying step predicted_first_category: 100%|██████████| 30/30 [00:00<00:00, 22137.42it/s] Applying step second_categorize: 100%|██████████| 30/30 [00:16<00:00, 1.85it/s] Applying step predicted_second_category: 100%|██████████| 30/30 [00:00<00:00, 8736.31it/s] Applying step combine_taxonomy: 100%|██████████| 30/30 [00:00<00:00, 22832.36it/s]
Result: {'first_categorize__model': 'gpt-4-turbo-preview', 'second_categorize__model': 'gpt-3.5-turbo-0125', 'score': 0.7333333333333333, 'input_cost': 0.094135, 'output_cost': 0.010396500000000008, 'total_latency': 75.15457291598432, 'input_tokens': defaultdict(<class 'int'>, {'gpt-4-turbo-preview': 9032, 'gpt-3.5-turbo-0125': 7630}), 'output_tokens': defaultdict(<class 'int'>, {'gpt-4-turbo-preview': 330, 'gpt-3.5-turbo-0125': 331}), 'num_success': 30, 'num_failure': 0, 'index': -172566906004615760} Iteration 4 of 4 Params: {'first_categorize': {'model': 'gpt-4-turbo-preview'}, 'second_categorize': {'model': 'gpt-4-turbo-preview'}}
Applying step first_categorize: 100%|██████████| 30/30 [00:45<00:00, 1.52s/it] Applying step predicted_first_category: 100%|██████████| 30/30 [00:00<00:00, 12587.95it/s] Applying step second_categorize: 100%|██████████| 30/30 [00:40<00:00, 1.34s/it] Applying step predicted_second_category: 100%|██████████| 30/30 [00:00<00:00, 8019.19it/s] Applying step combine_taxonomy: 100%|██████████| 30/30 [00:00<00:00, 22762.14it/s]
Result: {'first_categorize__model': 'gpt-4-turbo-preview', 'second_categorize__model': 'gpt-4-turbo-preview', 'score': 0.9, 'input_cost': 0.16662, 'output_cost': 0.019800000000000016, 'total_latency': 85.77000208501704, 'input_tokens': defaultdict(<class 'int'>, {'gpt-4-turbo-preview': 16662}), 'output_tokens': defaultdict(<class 'int'>, {'gpt-4-turbo-preview': 660}), 'num_success': 30, 'num_failure': 0, 'index': 7445854736442369664}
 | first_categorize__model | second_categorize__model | score | input_cost | output_cost | total_latency | num_success | num_failure | index |
---|---|---|---|---|---|---|---|---|---|
0 | gpt-3.5-turbo-0125 | gpt-3.5-turbo-0125 | 0.766667 | 0.008324 | 0.000992 | 37.146524 | 30 | 0 | 8291905896722117770 |
1 | gpt-3.5-turbo-0125 | gpt-4-turbo-preview | 0.933333 | 0.080676 | 0.010305 | 60.223218 | 30 | 0 | 3072009371341041402 |
2 | gpt-4-turbo-preview | gpt-3.5-turbo-0125 | 0.733333 | 0.094135 | 0.010397 | 75.154573 | 30 | 0 | -172566906004615760 |
3 | gpt-4-turbo-preview | gpt-4-turbo-preview | 0.900000 | 0.166620 | 0.019800 | 85.770002 | 30 | 0 | 7445854736442369664 |
These results highlight the importance of experimentation and optimization. The GPT-3.5 + GPT-4 hierarchical pipeline scores 93% with relatively low latency (about 60 seconds for 30 rows), while the all-GPT-3.5 hierarchical pipeline is the fastest but least accurate (77%), well behind the comparable all-GPT-3.5 pipeline with 5 embedding candidates (90%).
If we only care about accuracy, it looks like an embeddings-based approach is our best bet. However, we may have other considerations. We're faced with a cost, accuracy, and latency tradeoff with no clear "best" option. Depending on which metric we care about most, we'll choose a different approach, as sketched below. This is a decision we're now empowered to make with our Superpipe pipeline results.
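For example, if we need at least 90% accuracy and otherwise want the cheapest, fastest option, a small helper like this hedged sketch makes the tradeoff explicit. It assumes the grid search rows above have been collected into a pandas DataFrame (results_df is hypothetical) with the score, input_cost, output_cost, and total_latency columns shown in the tables.
def pick_config(results_df, min_score=0.9):
    # Keep configurations that meet the accuracy bar, then choose the cheapest,
    # breaking ties on latency.
    eligible = results_df[results_df["score"] >= min_score].copy()
    eligible["total_cost"] = eligible["input_cost"] + eligible["output_cost"]
    return eligible.sort_values(["total_cost", "total_latency"]).head(1)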