Pondhouse Data AI - Edition 11
Model Distillation - how to get a fast, strong and cost-efficient LLM | Evaluating LLMs before upgrading | How to: Realtime audio transcriptions | New ChatGPT feature: Canvas - a better way of collaborating with LLMs

Hey there,
We’re excited to bring you the 11th edition of our Pondhouse AI newsletter — your source for tips and tricks around AI and LLMs. Whether you want to learn about AI concepts, use AI tools effectively, or see inspiring examples, we’ve got you covered.
Let’s get started!
Cheers, Andreas & Sascha
In today's edition:
News: OpenAI releases model distillation - a technique to create very strong, but fast and small/cost efficient models
Case Study: How we built an AI-driven SEO content recommendation system using only PostgreSQL
Tip of the Week: How to: Realtime audio transcriptions using the new Whisper 3 turbo model
Tool of the Week: Prompt Flow - Microsoft’s solution for evaluating prompts and finding the best models for your use case
Find this Newsletter helpful?
Please forward it to your colleagues and friends - it helps us tremendously.
Top News
OpenAI releases Model Distillation - creating strong, small, cost efficient models for FREE
At its latest Developer Day, OpenAI released its model distillation API.
Model distillation is a model training method where a large, very strong model is used as a “teacher” for a smaller model.

(Image from https://arxiv.org/abs/2006.05525)
The key benefit of model distillation is that it allows developers to extract knowledge from large models - such as GPT-4 - and embed it into smaller, faster models. These small, fast models often reach answer quality similar to their ‘teacher’ counterparts in the specific, distilled areas - meaning model distillation is something of a “free lunch”: you get fast, cost-efficient and strong models without too many compromises.
Popular applications for model distillation are edge computing, real-time applications, low-latency applications and, generally speaking, environments with limited resources.
Model distillation is quite complex to set up, but with OpenAI's new API, developers and researchers now have direct access to distillation techniques, making it easier than ever to incorporate this optimization into their workflows.
How do you create a distilled model?
First, create an evaluation to measure the performance of the model you want to distill into - in this example, GPT-4o mini. This evaluation will be used to continuously test the distilled model’s performance and help you decide whether to deploy it.
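Not sure what such an evaluation could look like? Below is a minimal, hand-rolled sketch using the Chat Completions API. The test questions, reference answers and the simple substring check are our own assumptions for illustration - OpenAI’s hosted Evals product in the dashboard offers a more complete workflow.

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical test set - replace with questions and reference answers from your domain
test_cases = [
    {"question": "what's the capital of the USA?", "reference": "Washington, D.C."},
]

def evaluate(model_name: str) -> float:
    """Return the fraction of test cases where the model's answer contains the reference."""
    correct = 0
    for case in test_cases:
        response = client.chat.completions.create(
            model=model_name,
            messages=[{"role": "user", "content": case["question"]}],
        )
        answer = response.choices[0].message.content
        if case["reference"].lower() in answer.lower():
            correct += 1
    return correct / len(test_cases)

# Baseline with the teacher model; rerun later against the fine-tuned GPT-4o mini model
print("gpt-4o accuracy:", evaluate("gpt-4o"))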

Next, use the newly introduced Stored Completions to create a distillation dataset of real-world examples, using GPT-4o’s outputs for the tasks on which you want to fine-tune GPT-4o mini. You can do this by setting the store=True flag in the Chat Completions API to automatically store these input-output pairs without any latency impact. These stored completions can be reviewed, filtered, and tagged to create high-quality datasets for fine-tuning or evaluation.
(Stored Completions are nothing more than a dataset of questions and answers generated by the “teacher” model.)
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "what's the capital of the USA?"
                }
            ]
        }
    ],
    # store=True persists this input-output pair as a Stored Completion
    store=True,
    # metadata makes it easy to filter and tag stored completions later
    metadata={"username": "user123", "user_id": "123", "session_id": "123"},
)
Finally, use this dataset to fine-tune GPT-4o mini. Stored Completions can be used as a training file when creating a fine-tuned model. Once the model is fine-tuned, you can go back to Evals to test whether the fine-tuned GPT-4o mini model meets your performance criteria when compared to GPT-4o.
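For illustration, here is a minimal sketch of what starting such a fine-tuning job via the API could look like. It assumes you exported your selected Stored Completions as a JSONL training file - the filename and model snapshot below are placeholders, and in the dashboard you can also start the distillation directly from the Stored Completions view.

from openai import OpenAI

client = OpenAI()

# Assumption: the selected Stored Completions were exported as a JSONL training file
training_file = client.files.create(
    file=open("stored_completions.jsonl", "rb"),
    purpose="fine-tune",
)

# Start a fine-tuning job with GPT-4o mini as the distillation target
job = client.fine_tuning.jobs.create(
    model="gpt-4o-mini-2024-07-18",
    training_file=training_file.id,
)
print(job.id, job.status)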

What about costs?
Believe it or not, OpenAI currently offers model distillation for up to 2 million tokens per day for free.
Running the distilled model - assuming GPT-4o mini as the distillation target, which is the most reasonable option at the moment - costs:
$0.30/1 Million input tokens
$1.20/1 Million output tokens
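As a quick back-of-the-envelope example - assuming a hypothetical workload of 10 million input and 2 million output tokens per month:

# Hypothetical workload: 10 million input and 2 million output tokens per month
input_tokens = 10_000_000
output_tokens = 2_000_000

cost = (input_tokens / 1_000_000) * 0.30 + (output_tokens / 1_000_000) * 1.20
print(f"Estimated monthly cost: ${cost:.2f}")  # -> $5.40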
For more information, read the full announcement here.
Tutorials & Use Cases
Case Study: Building an AI content recommendation system with only PostgreSQL
Recently, we built an AI-driven SEO content recommendation system for a client of ours. We wanted to create a service that automatically recommends articles similar to the one currently being written - allowing for SEO-friendly, relevant inline link recommendations.
As with many of our projects, there were distinct requirements:
The technology used needs to reduce complexity as much as possible, as we are a small team.
We need easy access to large language models without needing to set up extensive infrastructure.
We need some sort of vector similarity search to find relevant articles.
The application needs a lightweight team permissions feature, as not every user should see every article.
After careful evaluation, we settled on using PostgreSQL as our main database and added two extensions: pgai and pgvectorscale - two extensions created by Timescale that provide AI integrations directly inside PostgreSQL.
The major point to drive home here: We didn’t need any additional infrastructure for all our AI integrations - just PostgreSQL with two extensions. This is a major improvement in terms of infrastructure complexity when it comes to AI projects.
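To give a rough idea of what such an in-database recommendation query looks like, here is a simplified sketch of the similarity search. The articles table, its columns, the embedding model and the connection string are assumptions for illustration, not the exact schema from the project.

import psycopg
from openai import OpenAI

client = OpenAI()

def recommend_articles(draft_text: str, limit: int = 5):
    """Return the stored articles most similar to the current draft."""
    # Embed the draft with the same model used for the stored article embeddings
    # (pgai can alternatively generate embeddings directly inside the database)
    embedding = client.embeddings.create(
        model="text-embedding-3-small",
        input=draft_text,
    ).data[0].embedding

    # Cosine-distance search via pgvector's <=> operator
    with psycopg.connect("postgresql://localhost/demo_db") as conn:
        return conn.execute(
            """
            SELECT title, url
            FROM articles
            ORDER BY embedding <=> %s::vector
            LIMIT %s
            """,
            (str(embedding), limit),
        ).fetchall()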
To get an idea of how we accomplished this - and to get a real-world impression of modern AI integrations - we’re happy to provide an in-depth case study and tutorial below.
Also in the news
OpenAI releases ChatGPT Canvas - a better way to create code and text
ChatGPT got a new interface for working on writing and coding projects that go beyond simple chat. Canvas opens in a separate window, allowing you and ChatGPT to collaborate on a project. In the Canvas you can directly edit text or code. There’s a menu of shortcuts for you to ask ChatGPT to adjust writing length, debug your code, and quickly perform other useful actions. You can also restore previous versions of your work by using the back button in Canvas.
Overall, this is a very good enhancement of ChatGPT, drastically improving its usefulness when it comes to generating long-form content and especially code!

Read OpenAI’s announcement here.
Google’s NotebookLM now supports YouTube content
NotebookLM is Google’s “AI Research Assistant” - a tool to import data, with a quite good Retrieval Augmented Generation system on top. You can upload documents, and NotebookLM lets you ask questions about your data. While we’ve seen these applications many times in the past, Google’s implementation is very good and takes it a step further: You can use NotebookLM to create draft content, with strong grounding in your uploaded documents. This makes it one of the best AI content applications at the moment.
With the latest release, NotebookLM now also supports YouTube videos as a data source - allowing you to use videos as part of your source content collection. This is very powerful indeed, as it lets you extract insights directly from videos without the hassle of transcribing them first. In combination with the excellent content drafting and Q&A support, NotebookLM seems to be one of the most exciting Google products in a long time.
Read their full announcement here.
Nvidia launches best Open Source vision model so far - rivalling GPT-4o and Llama 3.1 405B
NVIDIA has launched the NVLM-D 72B, a powerful multimodal LLM designed for high-performance vision-language tasks and text-based queries. This 72-billion parameter model achieves state-of-the-art results across benchmarks like TextVQA, OCRBench, and RealWorldQA, offering robust multimodal reasoning and complex problem-solving capabilities. Optimized for the Hugging Face platform, it allows for efficient multi-GPU deployment and supports both visual and textual inputs.
Three things remain remarkable:
It’s one of the best vision models to date, rivalling even GPT-4o.
It’s only 72B parameters - compared to the 405B parameters of Llama 3.1
It’s available in the EU (which Llama 405B is not) - suggesting that Meta had no AI regulatory reasons to withhold its model from this market.
For more details, read the model card here.
Tip of the week
Realtime Audio Transcriptions using OpenAI’s new whisper-3-turbo audio model
Recently, OpenAI released a new, much faster version of their remarkable Whisper 3 audio-to-text model, called Whisper 3 Turbo.
Using this model, multilingual realtime transcription is easy to implement - we just need a few lines of Python code.
In this example, we use the Gradio Audio interface, which gives us easy access to our microphone. However, any streaming audio input source would do.
As a prerequisite, make sure to install torch, gradio, scipy, transformers, numpy and flash-attn.
import time
import uuid

import gradio as gr
import scipy.io.wavfile
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, WhisperTokenizer, pipeline

device = "cuda" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16

MODEL_NAME = "openai/whisper-large-v3-turbo"

# Load the Whisper 3 Turbo model (flash attention requires a CUDA GPU and the flash-attn package)
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    MODEL_NAME,
    torch_dtype=torch_dtype,
    low_cpu_mem_usage=True,
    use_safetensors=True,
    attn_implementation="flash_attention_2",
)
model.to(device)

processor = AutoProcessor.from_pretrained(MODEL_NAME)
tokenizer = WhisperTokenizer.from_pretrained(MODEL_NAME)

# Speech-recognition pipeline; incoming audio is processed in 10-second chunks
pipe = pipeline(
    task="automatic-speech-recognition",
    model=model,
    tokenizer=tokenizer,
    feature_extractor=processor.feature_extractor,
    chunk_length_s=10,
    torch_dtype=torch_dtype,
    device=device,
)

def transcribe(inputs, previous_transcription):
    """Transcribe the latest audio chunk and append it to the running transcript."""
    start_time = time.time()
    try:
        # Write the incoming chunk to a temporary wav file for the pipeline
        filename = f"{uuid.uuid4().hex}.wav"
        sample_rate, audio_data = inputs
        scipy.io.wavfile.write(filename, sample_rate, audio_data)
        transcription = pipe(filename)["text"]
        previous_transcription += transcription
        latency = time.time() - start_time
        return previous_transcription, f"{latency:.2f}"
    except Exception as e:
        print(f"Error during transcription: {e}")
        return previous_transcription, "Error"

def clear():
    """Reset the transcription textbox."""
    return ""

with gr.Blocks() as realtime_demo:
    with gr.Column():
        gr.Markdown("Realtime Audio Transcription")
        with gr.Row():
            input_audio_microphone = gr.Audio(streaming=True)
            output = gr.Textbox(label="Transcription", value="")
            latency_textbox = gr.Textbox(label="Latency (seconds)", value="0.0", scale=0)
        with gr.Row():
            clear_button = gr.Button("Clear")
        # Stream microphone audio into the transcribe function every 2 seconds
        input_audio_microphone.stream(
            transcribe,
            [input_audio_microphone, output],
            [output, latency_textbox],
            time_limit=45,
            stream_every=2,
            concurrency_limit=None,
        )
        clear_button.click(clear, outputs=[output])

realtime_demo.launch()
This simple application provides a fully working template for realtime audio transcription in any of the languages supported by Whisper 3.
A working implementation of this code snippet was provided by user KingNish on Hugging Face (from which we also adapted most of the code used in this example). You can try it out without writing any code.
Tool of the week
Prompt Flow: Evaluating and comparing models against each other
What is Prompt Flow?
Prompt Flow is a feature within Azure AI Studio (or a standalone open-source Python package) that streamlines the evaluation and management of AI models. It provides a framework for testing models against key metrics, ensuring consistency, reliability, and improved performance through iterations.
Key Features and Benefits:
Seamless model comparison and upgrades
Tools to evaluate relevance, coherence, and safety
Integration with Azure's full suite of AI services
Automation in model deployment and evaluation
Why Choose Prompt Flow?
Prompt Flow simplifies the process of upgrading models, offering an intuitive way to maintain cutting-edge performance while handling model deprecations.
For a full tutorial on how to use PromptFlow, click the link below:
We hope you liked our newsletter and that you stay tuned for the next edition. If you need help with your AI tasks and implementations, let us know. We are happy to help.