Pondhouse Data AI - Edition 7
Free fine-tuning of GPT-4o-mini | Reflection loops: A simple & powerful way of increasing LLM output quality | Germany releases the best image model so far

Hey there,
We’re excited to bring you the 7th edition of our Pondhouse AI newsletter — your source for tips and tricks around AI and LLMs. Whether you want to learn about AI concepts, use AI tools effectively, or see inspiring examples, we’ve got you covered.
Let’s get started!
Cheers, Andreas & Sascha
In today's edition:
News: A German team launched a new image generation model, producing the best AI-generated images so far.
Tutorial: FREE fine-tuning of GPT-4o mini: A step-by-step guide.
Tip of the Week: Reflection loops: How to drastically improve LLM output with simple “reflection agents”
Tool of the Week: LLMDatasets - a collection of high-quality datasets for LLM fine-tuning
Find this Newsletter helpful?
Please forward it to your colleagues and friends - it helps us tremendously.
Top News
Surprise launch: A highly skilled German team released the best image generation model so far, named Flux.1
Who would have thought that, after Midjourney and Stable Diffusion, another model would gain a foothold in the image generation space?
And who would have thought that a small European team would take the lead in image generation - after American and Chinese models had dominated the market almost universally until now?
But here we are - and this is very good news:
We have a new, impressive image generation model
It shows that, even years after the "big players" introduced their models, it's still possible to innovate and deliver breakthrough models without needing billions of dollars in funding.
The model in question is the “Flux.1” image generation model, developed and released by a small German team “Black Forest Labs”.
Key Features:
Name: Flux.1
12 Billion parameters
Generated outputs can be used for personal, scientific, and commercial purposes
Open weights
Also available as hosted model via their API
Can be used with ComfyUI
Performance Metrics: The flagship models Flux.1 pro and Flux.1-dev both lead the Elo benchmark. Even their lighter Flux.1-fast model beats Midjourney v6.0.

What’s better with this model compared to others?
While Midjourney v6 and Stable Diffusion are certainly very good image generation models, Flux.1 beats them in virtually every relevant discipline:
Best-in-class rendering of text within generated images
The artificially generated people are by far the most convincing we've seen. It's incredibly hard to find any detail that gives an image away as AI-generated.
The model performs well across a variety of image styles. From realistic depictions of humans to cyberpunk-style dystopias and fantasy scenes - the quality is consistently high across all of them
Its output is relatively stable, meaning the same prompt produces similar results across invocations
For more details, read here.
Tutorials & Use Cases
FREE fine-tuning of GPT-4o-mini: A hands-on guide
GPT-4o-mini is the smaller sibling of GPT-4o. It is very cost-efficient and produces quite good results - however, it is clearly sub-par compared to the larger GPT-4 variants.
That said, GPT-4o-mini has one distinct advantage: it can be fine-tuned for free until September 23rd, 2024 - potentially making it a much better choice for specialised use cases.
Key Takeaways:
GPT-4o mini offers a balance between performance and cost-effectiveness.
Fine-tuning is free until September 23, 2024, making it an excellent time to experiment.
The process requires careful preparation of high-quality training data.
Step-by-Step Guide:
Prepare Your Dataset:
Format your data in JSONL (JSON Lines) format.
For chat models, use arrays of messages with "user" and "assistant" roles.
Aim for at least 100 examples, but 500-10,000 is typically more effective.
Access OpenAI's Fine-Tuning Dashboard:
Navigate to the OpenAI fine-tuning interface.
Click "Create" to start a new fine-tuning job.
Configure Your Fine-Tuning Job:
Select gpt-4o-mini as the base model.
Upload your JSONL training data file.
Set a suffix for easy identification of your model.
Leave batch size and learning rate multiplier on "auto" for initial runs.
Monitor the Training Process:
Keep an eye on the training loss chart.
A decreasing training loss indicates successful learning.
Test Your Fine-Tuned Model:
Use the OpenAI Chat Playground to interact with your model.
Implement the model in your application using the OpenAI SDK.
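The dataset preparation step above can be sketched in a few lines of Python. This is a minimal, hypothetical example - the file name and the training pairs are placeholders for your own data:

```python
import json

# Hypothetical training pairs: replace with your own high-quality examples.
examples = [
    {"question": "What does Pondhouse Data do?",
     "answer": "We help companies build data and AI solutions."},
    {"question": "Which model can currently be fine-tuned for free?",
     "answer": "GPT-4o mini, until September 23rd, 2024."},
]

# OpenAI fine-tuning expects JSONL: one JSON object per line,
# each holding a list of chat messages with roles.
with open("training_data.jsonl", "w", encoding="utf-8") as f:
    for ex in examples:
        record = {
            "messages": [
                {"role": "system", "content": "You are a helpful assistant."},
                {"role": "user", "content": ex["question"]},
                {"role": "assistant", "content": ex["answer"]},
            ]
        }
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
```

Upload the resulting `training_data.jsonl` in the fine-tuning dashboard; once training finishes, the model is called like any other chat model, just with the returned `ft:gpt-4o-mini-...` identifier as the model name.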

Best Practices:
Prioritize data quality over quantity.
Start with a moderate dataset and incrementally increase if needed.
Regularly evaluate your model's performance.
Use fine-tuning for altering behavior or output format, not for adding new knowledge (use RAG for that).
For a more complete guide, read the full tutorial here.
Also in the news
OpenAI hit by leadership exodus
OpenAI is experiencing a significant leadership crisis as several key figures announce their departures:
John Schulman, co-founder, is leaving to join rival Anthropic.
Jan Leike, co-lead of the superalignment group, departed for Anthropic in May.
Ilya Sutskever, co-founder and chief scientist, left to start his own AI safety startup.
Greg Brockman, president and co-founder, announced an extended sabbatical until year-end.
Peter Deng, a high-profile AI figure who joined last year, has also exited the company.
These departures come at a challenging time for OpenAI, which is currently facing a lawsuit from Elon Musk over alleged breaches of its non-profit commitments. With only three of the original eleven founders remaining, OpenAI faces an uphill battle amidst increasing competition and legal challenges in the rapidly evolving AI landscape.
This issue highlights one very important aspect of your AI projects: make sure not to rely solely on OpenAI's models. While we are big fans of their models, we need to make sure not to lock ourselves in!
Read the full article here.
Another big win for Open Source: Qwen2-Math is the best model for math tasks
Alibaba's DAMO Academy has launched Qwen2-Math, a series of specialized large language models designed to excel in mathematical reasoning and problem-solving. Built upon the Qwen2 foundation, these models have demonstrated remarkable capabilities:
Qwen2-Math-72B-Instruct outperforms leading AI models, including GPT-4 and Claude 3.5, on various math benchmarks.
Researchers employed innovative techniques, including a math-specific reward model and reinforcement learning, to enhance performance.
Rigorous decontamination methods were applied to ensure the integrity of training and evaluation data.
For more details, see the official model repository here.
Cohere co-founder: we need to be more realistic about what AI can and cannot do
Despite concerns of an AI bubble, industry insiders like Nick Frosst, co-founder of Cohere, argue that the technology is delivering tangible value to enterprises.
Frosst highlights the practical applications of AI in automating processes and enabling new features for businesses. However, Frosst cautions against overhyping AI's potential, stating that artificial general intelligence is likely far off.
The key takeaway? While AI shows tremendous promise and is creating significant value, companies should maintain a realistic perspective on its capabilities and limitations. As the industry matures, a balanced view of AI's potential may help separate genuine innovations from inflated expectations.
After several AI projects, we agree with the above statements. We suggest:
1. Don't invest hundreds of thousands of dollars in AI projects up front. Start with small prototypes, implemented by specialised teams.
2. See which of the prototypes produce value. Scrap the others. Experiment.
3. Use hosted LLMs like Azure OpenAI to iterate fast. Once you've found a good use case, consider a more specialised, fine-tuned LLM, e.g. from the Llama 3.1 family.
For more details, read the full article here.
Tip of the week
Drastically improve LLM output with simple reflection agents
Due to the stochastic nature of LLMs, their output varies from invocation to invocation. This can be quite frustrating, as you may see very good results one minute and catastrophic results the next - with one and the same prompt.
To tackle this issue, we can implement a pattern called "Reflection Agents". The concept is rather simple:
We send our main prompt to an LLM and get an answer.
We use a second LLM to rate and validate the output of the first LLM.
We send the feedback back to the first LLM, asking it to incorporate it.

This works very well to make LLM outputs more reliable and stable.
Example main prompt (system prompt):
"You are a system to rate blog articles regarding SEO best practices.
Generate on-page SEO suggestions for the user's article.
If the user provides feedback, integrate the feedback and respond with a better version of your previous response."
Example reflection prompt:
"You are a system to grade SEO recommendations. Generate feedback for the user's SEO recommendations.
Be very strict, as we want to learn from your feedback."
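The loop described above can be sketched as follows. The `generate` and `critique` callables are placeholders; in practice, each would wrap a chat-completion call (e.g. `client.chat.completions.create`) using the system prompts shown above:

```python
def reflection_loop(generate, critique, task, rounds=2):
    """Iteratively refine an answer: generate, critique, regenerate.

    generate(task, feedback) -> str: the main LLM; revises using feedback if given.
    critique(answer) -> str: the reviewer LLM grading the answer.
    """
    feedback = None
    answer = None
    for _ in range(rounds):
        answer = generate(task, feedback)   # main LLM produces (or revises) an answer
        feedback = critique(answer)         # second LLM rates the answer strictly
    return answer

# Stub LLMs for illustration only; replace with real API calls.
def fake_generate(task, feedback):
    return f"SEO suggestions for '{task}'" + (" (revised)" if feedback else "")

def fake_critique(answer):
    return "Be more specific about keyword placement."

result = reflection_loop(fake_generate, fake_critique, "my blog article")
```

Two rounds are usually enough; more rounds add latency and cost with diminishing returns.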
Tool of the week
LLMDatasets - a collection of high-quality datasets for LLM fine-tuning
Fine-tuning is still a very good method to change the behaviour of models and create high-performing outputs for specialised use-cases.
For fine-tuning, data is the most valuable asset. While datasets can't be directly evaluated like models, high-quality datasets have the following characteristics:
Accuracy: Samples should be factually correct, helpful to users, and well-written. Answers should also be relevant to their corresponding instructions.
Diversity: You want to cover as many use cases as possible to ensure proper instruction-following and relevant answers. This requires a wide range of topics, contexts, lengths, writing styles, etc. sampled in a representative way.
Complexity: Answers should be nontrivial and a) representative of tasks you expect the model to handle or b) include complex tasks involving multi-step reasoning, planning, etc.
And it's very hard to come up with such datasets. That's where the open-source GitHub repository "LLMDatasets" comes into play.
It’s one of the most comprehensive collections of high-quality datasets out there. Most datasets are already formatted to be used for fine-tuning.
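As an aside, datasets that are not yet in chat format can usually be converted with a few lines of code. A minimal sketch, assuming the common Alpaca-style `instruction`/`input`/`output` layout (the field names are an assumption about your source data):

```python
def to_chat_format(record):
    """Convert an Alpaca-style instruction/output pair into a chat-style sample."""
    messages = [{"role": "user", "content": record["instruction"]}]
    if record.get("input"):  # optional extra context in Alpaca-style data
        messages[0]["content"] += "\n\n" + record["input"]
    messages.append({"role": "assistant", "content": record["output"]})
    return {"messages": messages}

# Hypothetical sample record:
sample = {
    "instruction": "Summarize:",
    "input": "LLMs are large language models.",
    "output": "LLMs model language.",
}
chat = to_chat_format(sample)
```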
Currently, the following topics are covered:
General purpose & synthetic data: a diverse mix of real-world and synthetic data to transform base models into versatile, capable assistants.
Math & logic: problems that extend beyond pure mathematics, encompassing a wide range of tasks that require systematic thinking and step-by-step reasoning, ultimately enabling LLMs to tackle complex real-world challenges.
Code: diverse programming-language examples used to fine-tune LLMs and enhance their ability to understand, generate, and analyze code.
Conversation & role-play: samples that expose LLMs to the patterns, nuances, and context-dependent nature of real conversations, allowing them to generate more natural and engaging dialogues.
Agent & function calling: function calling allows LLMs to execute predefined functions with parameters inferred from user prompts, rather than generating plain text responses, enabling them to integrate with external systems.
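To make the function-calling idea concrete, here is a minimal sketch. The `get_weather` tool and the LLM's reply are hypothetical; the schema follows the style used by OpenAI-compatible APIs:

```python
import json

# A function-calling setup has two parts: a JSON-schema description the
# LLM sees, and the actual function your code runs when the LLM picks it.
tool = {
    "type": "function",
    "function": {
        "name": "get_weather",  # hypothetical example tool
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}

def get_weather(city: str) -> str:
    return f"Sunny in {city}"  # stub; a real tool would call a weather API

# Instead of plain text, the LLM responds with a function name plus
# arguments as JSON, which your code parses and dispatches:
llm_tool_call = '{"name": "get_weather", "arguments": {"city": "Vienna"}}'
call = json.loads(llm_tool_call)
result = get_weather(**call["arguments"])
```

Fine-tuning on such samples teaches the model when to emit a tool call and how to fill in the parameters correctly.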
For more information and examples, visit the LLMDatasets Github repository.
We hope you liked our newsletter - stay tuned for the next edition. If you need help with your AI tasks and implementations, let us know. We are happy to help!