Pondhouse Data AI - Tips & Tutorials for Data & AI 31

GPT-5: Big leap towards AGI? | WebAgent for Real Web Tasks | Shift LLM Load to Clients | When Models Think Too Much

Hey there,

This week, we’re diving into a mix of hands-on techniques and industry shakeups. We’ll show you how to offload LLM calls to the client using FastMCP’s sampling method, explore WebAgent from Alibaba for real web interaction, and take a hard look at OpenAI’s GPT‑5 release—a launch that sparked more confusion than celebration.

You'll also find updates from ElevenLabs, Google, and OpenAI in the "Also in the News" section, plus one tip that could change how you think about agent reliability: Context Engineering from the team behind Manus.

Enjoy the read!

Cheers, Andreas & Sascha

In today's edition:

📚 Tutorial of the Week: Offload LLM Inference to Clients with FastMCP’s Sampling Technique

🛠️ Tool Spotlight: Alibaba’s WebAgent – An Open Framework for Real Web Automation

📰 Top News: GPT‑5: Overhyped Launch, Modest Gains, and a Bumpy Live Demo

💡 Tip: Context Engineering – How Manus Designs Agents That Don't Fall Apart

Let's get started!

Find this Newsletter helpful?
Please forward it to your colleagues and friends - it helps us tremendously.

Tutorial of the week

Scale LLMs Smartly with FastMCP’s LLM‑Sampling

In this week’s tutorial, we introduce a clever technique for anyone building with MCP servers: LLM Sampling with FastMCP. Instead of letting your server handle every language model call—which can quickly become a bottleneck—you can delegate those calls to the client. This keeps your server logic clean and scalable, while still delivering full AI-powered experiences. Whether you're running agents, workflows, or API-connected assistants, this pattern can dramatically reduce load and cost.

Why LLM Sampling Matters:

  • Improved Scalability: Server avoids costly LLM compute loads, enabling higher user concurrency.

  • Cost Efficiency: Clients handle model calls—so you reduce centralized processing costs.

  • Greater Flexibility: Clients can use local or preferred LLMs for inference.

  • Lower Latency: By bypassing server-side queuing, response times improve.

How It Works:

The server sends a prompt to the client via FastMCP’s context object, like:

# Inside a FastMCP tool: ask the connected client's LLM to run this prompt
response = await ctx.sample(
    messages=f"Summarize this document:\n{document_text}",
    system_prompt="Extract key ideas, then summarize.",
    temperature=0.7,
    max_tokens=300,
)
# response.text then contains the summary generated on the client side

The client handles the LLM request and returns the result—just as if the server had done it itself.
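
For a feel of the other half, here is a rough sketch of the client side, based on FastMCP's sampling-handler hook. Treat it as an assumption-laden sketch: exact imports and signatures can differ between FastMCP versions, the server script name is a placeholder, and the return value stands in for whatever model you actually call.

from fastmcp import Client
from fastmcp.client.sampling import RequestContext, SamplingMessage, SamplingParams

async def sampling_handler(
    messages: list[SamplingMessage],
    params: SamplingParams,
    context: RequestContext,
) -> str:
    # Flatten the server's prompt into plain text for the client's model.
    prompt = "\n".join(
        m.content.text for m in messages if m.content.type == "text"
    )
    # Call any LLM here (local model, OpenAI, Anthropic, ...);
    # this placeholder just proves the round trip works.
    return f"Client-side answer to: {prompt[:80]}"

# Server-side ctx.sample() calls are routed into the handler above.
# "my_mcp_server.py" is a placeholder for your own server entry point.
client = Client("my_mcp_server.py", sampling_handler=sampling_handler)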

What You’ll Learn:

  • Setting up ctx.sample in your FastMCP tools.

  • Implementing a sampling handler on the client to process LLM calls.

  • A fully functional example demonstrating how server logic and client inference team up.

Try it now to keep your MCP server light, fast, and cost-efficient—while still leveraging powerful AI workflows.

Read the full tutorial here: LLM Sampling with FastMCP

Tool of the week

WebAgent – Alibaba’s Fully Open Source Browser Agent

This week’s featured tool comes from Alibaba’s NLP team: WebAgent, a powerful, fully open-source framework for training and running web-based AI agents. These agents can interact with real websites—clicking, typing, searching, and navigating—as if they were human users. And unlike most agent frameworks, WebAgent includes both the model and the data, so you can train, fine-tune, or adapt it to your own use case without starting from scratch.

🧠 What Makes WebAgent Stand Out:

  • Pretrained Agent Model (WebAgentLM): A 7B open-source language model trained specifically for web interaction tasks.

  • Built-In Web Environment: Includes a sandboxed Chrome environment with instrumentation for capturing actions and DOM info.

  • Training Dataset Included: Comes with WebAgentInstruct, a dataset of 226k web interaction demonstrations across thousands of websites.

  • Task Types: From booking flights to changing settings in a dashboard—WebAgent is designed for real-world web automation.

🛠️ Use Cases:

  • Web-based RPA and automation

  • In-browser AI copilots

  • Agent training on custom domains and sites

  • AI-driven form filling, searching, and UI testing

With the growing interest in AI agents that can operate real apps and tools, WebAgent is one of the most complete and production-relevant open tools available today.

Top News of the week

GPT‑5 – Big Step Toward AGI or Just a Small Bump Forward?

After months of heavy speculation and ambitious marketing, OpenAI has officially launched GPT‑5—a model that was widely expected to represent a major leap toward AGI. And while it's certainly better, it’s not quite the breakthrough many anticipated.

According to benchmarks from LLM Stats, GPT‑5 consistently outperforms Anthropic’s Claude Sonnet 4 and Google’s Gemini 2.5 Pro across most academic, coding, and reasoning tasks. But the lead is often incremental, not transformational. On SWE-bench, for instance, GPT‑5 reaches 74.9% accuracy with “thinking” mode enabled—well above GPT‑4o and earlier models, but only marginally better than Claude Opus 4.1, which lands in the mid‑70s on the same benchmark. It’s a clear improvement, just not the AGI moment some were hoping for.

Even more disappointing than the modest benchmark gains was the botched launch event. The live stream included completely inaccurate charts and code demos that didn’t even run, with experienced developers like Jeremy Howard calling them out publicly. Combined with sky-high expectations, these issues led to confusion and frustration rather than excitement.

That said, GPT‑5 does introduce a few promising usability features, such as an automatic routing mechanism that switches between fast and “thinking” inference modes depending on query complexity. This simplifies interaction and is a clear nod to OpenAI’s focus on broader accessibility and productization.

You can read the full announcement here: Introducing GPT‑5
Or watch the full (and bumpy) release event: Launch Video

While GPT‑5 is clearly an evolution, not a revolution, it still raises the bar—just not in the way the AGI-hungry crowd was hoping for.

Also in the news

ElevenLabs Launches AI Music Generation

ElevenLabs, best known for its AI voice technology, has now entered the world of music generation. With the new tool, users can generate original songs complete with vocals, instrumentation, and lyrics—all from a single prompt. It supports various genres and styles, and the results are impressively coherent and production-ready. While it’s still early, this release positions ElevenLabs as a serious player in AI-generated audio beyond speech.

🔗 Try it out

Google Releases Gemini Embeddings + Flash-Lite Model at Shockingly Low Price

Google has quietly launched its Gemini embedding model, now available through the Gemini API, making it easier to build retrieval-augmented systems within Google’s ecosystem. In addition, the Gemini 2.5 Flash-Lite model is now stable and generally available—and its pricing is aggressive: just $0.04 per million input tokens and $0.15 per million output tokens, undercutting most competitors. It’s optimized for high-speed tasks where cost and latency matter.
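
If you want to kick the tires, an embedding call with the google-genai Python SDK looks roughly like this (the model ID and response fields follow Google’s current docs and may change as the API evolves):

from google import genai

client = genai.Client()  # reads GEMINI_API_KEY from the environment

# Embed a couple of documents for a simple retrieval setup.
result = client.models.embed_content(
    model="gemini-embedding-001",
    contents=[
        "How does LLM sampling work in FastMCP?",
        "What did GPT-5 score on SWE-bench?",
    ],
)

for embedding in result.embeddings:
    print(len(embedding.values))  # dimensionality of each vector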

Thinking Too Much Hurts – New Paper Reveals “Inverse Scaling” at Test-Time

A new study shows that too much reasoning can actually degrade performance in large reasoning models (LRMs). The paper introduces evaluation tasks—like counting with distractors or deductive puzzles—where models perform worse as they reason longer. The research highlights five failure modes, including distraction from irrelevant data, overfitting to framing, and even increased self-preservation language in some models (e.g., Claude Sonnet 4) when given extended test-time compute. This study is a key reminder: more compute isn’t always better.

Google and OpenAI Roll Out Learning-Focused AI Features

Both Google and OpenAI are pushing into AI-assisted education. Google introduced a new “Guided Learning” mode across its platforms, offering step-by-step learning experiences powered by Gemini. Meanwhile, OpenAI launched “Study Mode” inside ChatGPT, enabling students to review topics with summaries, flashcards, and even practice questions—all AI-generated. This signals a clear trend: AI is becoming more than a tool—it’s becoming a tutor.

Tip of the week

Context Engineering — The Real Secret Behind Reliable AI Agents

In their insightful blog post, the Manus team reveals why success in AI agents isn’t just about smarter models—it’s about how you present information to them. They call this art and science Context Engineering, and it’s critical for agents that need to think, act, and recover consistently.

Core Principles from Manus' Context Playbook:

  • Optimize KV-Cache Performance: Use a stable prompt prefix and append-only context to ensure cache hits—key to fast, cheap responses (see the sketch after this list).

  • Mask, Don’t Remove Tools: Instead of dynamically dropping tool options, Manus masks irrelevant ones to preserve cache integrity and avoid confusion.

  • Use File System as Memory: Shifting long-term context to files reduces token clutter and keeps important state accessible without bloating input.

  • Recite Goals with a todo.md: Regularly rewrite objectives at the end of the context to keep the model focused, even in long multi-step tasks.

  • Keep Mistakes Visible: Rather than erasing failures, include them so the agent learns to avoid repeating errors—true adaptability in action.

  • Introduce Controlled Variability: Avoid falling into repetitive behaviors by mixing up format, phrasing, or serialization to steer the model away from "copy-paste" loops.
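
To make the first two principles more concrete, here is a minimal, framework-agnostic sketch in Python. It is our illustration, not Manus’ actual code: the message format, tool names, and the allowed_tools field are hypothetical placeholders for whatever your model provider supports.

# Illustrative only: a stable, append-only context plus tool masking.
STABLE_PREFIX = [
    {"role": "system", "content": "You are a careful task-execution agent."},
]

# All tool definitions stay in the prompt on every step, so the cached
# prefix never changes; only the subset the model may call varies.
ALL_TOOLS = ["browser_open", "browser_click", "shell_run", "file_write"]

def build_request(history: list[dict], allowed_tools: set[str]) -> dict:
    # Append-only: earlier messages are never rewritten or reordered,
    # which keeps the KV-cache prefix valid across steps.
    messages = STABLE_PREFIX + history
    return {
        "messages": messages,
        "tools": ALL_TOOLS,  # tool definitions are never removed mid-task
        # Hypothetical field: constrain (mask) which tools may be called
        # on this step instead of deleting their definitions.
        "allowed_tools": sorted(allowed_tools),
    }

history = [{"role": "user", "content": "Book a table for two on Friday."}]
request = build_request(history, {"browser_open", "browser_click"})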

Why It Matters:

Context engineering isn’t a minor detail—it defines how fast your agents run, how well they recover from errors, and how efficiently they scale across tasks. Manus learned these lessons through repeated trial and error, a process they call “Stochastic Graduate Descent”, and it makes all the difference.

Explore how to engineer context with precision:
Read the full post on Context Engineering by Manus

We hope you liked our newsletter and that you stay tuned for the next edition. If you need help with your AI tasks and implementations, let us know. We are happy to help!