Episode 49: Why Data and AI Still Break at Scale (and What to Do About It)
If we want AI systems that actually work in production, we need better infrastructure—not just better models.
In this episode, Hugo talks with Akshay Agrawal (Marimo, ex-Google Brain, Netflix, Stanford) about why data and AI pipelines still break down at scale, and how we can fix the fundamentals: reproducibility, composability, and reliable execution.
They discuss:
🔁 Why reactive execution matters—and how current tools fall short
🛠️ The design goals behind Marimo, a new kind of Python notebook
⚙️ The hidden costs of traditional workflows (and what breaks at scale)
📦 What it takes to build modular, maintainable AI apps
🧪 Why debugging LLM systems is so hard—and what better tooling looks like
🌍 What we can learn from decades of tools built for and by data practitioners
Toward the end of the episode, Hugo and Akshay walk through two live demos: Hugo shares how he’s been using Marimo to prototype an app that extracts structured data from world leader bios, and Akshay shows how Marimo handles agentic workflows with memory and tool use—built entirely in a notebook.
This episode is about tools, but it’s also about culture. If you’ve ever hit a wall with your current stack—or felt like your tools were working against you—this one’s for you.
LINKS
* marimo | a next-generation Python notebook (https://marimo.io/)
* SciPy conference, 2025 (https://www.scipy2025.scipy.org/)
* Hugo's Marimo World Leader Face Embedding demo (https://www.youtube.com/watch?v=DO21QEcLOxM)
* Vanishing Gradients YouTube Channel (https://www.youtube.com/channel/UC_NafIo-Ku2loOLrzm45ABA)
* Upcoming Events on Luma (https://lu.ma/calendar/cal-8ImWFDQ3IEIxNWk)
* Hugo's recent newsletter about upcoming events and more! (https://hugobowne.substack.com/p/ai-as-a-civilizational-technology)
* Watch the podcast here on YouTube! (https://youtube.com/live/WVxAz19tgZY?feature=share)
🎓 Want to go deeper?
Check out Hugo's course: Building LLM Applications for Data Scientists and Software Engineers.
Learn how to design, test, and deploy production-grade LLM systems — with observability, feedback loops, and structure built in.
This isn’t about vibes or fragile agents. It’s about making LLMs reliable, testable, and actually useful.
Includes over $800 in compute credits and guest lectures from experts at DeepMind, Moderna, and more.
Cohort starts July 8 — Use this link for a 10% discount (https://maven.com/hugo-stefan/building-llm-apps-ds-and-swe-from-first-principles?promoCode=LLM10)
--------
1:21:45
Episode 48: How to Benchmark AGI with Greg Kamradt
If we want to make progress toward AGI, we need a clear definition of intelligence—and a way to measure it.
In this episode, Hugo talks with Greg Kamradt, President of the ARC Prize Foundation, about ARC-AGI: a benchmark built on François Chollet’s definition of intelligence as “the efficiency at which you learn new things.” Unlike most evals that focus on memorization or task completion, ARC is designed to measure generalization and to expose where today’s top models fall short.
They discuss:
🧠 Why we still lack a shared definition of intelligence
🧪 How ARC tasks force models to learn novel skills at test time
📉 Why GPT-4-class models still underperform on ARC
🔎 The limits of traditional benchmarks like MMLU and Big-Bench
⚙️ What the OpenAI o3 results reveal, and what they don’t
💡 Why generalization and efficiency, not raw capability, are key to AGI
Greg also shares what he’s seeing in the wild: how startups and independent researchers are using ARC as a North Star, how benchmarks shape the frontier, and why the ARC team believes we’ll know we’ve reached AGI when humans can no longer write tasks that models can’t solve.
This conversation is about evaluation—not hype. If you care about where AI is really headed, this one’s worth your time.
LINKS
* ARC Prize -- What is ARC-AGI? (https://arcprize.org/arc-agi)
* On the Measure of Intelligence by François Chollet (https://arxiv.org/abs/1911.01547)
* Greg Kamradt on Twitter (https://x.com/GregKamradt)
* Hugo's High Signal Podcast with Fei-Fei Li (https://high-signal.delphina.ai/episode/fei-fei-on-how-human-centered-ai-actually-gets-built)
* Vanishing Gradients YouTube Channel (https://www.youtube.com/channel/UC_NafIo-Ku2loOLrzm45ABA)
* Upcoming Events on Luma (https://lu.ma/calendar/cal-8ImWFDQ3IEIxNWk)
* Hugo's recent newsletter about upcoming events and more! (https://hugobowne.substack.com/p/ai-as-a-civilizational-technology)
* Watch the podcast here on YouTube! (https://youtu.be/wU82fz4iRfo)
--------
1:04:25
Episode 47: The Great Pacific Garbage Patch of Code Slop with Joe Reis
What if the cost of writing code dropped to zero — but the cost of understanding it skyrocketed?
In this episode, Hugo sits down with Joe Reis to unpack how AI tooling is reshaping the software development lifecycle — from experimentation and prototyping to deployment, maintainability, and everything in between.
Joe is the co-author of Fundamentals of Data Engineering and a longtime voice on the systems side of modern software. He’s also one of the sharpest critics of “vibe coding” — the emerging pattern of writing software by feel, with heavy reliance on LLMs and little regard for structure or quality.
We dive into:
• Why “vibe coding” is more than a meme — and what it says about how we build today
• How AI tools expand the surface area of software creation — for better and worse
• What happens to technical debt, testing, and security when generation outpaces understanding
• The changing definition of “production” in a world of ephemeral, internal, or just-good-enough tools
• How AI is flattening the learning curve — and threatening the talent pipeline
• Joe’s view on what real craftsmanship means in an age of disposable code
This conversation isn’t about doom, and it’s not about hype. It’s about mapping the real, messy terrain of what it means to build software today — and how to do it with care.
LINKS
* Joe's Practical Data Modeling Newsletter on Substack (https://practicaldatamodeling.substack.com/)
* Joe's Practical Data Modeling Server on Discord (https://discord.gg/HhSZVvWDBb)
* Vanishing Gradients YouTube Channel (https://www.youtube.com/channel/UC_NafIo-Ku2loOLrzm45ABA)
* Upcoming Events on Luma (https://lu.ma/calendar/cal-8ImWFDQ3IEIxNWk)
--------
1:19:12
Episode 46: Software Composition Is the New Vibe Coding
What if building software felt more like composing than coding?
In this episode, Hugo and Greg explore how LLMs are reshaping the way we think about software development—from deterministic programming to a more flexible, prompt-driven, and collaborative style of building. It’s not just hype or grift—it’s a real shift in how we express intent, reason about systems, and collaborate across roles.
Hugo speaks with Greg Ceccarelli—co-founder of SpecStory, former CPO at Pluralsight, and Director of Data Science at GitHub—about the rise of software composition and how it changes the way individuals and teams create with LLMs.
We dive into:
- Why software composition is emerging as a serious alternative to traditional coding
- The real difference between vibe coding and production-minded prototyping
- How LLMs are expanding who gets to build software—and how
- What changes when you focus on intent, not just code
- What Greg is building with SpecStory to support collaborative, traceable AI-native workflows
- The challenges (and joys) of debugging and exploring with agentic tools like Cursor and Claude
We’ve removed the visual demos from the audio—but you can catch our live-coded Chrome extension and JFK document explorer on YouTube. Links below.
JFK Docs Vibe Coding Demo (YouTube) (https://youtu.be/JpXCkuV58QE)
Chrome Extension Vibe Coding Demo (YouTube) (https://youtu.be/ESVKp37jDwc)
Meditations on Tech (Greg’s Substack) (https://www.meditationsontech.com/)
Simon Willison on Vibe Coding (https://simonwillison.net/2025/Mar/19/vibe-coding/)
Johno Whitaker: On Vibe Coding (https://johnowhitaker.dev/essays/vibe_coding.html)
Tim O’Reilly – The End of Programming (https://www.oreilly.com/radar/the-end-of-programming-as-we-know-it/)
Vanishing Gradients YouTube Channel (https://www.youtube.com/channel/UC_NafIo-Ku2loOLrzm45ABA)
Upcoming Events on Luma (https://lu.ma/calendar/cal-8ImWFDQ3IEIxNWk)
Greg Ceccarelli on LinkedIn (https://www.linkedin.com/in/gregceccarelli/)
Greg’s Hacker News Post on GOOD (https://news.ycombinator.com/item?id=43557698)
SpecStory: GOOD – Git Companion for AI Workflows (https://github.com/specstoryai/getspecstory/blob/main/GOOD.md)
🎓 Want to go deeper?
Check out my course: Building LLM Applications for Data Scientists and Software Engineers.
Learn how to design, test, and deploy production-grade LLM systems — with observability, feedback loops, and structure built in.
This isn’t about vibes or fragile agents. It’s about making LLMs reliable, testable, and actually useful.
Includes over $2,500 in compute credits and guest lectures from experts at DeepMind, Moderna, and more.
Cohort starts April 7 — Use this link for a 10% discount (https://maven.com/hugo-stefan/building-llm-apps-ds-and-swe-from-first-principles?promoCode=LLM10)
🔍 Want to help shape the future of SpecStory?
Greg and the team are looking for design partners for their new SpecStory Teams product—built for collaborative, AI-native software development.
If you're working with LLMs in a team setting and want to influence the next wave of developer tools, you can apply here:
👉 specstory.com/teams (https://specstory.com/teams)
--------
1:08:57
Episode 45: Your AI application is broken. Here’s what to do about it.
Too many teams are building AI applications without truly understanding why their models fail. Instead of jumping straight to LLM evaluations, dashboards, or vibe checks, how do you actually fix a broken AI app?
In this episode, Hugo speaks with Hamel Husain, longtime ML engineer, open-source contributor, and consultant, about why debugging generative AI systems starts with looking at your data.
We dive into:
Why “look at your data” is the best debugging advice no one follows.
How spreadsheet-based error analysis can uncover failure modes faster than complex dashboards.
The role of synthetic data in bootstrapping evaluation.
When to trust LLM judges—and when they’re misleading.
Why AI dashboards measuring truthfulness, helpfulness, and conciseness are often a waste of time.
If you're building AI-powered applications, this episode will change how you approach debugging, iteration, and improving model performance in production.
LINKS
The podcast livestream on YouTube (https://youtube.com/live/Vz4--82M2_0?feature=share)
Hamel's blog (https://hamel.dev/)
Hamel on Twitter (https://x.com/HamelHusain)
Hugo on Twitter (https://x.com/hugobowne)
Vanishing Gradients on Twitter (https://x.com/vanishingdata)
Vanishing Gradients on YouTube (https://www.youtube.com/channel/UC_NafIo-Ku2loOLrzm45ABA)
Vanishing Gradients on Lu.ma (https://lu.ma/calendar/cal-8ImWFDQ3IEIxNWk)
Building LLM Applications for Data Scientists and SWEs, Hugo's course on Maven (use VG25 code for 25% off) (https://maven.com/s/course/d56067f338)
Hugo is also running a free lightning lesson next week on LLM Agents: When to Use Them (and When Not To) (https://maven.com/p/ed7a72/llm-agents-when-to-use-them-and-when-not-to?utm_medium=ll_share_link&utm_source=instructor)
A podcast about all things data, brought to you by data scientist Hugo Bowne-Anderson.
It's time for more critical conversations about the challenges in our industry in order to build better compasses for the solution space! To this end, this podcast will consist of long-format conversations between Hugo and other people who work broadly in the data science, machine learning, and AI spaces. We'll dive deep into all the moving parts of the data world, so if you're new to the space, you'll have an opportunity to learn from the experts. And if you've been around for a while, you'll find out what's happening in many other parts of the data world.