The Lean Discipline of AI Adoption
The Model Is the Cheapest Part
If you’re reading this, you’ve lived through three or four of these moments in your career already.
Mobile in 2010. Big data in 2014. “Every company is a software company” around 2017. The shape is always the same. A real and important capability arrives. Everyone scrambles. The vendor list explodes. Pilots launch. Most of them produce nothing measurable.
And then, quietly, eighteen months later, the companies that come out of the wave with actual leverage turn out not to be the ones with the most pilots, or the biggest budget, or the earliest start. They’re the ones who installed a discipline for telling real progress from looking-busy progress, before they started spending.
AI is the same shape. If you’re paying close attention to it now, you’re not behind. You’re early to the part that matters.
The fix isn’t a vendor. It isn’t a model. It’s a discipline. And the books that name it were written years before AI showed up.
The two books, fast
Eric Ries wrote The Lean Startup in 2011. Most people remember the MVP. The load-bearing idea is validated learning: measuring whether you actually learned something, not whether you were busy. He called the discipline that supports it innovation accounting: vanity metrics that always go up vs. actionable metrics tied to a specific hypothesis. That book is the team-scale playbook.
Three years later, Jez Humble, Joanne Molesky, and Barry O’Reilly wrote Lean Enterprise. The part structure is literally “Explore” and “Exploit.” Explore and exploit are different management problems. Different governance. Different funding cadence. Different metrics. Different leadership behavior. Their replacement for command-and-control is Mission Command: the leader sets the outcome and the boundary, the team owns the execution. The book is the org-scale playbook.
Both were prescient. Both were ignored. Most of what we’re now calling AI failure is the cost of having ignored them.
Think about the first time you cooked something new. Not the Food Network version where everything plates beautifully on the first take. The real first attempt. You read the recipe twice. You measured everything. You didn’t multitask. You did one thing at a time, slowly, because you were learning what good looked like. You also burned something. Undersalted something. Took twice as long as the recipe claimed. Figured out which step the recipe was quietly lying about. By the fifth Tuesday, it was just dinner, done while you scrolled your phone, no recipe card needed. The polished show skips the part that mattered. AI workflows work the same way. There’s no shortcut around the careful first attempt. There’s just the question of whether you took it.
Here’s how the recipe works for a team. If you’re a leader trying to bring AI into a workflow your team already owns, three steps, in order. At each step the team has a job and you have a job. Yours is what makes theirs possible. Skip yours and you’ll have a demo that doesn’t survive a Tuesday.
1. Spike. Smallest test on the strongest tool.
The team picks one real workflow they already touch every week. Runs it end-to-end on the strongest model available. Doesn’t optimize the prompt. Doesn’t shop for tools. Doesn’t throttle the cost. The team’s only job is to find out whether this work is AI-doable at all today, by anyone. If the strongest model can’t, no cheaper one will.
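A spike can literally be a dozen lines. Here is a minimal sketch using Anthropic’s Python SDK; the model name, prompt, and sample file are placeholders I’ve made up, and the only point is that nothing in it is engineered:

```python
# spike.py -- smallest test on the strongest tool.
# Hypothesis, written down before running: "the strongest model available can
# handle one of our real inputs acceptably, with no prompt engineering at all."
import anthropic  # assumes ANTHROPIC_API_KEY is set in the environment

client = anthropic.Anthropic()

# One real input from a workflow the team already touches every week (placeholder path).
real_input = open("samples/real_input_01.txt").read()

response = client.messages.create(
    model="claude-opus-4-20250514",  # placeholder: whatever the strongest model is today
    max_tokens=2000,
    messages=[{
        "role": "user",
        # Deliberately plain prompt: no few-shot examples, no tuning, no retries.
        "content": "Do this workflow step on the document below and return the result:\n\n"
                   + real_input,
    }],
)

print(response.content[0].text)  # a human judges this, not a metric
```

If that output is acceptable, you’ve learned the work is AI-doable. If it isn’t, no amount of tooling below this tier will save it.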
What the leader sets up is the explore budget. Time. Model credits. No PR-ready outcome required. The hypothesis written down before the work starts, in one paragraph: what would we learn from this? And a stop-date. Mission Command in plain English: here’s the outcome we’re testing, here’s the boundary, come back in two weeks with what you learned. Leaders who turn this into a multi-quarter roadmap kill it before it starts.
2. Specify. What good actually looks like.
Almost everyone skips Specify. It’s where the validated learning Ries described actually happens, and it’s where most AI dashboards quietly fail.
The team takes one good run from the Spike and saves five to ten real inputs alongside the outputs they accepted as good. That’s the golden set. They add a daily human gate: ten minutes reviewing new runs against the bar. They capture the failure modes, not just the wins. (This is the how underneath the Automate step from DEAL. Naming what to automate is half the work. Learning what good looks like is the other half.)
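In practice a golden set is unglamorous: a small file of real inputs and accepted outputs, plus a gate that forces a human to look every day. A minimal sketch; the file layout and field names are my assumptions, not a standard:

```python
# golden_set.py -- the Specify step as a small file plus a ten-minute ritual.
import json
from pathlib import Path

GOLDEN = Path("golden_set.jsonl")       # one record per line; the layout is an assumption
FAILURES = Path("failure_modes.jsonl")  # failures get captured, not just wins

def add_example(real_input: str, accepted_output: str, notes: str = "") -> None:
    """Save a real input next to the output a human accepted as good."""
    record = {"input": real_input, "accepted_output": accepted_output, "notes": notes}
    with GOLDEN.open("a") as f:
        f.write(json.dumps(record) + "\n")

def daily_gate(todays_runs: list[dict]) -> None:
    """Ten minutes a day: a human reviews new runs against the bar and names what failed."""
    with FAILURES.open("a") as f:
        for run in todays_runs:  # each run is {"input": ..., "output": ...}
            print("--- input ---\n", run["input"][:400])
            print("--- output ---\n", run["output"][:400])
            if input("Good enough? [y/n] ").strip().lower() != "y":
                run["why"] = input("What failed? ")
                f.write(json.dumps(run) + "\n")
```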
The leader’s job is the review ritual: a fifteen-minute weekly meeting where the team shows the golden set, not the dashboard. Vanity metrics: tokens consumed, seats deployed, prompts issued, dashboard views. Actionable metrics: workflow success rate on real inputs and review time reduced. And visible air cover: the team is allowed to spend time on this, not just on shipping.
This is where most enterprise AI programs die. The leader loves the Spike demo, skips the Specify ritual, and the learning evaporates by the next quarter. The dashboard keeps growing. The validated learning never accumulates. What looks like an AI failure isn’t a technology problem. It’s a governance problem. And both books named it years before the AI era began.
3. Settle. Graduate from explore to exploit.
Now the team runs the same inputs through the next tier down. If the outputs hold against the golden set, settle there. Persevere, optimized. If the outputs drop, you’ve located the part of the workflow that needs the frontier. Split the pipeline at that seam. Pivot, with precision.
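Mechanically, Settle is the golden set from Specify replayed on a cheaper tier. A minimal sketch, continuing the assumed file layout from above; the model name is a placeholder, and the “holds the bar” check is deliberately a human call rather than an automated diff:

```python
# settle.py -- replay the golden set on the next tier down and find the seam.
import json
import anthropic

client = anthropic.Anthropic()
CHEAPER_MODEL = "claude-3-5-haiku-latest"  # placeholder for whatever the next tier down is

def replay_golden_set(prompt_template: str) -> None:
    with open("golden_set.jsonl") as f:  # same assumed layout as the Specify sketch
        examples = [json.loads(line) for line in f]

    seam = []  # inputs where the cheaper tier drops below the bar
    for ex in examples:
        response = client.messages.create(
            model=CHEAPER_MODEL,
            max_tokens=2000,
            messages=[{"role": "user", "content": prompt_template.format(doc=ex["input"])}],
        )
        print("--- accepted output ---\n", ex["accepted_output"])
        print("--- cheaper tier ---\n", response.content[0].text)
        if input("Holds the bar? [y/n] ").strip().lower() != "y":
            seam.append(ex)

    # Whatever lands in `seam` is the part of the workflow that still needs the frontier.
    print(f"{len(seam)} of {len(examples)} inputs still need the stronger tier.")
```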
A developer named Pawel ran exactly this, in public, with a personal AI agent. Started everything on Opus. By Friday he was at 70 to 80% of his weekly Claude Max limit. He swapped the default to Haiku for around 95% of the agent’s tasks (the structured, precise, follow-the-checklist work), and left the heavier reasoning for Sonnet and Opus. Weekly limit usage dropped to about 40%. Same output. The title of his post is the punchline: “And it got better.”
What the leader runs is the graduation handoff. The experiment leaves the explore budget and enters the exploit budget with the golden set, the failure modes, and the runtime economics attached. Lean Enterprise warns about this handoff most loudly. Most orgs fumble it. Either pilots stay pilots forever (no graduation), or they scale without the learning (the golden set gets left behind and the next team rebuilds it from scratch). The handoff ritual is small. The discipline lives in the handoff.
When the discipline is in place, the economics show up on their own. Notion published a 90% cost reduction and up to 85% lower latency from prompt caching across their multi-model architecture. dbt Labs is cited in the same case study as saving over $35,000 a year by consolidating onto Notion AI instead of buying additional tools. The caching is just a technique. The discipline behind it (knowing which workloads are routine enough to cache) is the leverage.
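For what it’s worth, the caching technique itself is a small change. Here is a minimal sketch against Anthropic’s Messages API (this is not Notion’s actual architecture, which the case study doesn’t spell out at this level): the long, stable part of the prompt is marked cacheable so repeated runs of the routine workload reuse it instead of paying for it each time.

```python
# caching_sketch.py -- mark the long, stable part of a routine prompt as cacheable.
import anthropic

client = anthropic.Anthropic()

# The instructions and reference material change rarely; the document changes per run.
STABLE_INSTRUCTIONS = open("triage_instructions.txt").read()  # placeholder file

def run_routine_task(document_text: str) -> str:
    response = client.messages.create(
        model="claude-3-5-haiku-latest",  # placeholder: the tier you settled on
        max_tokens=1000,
        system=[{
            "type": "text",
            "text": STABLE_INSTRUCTIONS,
            # Cache only the stable prefix. It has to be long enough to qualify for
            # caching, so short prompts won't see the benefit.
            "cache_control": {"type": "ephemeral"},
        }],
        messages=[{"role": "user", "content": document_text}],  # changes every run
    )
    return response.content[0].text
```

Which is the point: the code is trivial. Knowing which workloads are routine enough to deserve it is the part the discipline buys you.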
The mail on my kitchen table
The point of doing one of these loops yourself isn’t that your weekends should involve AI. It’s that if I can’t describe what I learned from parsing my own mail, I can’t ask my team to describe what they learned from their pilot.
About a year ago I started a small home automation project. Real envelopes: utility bills, school flyers, statements, the junk that piles up by the door. Plus the PDFs my scanner spits out. The first weekend, I threw the entire stack at the strongest model I had access to back then. By today’s standards, an objectively dumber model. No prompt engineering, just to find out what was AI-doable. It was, barely. Then I built the golden set from real mail and stepped down to a mid-tier model.
Two failure modes surfaced at the cheaper tier, both of which I would not have invented as synthetic tests. A utility bill with a handwritten note from my wife on it. The cheaper tier missed the annotation entirely, even though the note changed the action I needed to take. A tax form mailed inside another envelope. The cheaper tier conflated the two documents and routed both as the wrapping mail. Five real inputs caught these. Fifty made-up examples wouldn’t have.
A year on, the workflow still runs. The models have all gotten stronger. The discipline is the same. The model is the cheapest part of the whole thing.
Those ten minutes I spend writing down what didn’t work are the most valuable part of my week on this project. The parts that work tell me the system is alive. The failures tell me where the seam is.
Where Lean stops working
Not every workflow downgrades cleanly. Long-context reasoning, multi-step planning, and deep code review on a real codebase often need to stay on the frontier. The pattern is downgrade where you can, not downgrade always. Settle is conditional, not automatic.
Latency and privacy bound model choice before cost does. Sometimes the right model is neither the cheapest nor the strongest. The framework assumes cost is the active constraint. If it isn’t, you’re solving a different problem first.
And Lean fits the explore side of the AI question, not the vision side. Ted Ladd’s HBR research from 2016 on 250 cleantech accelerator teams found something worth holding onto: more market tests don’t beat strong strategy. Whether to build an AI capability at all (the vision question) needs conviction. How to turn today’s models into useful workflows (the explore question) needs Lean discipline. This post is about the second. The first is a different book.
The discipline you install
The next generation of leaders worth following won’t be the ones who picked the right AI vendor. They’ll be the ones who installed a learning discipline in their teams before the market forced them to. If you’re asking these questions now, you’re ahead of most of the market and ahead of most of your peers. The K-shaped curve I wrote about earlier this year isn’t really about who has the best tools. It’s about who built this muscle before the people next to them did.
Validated learning is the executive discipline for AI adoption.
Cost is downstream of learning.
Model choice is downstream of both.
So here is the question I’d leave you with. Actually, two questions. One for your team, one for you.
For your team this week: ask them what the golden set is for the thing they’re trying to automate. Not the dashboard. Not the weekly demo. The five real inputs and the outputs the team agreed were good. If the answer is “we don’t have one yet,” that’s where to start. Protect the time for them to build it before anything else.
For yourself, optionally: if you haven’t run one of these loops personally in the last three months, pick something mundane of your own. Receipts. Mail. A weekly report. Run the spike yourself, so that when your team shows you their golden set, you can tell whether it’s real.
#Leadership #AI #FutureOfWork #LeanStartup #ValidatedLearning #AIAdoption #EngineeringLeadership


