The Goodwill Principle
Years ago, a friend shared a fact about Goodwill Industries that stopped me short. Not because it was scandalous. Because it was surprisingly elegant.
Inside Goodwill's sheltered workshops - where they employed people with a wide range of cognitive and physical disabilities - tasks weren't assigned randomly. They were organized by the minimum intelligence required to perform them successfully. Workers were assessed and matched to work that aligned with their functional capability. Historical manuals from Goodwill affiliates in the 1970s explicitly referenced intelligence ranges: roles requiring a "dull-normal range or better" (roughly IQ 80–90+) went to some workers, while simpler, more repetitive tasks went to others who needed more support.
The logic was sound. Don't waste capability on work that doesn't require it. Don't assign work beyond someone's ability to handle it. This allowed Goodwill to put tens of thousands of people to work who would otherwise have had no path to employment, by ensuring that every task in their ecosystem had a clearly understood cognitive floor.
Nobody would have assigned their highest-functioning workers to stuff envelopes all day. That would have been a waste of potential, capability, and resources.
This same principle - largely forgotten in its original context - has become one of the most expensive unsolved problems in modern AI.
The dominant attitude today is that everything requiring inference should use the latest, most intelligent model available. Need to extract a date from a document? Top-tier model. Writing a Git commit message? Top-tier model. Summarizing a support ticket? Top-tier model. Organizations are spending thousands to millions of dollars per month on inference, with a massive and largely invisible portion of that spend going toward intelligence the task simply does not require.
This is the AI equivalent of deploying your most capable people to stuff envelopes.
It's not just wasteful. It's a symptom of something deeper. We have no widely adopted discipline for evaluating the minimum intelligence that any given task actually requires. Smaller, faster, cheaper models are remarkably capable at tasks with well-defined structure, limited ambiguity, and narrow output requirements. Summarizing your emails. Reformatting data between systems. Drafting a first-pass reply to a support ticket. Extracting a date from a contract. Routing an inbound request to the right team. These are envelope-stuffing tasks. They don't need a PhD. They need reliability, speed, and low cost.
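To make that concrete, here is a minimal sketch of what routing by task structure might look like, assuming a hypothetical three-tier lineup (small, mid, frontier). The task names and tiers are placeholders, not any vendor's catalog.

```python
# Hypothetical routing table: well-structured, narrow-output tasks get the
# smallest tier; anything that needs open-ended judgment escalates.
MODEL_TIER_BY_TASK = {
    "extract_contract_date":    "small",    # narrow, verifiable output
    "reformat_record":          "small",    # deterministic transformation
    "summarize_support_ticket": "small",    # short input, bounded output
    "route_inbound_request":    "small",    # closed set of labels
    "draft_ticket_reply":       "mid",      # some judgment, but a human reviews it
    "draft_customer_apology":   "frontier", # ambiguous, high stakes, no safety net
}

def pick_model(task_type: str) -> str:
    """Return the cheapest tier known to handle this task reliably.
    Unprofiled tasks default to the frontier tier until they've been evaluated."""
    return MODEL_TIER_BY_TASK.get(task_type, "frontier")
```

The point isn't the specific table; it's that an unknown task defaults to the expensive tier, and every task you profile and demote is recurring spend you get back.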
Yet the tooling, the culture, and the frameworks to make this distinction confidently and systematically barely exist.
Organizations don't overspend on inference because they need more intelligence. They overspend because they have no framework for knowing how much intelligence they need.
What would it look like to borrow Goodwill's logic? The core question is deceptively simple: what is the minimum model capability required to complete this task reliably? Does the task have a well-defined correct answer, or does it require judgment across ambiguous inputs? Is the expected output narrow and verifiable, or open-ended and evaluative? What happens when the model gets it wrong - does a human catch it, or does the error propagate silently? Does it require integrating information across a long, complex context, or is it essentially stateless? Does it require multi-step reasoning, or is it a pattern match?
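One way to make those questions operational is a simple rubric: score each task on those dimensions and map the total to a minimum capability tier. The axes, weights, and thresholds below are illustrative assumptions, not a validated benchmark; they would need calibration against your own evals.

```python
from dataclasses import dataclass

@dataclass
class TaskProfile:
    # Each dimension scored 0 (low demand) to 2 (high demand).
    ambiguity: int            # well-defined correct answer (0) vs. judgment call (2)
    output_openness: int      # narrow and verifiable (0) vs. open-ended, evaluative (2)
    error_blast_radius: int   # a human catches mistakes (0) vs. silent propagation (2)
    context_integration: int  # essentially stateless (0) vs. long, cross-referenced context (2)
    reasoning_depth: int      # pattern match (0) vs. multi-step reasoning (2)

def cognitive_floor(task: TaskProfile) -> str:
    """Map a task profile to the minimum model tier worth trying first.
    The thresholds are assumptions to calibrate, not magic numbers."""
    score = (task.ambiguity + task.output_openness + task.error_blast_radius
             + task.context_integration + task.reasoning_depth)
    if score <= 3:
        return "small"
    if score <= 6:
        return "mid"
    return "frontier"

# Extracting a date from a contract: low on every axis except error visibility.
print(cognitive_floor(TaskProfile(0, 0, 1, 0, 0)))  # -> "small"
```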
These aren't theoretical questions. They're the dimensions of a cognitive floor. Organizations that map them systematically will find that a large proportion of their inference workload sits well below the capability ceiling of the models they're currently throwing at it.
What makes this harder than it sounds is that many tasks feel cognitively complex when they're actually just poorly defined. A task that seems to demand a top-tier model is often two or three simpler tasks collapsed together, none of which individually require that level of intelligence. The work isn't complex. It just lacks structure.
This is where the most capable models earn their keep - not as permanent workhorses, but as tools for decomposition. Use them to break ambiguous work into well-defined sub-tasks that smaller, cheaper models can handle reliably. Then get out of the way. The same way a human shouldn't be doing repetitive work that can be automated, a top-tier reasoning model shouldn't be running tasks that only require pattern matching. Ongoing reliance on the most powerful model is a failure of task design, not a reflection of task difficulty.
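As a sketch of that division of labor, assuming a placeholder call_model function standing in for whatever inference client you use (the model names and prompts are likewise assumptions): the capable model is called once to decompose the work, and the resulting sub-tasks run on a small model for every subsequent request.

```python
import json

def call_model(model: str, prompt: str) -> str:
    """Placeholder for your inference client; returns the model's text output."""
    raise NotImplementedError

def decompose(task_description: str) -> list[str]:
    # One expensive call, made at design time rather than per request:
    # the capable model turns ambiguous work into well-defined sub-tasks.
    prompt = (
        "Break the following task into a JSON array of small, unambiguous "
        f"sub-tasks with narrow, verifiable outputs:\n\n{task_description}"
    )
    return json.loads(call_model("frontier-model", prompt))  # assumes JSON output

def execute(sub_tasks: list[str], payload: str) -> list[str]:
    # Many cheap calls: each well-defined sub-task runs on a small model.
    return [call_model("small-model", f"{step}\n\nInput:\n{payload}")
            for step in sub_tasks]
```

The economics follow from the shape of the pipeline: the frontier call is amortized across every future request, while the per-request cost is set by the small model.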
Goodwill's workshop model was brilliant. The operational insight embedded in it was ahead of its time in ways nobody anticipated. Work has a cognitive floor. Matching capability to requirement is both efficient and respectful of resources, whether those resources are human potential or inference spend.
The organizations that internalize this won't just save money. They'll build a compounding advantage: faster pipelines, more predictable outputs, and the freedom to deploy their most capable models where capability actually matters.
Don't apply more intelligence than the task requires. Don't apply less. Knowing the difference is the hard part. But it's the part worth solving.