Thin Harness, Fat Skills: The True Source of 100x AI Productivity
Author: Garry Tan
Translation: Peggy, BlockBeats
Original source:
Repost: Mars Finance
Editor’s note: When “more powerful models” has become the industry’s default answer, this article offers a different perspective: the factor that actually creates a 10x, 100x, or even 1,000x productivity gap is not the model itself, but the whole system designed around it.
The author, Garry Tan, is President and CEO of Y Combinator and has long focused on AI and the early-stage startup ecosystem. He proposes the “thin harness, fat skills” framework, breaking AI applications down into components such as skills, the operational harness, context routing, task division, and knowledge compression.
Under this system, the model is no longer the whole of the capability but merely an execution unit within it; what really determines output quality is how you organize context, refine processes, and draw the boundary between “judgment” and “computation.”
More importantly, the approach is not just conceptual; it has been validated in real scenarios. Faced with data-processing and matching tasks for thousands of founders, the system runs a “read, organize, judge, write back” cycle that approaches the capability of a human analyst and keeps improving itself without any code being rewritten. This “learning system” turns AI from a one-off tool into infrastructure with compounding returns.
The core reminder of the article is therefore clear: in the AI era, the productivity gap no longer depends on whether you use the most advanced model, but on whether you have built a system that continuously accumulates capability and evolves on its own.
Below is the original text:
Steve Yegge said that with AI coding agents, “efficiency is 10 to 100 times that of engineers who write code with only Cursor and chat tools, and roughly 1,000 times that of a Google engineer circa 2005.”
This is not an exaggeration. I have seen and experienced it firsthand. But when people hear such a gap, they often attribute it incorrectly: to more powerful models, smarter Claude, more parameters.
In reality, the person who improves efficiency by 2x and the one who improves by 100x are using the same set of models. The difference is not in “intelligence,” but in “architecture,” and this architecture is simple enough to be written on a card.
The harness (operational framework) is the product itself.
On March 31, 2026, Anthropic unexpectedly released the full source code of Claude Code on npm—totaling 512k lines. I read through it. It confirmed what I’ve been saying at YC: the real secret isn’t in the model, but in “the layer wrapping the model.”
Real-time code repository context, prompt caching, tools designed for specific tasks, compressing redundant context as much as possible, structured session memory, parallel sub-agents—these don’t make the model smarter. But they provide the model with the “right context at the right time” and prevent irrelevant information from overwhelming it.
This “wrapping layer” is called the harness (operational framework). The key question all AI builders should ask is: what should go into the harness, and what should stay outside?
The answer is quite specific—I call it: thin harness, fat skills.
Five Definitions
Bottlenecks are never in the model’s intelligence. Models already know how to reason, synthesize information, and write code.
They fail because they don’t understand your data—your schema, your conventions, what the problem specifically looks like. The following five definitions are precisely designed to solve this problem.
A skill file is a reusable markdown document that teaches the model “how to do a task.” Note, it doesn’t tell the model “what to do”—that’s provided by the user. The skill file provides the process.
The key point most overlook is: a skill file is like a method call. It can accept parameters. You can invoke it with different parameters. The same process, with different inputs, can demonstrate vastly different capabilities.
For example, there’s a skill called /investigate. It comprises seven steps, including: define the data scope, build a timeline, diarize each document, synthesize and summarize, argue both sides, and cite sources. It takes three parameters: TARGET, QUESTION, and DATASET.
If you point it at a security scientist and 2.1 million forensic emails, it becomes a medical research analyst, judging whether a whistleblower was suppressed.
If you point it at a shell company and filings from the U.S. Federal Election Commission (FEC), it becomes a legal investigation agent, tracking coordinated political donations.
It’s the same skill. Same seven steps. Same markdown file. The skill describes a judgment process, but what makes it real in the world are the parameters passed during invocation.
This isn’t prompt engineering; it’s software design—using markdown as a programming language, human judgment as the runtime environment. In fact, markdown may be better suited than rigid source code for encapsulating capabilities because it describes processes, judgments, and context—precisely what models “understand” best.
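To make the idea concrete, a skill file for the /investigate example above might look something like the sketch below. The front-matter fields and the {PARAM} placeholder syntax are invented for illustration; real harnesses each have their own conventions for declaring a skill’s name, description, and parameters.

```markdown
---
name: investigate
description: Structured investigation of a target against a document set
parameters: TARGET, QUESTION, DATASET
---

1. Define the data scope: which parts of {DATASET} bear on {TARGET}.
2. Build a timeline of events relevant to {QUESTION}.
3. Diarize each document into a short structured profile.
4. Synthesize the profiles and summarize what they show.
5. Argue the strongest case for and against each conclusion.
6. Cite sources for every claim in the final write-up.
```

Note that nothing here is executable: the “program” is a judgment process, and the model plus a human reviewer are the runtime.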
Harness is the layer of code that drives the LLM. It does four things: run the model in a loop, read/write your files, manage context, and enforce safety constraints.
That’s it. That’s “thin.”
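The four responsibilities can be sketched in a few dozen lines of Python. Everything here (the message format, the action dict, the tool registry) is invented to make the loop concrete, and the `model` callable is a stub standing in for a real LLM call:

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Harness:
    """A deliberately thin harness: loop the model, expose a few tools,
    trim context, and enforce safety constraints. The message format and
    tool protocol are invented for this sketch."""
    model: Callable[[list], dict]          # stub standing in for an LLM call
    tools: dict = field(default_factory=dict)
    max_context: int = 20                  # manage context: keep recent turns
    allow_writes: bool = False             # safety: read-only by default

    def run(self, task: str, max_steps: int = 10) -> str:
        messages = [{"role": "user", "content": task}]
        for _ in range(max_steps):
            messages = messages[-self.max_context:]      # context management
            action = self.model(messages)                # run the model in a loop
            if action["type"] == "final":
                return action["content"]
            name = action["name"]
            if name.startswith("write_") and not self.allow_writes:
                result = "error: writes are disabled"    # safety constraint
            else:
                result = self.tools[name](*action.get("args", []))
            messages.append({"role": "tool", "content": str(result)})
        return "stopped: step budget exhausted"
```

The read-only default matters: the harness, not the skill, is where destructive actions get gated.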
The opposite pattern is: fat harness, thin skills.
You’ve probably seen this pattern: more than 40 tool definitions whose descriptions alone consume half the context window; a single all-in-one God-tool that costs 2 to 5 seconds per MCP round trip; or every REST API endpoint wrapped as its own tool. The result: token usage triples, latency triples, and the failure rate triples.
The ideal approach is to use purpose-built, fast, narrow-function tools.
For example, compare a Playwright CLI where each browser operation takes about 100 milliseconds against a Chrome MCP that needs 15 seconds to screenshot, find, click, wait, and read. The former is 75 times faster.
Modern software no longer needs to be “overly refined and bulky.” Your goal should be: build only what you truly need, and nothing more.
A resolver is essentially a context routing table. When task type X appears, it prioritizes loading document Y. Skills tell the model “how to do”; resolvers tell the model “when to load what.”
For example, a developer modifies a prompt. Without a resolver, they might simply deploy after editing. With a resolver, the model first reads docs/EVALS.md, which says: run the evaluation suite and compare scores before and after; if accuracy drops by more than 2%, roll back and investigate. The developer may not even know the evaluation suite exists; it’s the resolver that loads the right context at the right moment.
Claude Code has a built-in resolver. Each skill has a description field, and the model automatically matches user intent with the skill’s description. You don’t need to remember if /ship exists—the description itself is the resolver.
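A minimal sketch of a resolver as a routing table, in Python. A real resolver would let the model itself match user intent against each description; plain keyword overlap keeps this sketch deterministic. The document paths and descriptions are made up:

```python
import re

# Context routing table: each document carries a short description of when
# it should be loaded (hypothetical paths and descriptions).
ROUTING_TABLE = {
    "docs/EVALS.md":  "prompt change evaluation suite accuracy rollback",
    "docs/DEPLOY.md": "deploy release rollout staging",
    "docs/SCHEMA.md": "database schema migration tables",
}

def resolve(task: str, table: dict = ROUTING_TABLE, top_k: int = 1) -> list:
    """Return the documents whose descriptions best overlap the task text."""
    words = set(re.findall(r"[a-z]+", task.lower()))
    scored = sorted(
        ((len(words & set(desc.split())), doc) for doc, desc in table.items()),
        reverse=True,
    )
    return [doc for score, doc in scored[:top_k] if score > 0]
```

The developer never has to remember that docs/EVALS.md exists; the description does the remembering.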
Honestly, my previous CLAUDE.md was 20k lines long, packed with quirks, patterns, lessons learned. It was absurd. The model’s attention quality declined noticeably. Claude Code even made me cut it down.
The final fix is about 200 lines—just a few document pointers. The resolver loads the needed document at critical moments, keeping the knowledge base accessible without polluting the context window.
In your system, each step is either in the latent space or the deterministic space. Confusing the two is the most common mistake in agent design.
·Latent space is where intelligence resides. The model reads, understands, judges, and decides here. It handles reasoning, synthesis, pattern recognition.
·The deterministic space is where reliability resides. The same input always yields the same output. SQL queries, compiled code, and arithmetic all belong here.
An LLM can help you seat 8 people while weighing personalities and social ties. But ask it to seat 800 and it may produce a “plausible but completely wrong” seating chart, because that is no longer a latent-space problem but a deterministic combinatorial-optimization problem.
The worst systems blur the boundary, misplacing work on both sides. The best systems clearly delineate the boundary.
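Using the seating example, here is one way the boundary might be drawn, as a sketch. The latent-space step (reading a profile and judging its topic cluster) is stubbed out; the deterministic step, packing guests into tables with clusters spread apart, is ordinary code that behaves identically at 8 guests or 800:

```python
def judge_cluster(guest: dict) -> str:
    """Latent-space step (stubbed): in a real system the model would read
    the guest's full profile and return a topic cluster."""
    return guest["interest"]

def seat(guests: list, table_size: int = 8) -> list:
    """Deterministic step: interleave clusters round-robin, then chunk
    into tables, so no table is dominated by a single topic."""
    by_cluster: dict = {}
    for g in guests:
        by_cluster.setdefault(judge_cluster(g), []).append(g["name"])
    ordered, pools = [], list(by_cluster.values())
    while any(pools):
        for pool in pools:
            if pool:
                ordered.append(pool.pop(0))
    return [ordered[i:i + table_size] for i in range(0, len(ordered), table_size)]
```

The model supplies judgment per guest; the packing is never asked of it, so it cannot hallucinate a seating chart.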
Diarization is the key step that truly makes AI valuable for real-world knowledge work.
It means: the model reads all materials related to a topic, then writes a structured profile. Summarize judgments from dozens or hundreds of documents onto one page.
This isn’t something SQL queries can produce. Nor can RAG pipelines. The model must actually read, hold conflicting information in mind, notice what changes and when, then synthesize this into a structured intelligence.
It’s the difference between database queries and analyst reports.
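A sketch of what a diarization output might hold, with invented fields. The point is the shape of the artifact: one structured profile per subject, folded together from many documents, with the “says vs. does” gaps surfaced explicitly. In a real system the model does the reading; here the documents arrive pre-labelled:

```python
from dataclasses import dataclass, field

@dataclass
class Profile:
    subject: str
    timeline: list = field(default_factory=list)   # (date, event) pairs
    claims: list = field(default_factory=list)     # what the subject says
    behavior: list = field(default_factory=list)   # what the subject does

    def add_document(self, doc: dict) -> None:
        """Fold one document into the profile."""
        self.timeline.append((doc["date"], doc["event"]))
        self.claims.extend(doc.get("claims", []))
        self.behavior.extend(doc.get("behavior", []))

    def conflicts(self) -> list:
        """Surface statement-vs-behavior gaps, the signal that no keyword
        search or embedding lookup would return."""
        return [(c, b) for c in self.claims for b in self.behavior
                if c["topic"] == b["topic"] and c["value"] != b["value"]]
```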
The architecture
These five concepts can be combined into a very simple three-layer architecture:
·Top layer: Fat skills—markdown-written processes carrying judgment, methodology, domain knowledge. 90% of value is here.
·Middle layer: Thin CLI harness—about 200 lines of code, input JSON, output text, read-only by default.
·Bottom layer: Your application infrastructure—QueryDB, ReadDoc, Search, Timeline—deterministic foundational tools.
The core principle is directional: push “intelligence” upward into skills; push “execution” downward into deterministic tools; keep the harness lightweight.
The result: whenever model capabilities improve, all skills automatically strengthen; the deterministic infrastructure remains stable and reliable.
Learning systems
Now I’ll demonstrate how these five definitions work together with a real system we’re building at YC.
July 2026, Chase Center. Startup School has 6,000 founders. Each has structured application materials, questionnaire responses, transcripts of 1:1 conversations with mentors, and public signals: posts on X, GitHub commits, Claude Code usage logs (showing development speed).
Traditional approach: a 15-person team reads applications one by one, relies on intuition, and updates a spreadsheet.
This works at 200 people but breaks at 6,000. No human can hold that many profiles in mind and notice that the three best candidates for an infrastructure group are a devtools founder in Lagos, a compliance founder in Singapore, and a CLI-tool builder in Brooklyn, each describing the same pain point in completely different words across different 1:1s.
The model can do this. The method:
Enrichment
A skill called /enrich-founder pulls every data source, performs enrichment and diarization, and highlights the gap between “what the founder says” and “what the founder actually does.”
The deterministic system handles the rest: SQL queries, GitHub data, browser tests of demo URLs, social signals, CrustData queries, and so on. It runs once a day, so the profiles of all 6,000 founders stay current.
Diarization outputs capture insights that keyword search cannot find.
Spotting a “statement vs. actual behavior” gap requires reading GitHub history, application materials, and conversation transcripts at the same time and integrating them in working memory. No embedding-similarity search or keyword filter can do this; the model must read everything and judge. (This is exactly what belongs in latent space!)
Matching
This is where “skills = method calls” shine.
The same matching skill, invoked three times, can produce different strategies:
/match-breakout: cluster 1,200 people by domain into groups of 30 (embeddings plus deterministic assignment)
/match-lunch: handle 600 people with cross-domain “serendipitous matches,” 8 per table, no repeats; generate topics first, then seat accordingly
/match-live: match participants on-site in real time, one-to-one, within 200 ms, excluding people they have already met
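The “skill = method call” idea can be sketched mechanically: one skill body, parameterised per invocation. Here the skill is reduced to a string template so the dispatch is concrete; in the real system the body is a markdown file the model reads, and the parameters carry the event-specific constraints. Both the template text and parameter names are invented:

```python
import string

# One matching skill, reduced to a template (the body is hypothetical).
MATCH_SKILL = string.Template(
    "Match $POPULATION people into groups of $GROUP_SIZE. "
    "Strategy: $STRATEGY. Constraint: $CONSTRAINT."
)

def invoke(skill: string.Template, **params) -> str:
    """Invoke a skill with parameters, like calling a method."""
    return skill.substitute(params)

breakout = invoke(MATCH_SKILL, POPULATION=1200, GROUP_SIZE=30,
                  STRATEGY="cluster by domain",
                  CONSTRAINT="embedding plus deterministic assignment")
lunch = invoke(MATCH_SKILL, POPULATION=600, GROUP_SIZE=8,
               STRATEGY="cross-domain serendipity",
               CONSTRAINT="no repeat tablemates; topics generated first")
```

Same skill, same process; only the arguments change, which is exactly what makes each invocation behave like a different product.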
The model can also make judgments beyond traditional clustering:
“Santos and Oram are both in AI infrastructure but not competitors—Santos does cost attribution, Oram handles orchestration. They should be grouped.”
“Kim applied as a developer tool, but 1:1 chats show he’s working on SOC2 compliance automation. Reclassify to FinTech / RegTech.”
Such reclassification can’t be captured by embeddings alone. The model must read the full profile.
Learning cycle
After each event, an /improve skill reads the NPS feedback, diarizes the “okay” responses (not the bad reviews, but the “almost there” ones) and extracts patterns.
It then updates rules and rewrites the matching skill:
If a participant mentions “AI infrastructure” but 80% of their code is billing modules: → classify as FinTech, not AI Infra
If two people in the same group already know each other: → lower matching weight, prioritize new relationships
These rules are written back into the skill files and take effect on the next run. The skills rewrite themselves. In July, “okay” ratings were 12%; at the next event, they were down to 4%.
The skill files learn what “okay” means, and the system improves without manual code rewriting.
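The write-back step is plain string manipulation once the rules exist. In this sketch, `extract_rules` stands in for the model’s latent-space pass over the feedback (it just pattern-matches one canned case), and `update_skill` is the deterministic part that appends a rules section to the skill file. The rule wording and file layout are invented:

```python
def extract_rules(feedback: list) -> list:
    """Stand-in for the model's pass over 'okay' feedback."""
    rules = []
    for item in feedback:
        if "already knew" in item["comment"]:
            rules.append("If two people in a group already know each other, "
                         "lower their matching weight and prioritize new relationships.")
    return rules

def update_skill(skill_text: str, rules: list) -> str:
    """Deterministic write-back: append learned rules to the skill file
    so they take effect on the next run."""
    if not rules:
        return skill_text
    block = "\n## Learned rules\n" + "\n".join(f"- {r}" for r in sorted(set(rules)))
    return skill_text + block
```

Because the rules land in the skill file rather than in code, the next run picks them up with no deployment at all.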
This pattern can be applied to any domain:
Retrieve → Read → Diarize → Count → Synthesize
Then: Research → Investigate → Diarize → Rewrite skill
If you ask what the most valuable cycle in 2026 is, it’s this one. It can be applied to nearly all knowledge work scenarios.
Skills are permanently upgradeable
Recently, I posted an instruction set for OpenClaw on X, and it received more than a thousand likes and over two thousand saves.
Many thought it was prompt engineering.
But it’s not. It’s the architecture described above. Every skill you write is a permanent upgrade to the system. It doesn’t degrade or forget. It runs automatically at 3 a.m. And when the next-generation models are released, all skills instantly become stronger—the judgment in latent space improves, while the deterministic parts remain stable and reliable.
That’s the source of Yegge’s 100x efficiency.
Not smarter models, but: fat skills, thin harness, and the discipline to solidify everything into capabilities.
The system compounds growth. Build once, run long-term.