Local Models Within Reach: Everything That Changed in Eight Months

Apr 5, 2026

Eight months ago I published Building Agents for Small Language Models, a set of hard-won notes from shipping agents on 270M–32B parameter models. At the time, running useful local models meant embracing constraints: small context windows, CPU-only fallbacks, broken UTF-8 streams, and reasoning that fell apart past two steps.

I stand by that post. But the ground has shifted fast. What was a set of careful workarounds in August 2025 is starting to look like the default architecture for a large class of workloads. Local models are no longer the constrained sibling of cloud APIs — for many agent use cases, they are the better answer. Here is what has changed.

The models got dramatically better

The open-weights frontier has caught up with where proprietary labs were roughly a year ago, and the models are now shaped for the hardware most people already own.

Gemma 4. Google rolled out its latest open-weights family, spanning 2B to 31B parameters. The 26B variant in particular is being widely described as a “perfect local model” — small enough to fit in a high-end laptop’s unified memory, large enough to handle the reasoning and tool calling I was still fighting to get reliably out of a 7B last summer. That I can write that sentence without heavy qualification is itself a milestone.

Qwen3.5-35B-A3B. Qwen has been leaning hard into Mixture-of-Experts, and this release is the clearest signal yet that MoE is the right shape for local deployment: 35B parameters on disk, only 3B active per token. You still need enough memory to hold all 35B weights, but each token only has to read about 3B of them, so you get the quality ceiling of a much larger model with the memory-bandwidth and throughput profile of a tiny one. Below the frontier, the economics of local inference are now MoE-shaped.
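A back-of-envelope sketch of why the active-parameter count matters so much. The 4-bit weight width and the 400 GB/s memory bandwidth are assumptions chosen for illustration, not measurements of any particular machine, and the model ignores the KV cache, activations, and runtime overhead. It treats decoding as purely bandwidth-bound, so the numbers are upper bounds rather than benchmarks.

```python
def weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the weights in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8


def decode_tokens_per_s(active_params_billion: float,
                        bits_per_weight: float,
                        bandwidth_gb_s: float) -> float:
    """Upper bound on decode speed if every active weight is read once per token."""
    return bandwidth_gb_s / weight_gb(active_params_billion, bits_per_weight)


BITS = 4            # assumed quantization width
BANDWIDTH = 400.0   # assumed memory bandwidth in GB/s (hypothetical machine)

# Memory capacity is set by TOTAL parameters...
moe_footprint = weight_gb(35, BITS)

# ...but per-token memory traffic is set by ACTIVE parameters.
moe_speed = decode_tokens_per_s(3, BITS, BANDWIDTH)
dense_speed = decode_tokens_per_s(35, BITS, BANDWIDTH)

print(f"35B weights at {BITS}-bit: {moe_footprint:.1f} GB to hold in memory")
print(f"3B active:  up to {moe_speed:.0f} tokens/s")
print(f"35B dense:  up to {dense_speed:.0f} tokens/s")
print(f"speed ratio: {moe_speed / dense_speed:.1f}x")
```

Under these assumptions the speed ratio is simply 35/3, whatever bandwidth you plug in: the MoE pays the memory cost of the big model once and the bandwidth cost of the small one on every token.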

The cohort of models you can actually run on hardware under your desk has stopped being a curiosity and started being the serious option.

The memory market came back to earth

“Just buy more RAM” wasn’t a viable answer through 2025 because you simply couldn’t: prices were elevated, lead times were long, and the AI capex cycle had drained the pipeline. That cycle is visibly unwinding. OpenAI has called off its big RAM orders, DRAM prices are normalizing, and memory stocks are reflecting the shift.

Every dollar off the cost of a 64GB or 128GB machine is a dollar closer to a world where running a 26B dense or 35B MoE model locally is a default developer-workstation build, not a specialty one. The case for renting tokens from a hyperscaler gets weaker every quarter that the hardware to generate them yourself gets cheaper.

Performance engineering is cool again

For most of 2024 and 2025, the answer to any inference problem was: throw more compute at it. That era is ending. I have been writing about this undercurrent for a while — from squeezing AlphaFold’s triangle multiplicative update down to 4ms on an H100, to reaching for Gluon when Triton isn’t low-level enough, to the AMD side of Gluon and multi-GPU work on AMD via Iris, to the bit-exact CUDA FFT port to Mojo and the floating-point nondeterminism that haunts all of it. The same thread runs through all of them: the big wins are no longer hiding in larger models, they are hiding in the details of how we execute the ones we already have.

TurboQuant just landed, extending a broader pattern of aggressive quantization work that keeps eating into the accuracy gap between full-precision and heavily compressed models. Each generation of quantization research meaningfully raises the quality-per-gigabyte of whatever checkpoint you already have.
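To make the size-versus-accuracy trade concrete, here is a minimal sketch of the simplest possible scheme: symmetric absmax round-to-nearest with a single shared scale. This is not TurboQuant or any published method, and the weight values are made up for illustration. Modern quantizers exist precisely to shrink the error this naive baseline produces at the same bit width.

```python
def quantize_absmax(weights, bits):
    """Symmetric round-to-nearest quantization with one shared scale.

    Assumes at least one non-zero weight (otherwise the scale would be zero).
    """
    qmax = 2 ** (bits - 1) - 1                      # 127 for 8-bit, 7 for 4-bit
    scale = max(abs(w) for w in weights) / qmax
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Map integer codes back to approximate floating-point weights."""
    return [c * scale for c in codes]


# Illustrative values only, not taken from a real checkpoint.
weights = [0.12, -0.53, 0.98, -0.07, 0.31, -0.88, 0.44, 0.02]

for bits in (8, 4):
    codes, scale = quantize_absmax(weights, bits)
    restored = dequantize(codes, scale)
    max_err = max(abs(a - b) for a, b in zip(weights, restored))
    print(f"{bits}-bit: max abs error {max_err:.4f}, "
          f"{bits / 16:.2f}x the size of fp16")
```

Halving the bit width halves the storage but coarsens the grid the weights snap to. The research frontier is about keeping that second effect small.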

Apple’s MLX has become a first-class runtime for local inference on Apple Silicon. The unified memory architecture that looked like a curiosity two years ago is now the right shape for LLM inference: large pools of fast memory shared between CPU and GPU, with none of the PCIe round-trips that bottleneck traditional setups. MLX is a reminder that the answer to inference is not always “more NVIDIA.”

Unsloth has made RL fine-tuning approachable enough that small teams — and individuals — can run post-training loops on modest hardware. Customizing a local model for a specific task has collapsed from “needs a research team” to “needs a weekend.”

RISC-V is showing up in consumer silicon. Samsung’s BM9K1 — a PCIe 5.0 QLC SSD hitting 11.4GB/s, shipping in laptops in 2027 — uses RISC-V cores in its controller instead of the usual ARM Cortex-R, the first time Samsung has put RISC-V in a commercial consumer product. On its own it is just a storage controller, but the signal matters. Between MLX on Apple Silicon, AMD closing the GPU gap, and RISC-V creeping up from the controller layer, the “NVIDIA + x86” default that defined the last decade of AI infrastructure is no longer the only path.

Put these together and the story is simple: people have stopped pretending that wasting resources is a strategy. Performance engineering — the boring discipline of making things smaller, faster, and cheaper — is where the most interesting progress is happening again.

Security is the unlock nobody is pricing in

There is one more reason local models matter, and it is the one I keep repeating: security will be the biggest pushback on agentic AI in the enterprise, and most of the agents we see today are not cut out for that environment.

The industry has spent two years shipping agents that assume a flat trust model — one model, one context, one credential pool, broad tool access, everything in the same process talking to the same API. Fine for a demo. Terrible for a Fortune 500 that has spent twenty years building segmentation, data classification, least-privilege access, and audit trails specifically to contain the blast radius of any compromised component. The moment you drop a broadly-scoped agent with a cloud-hosted model into that environment, you have punched a hole through every layer the security team built.

Local models are the natural fix because they let you put the model itself inside the segmentation boundary. A customer-support agent can run on a model that physically cannot reach the finance VLAN. A code-assist agent can run on a model that has never seen credentials outside its sandbox. Different business units can run different models under different governance regimes, fine-tunes, and retention policies — the same way they already run different databases, identity providers, and key-management systems. Local inference is what makes that segmentation real rather than aspirational.

Every CISO I talk to is wrestling with the same question: how do you let an autonomous agent act on behalf of a user without violating least privilege, without creating an unauditable side-channel to every system it touches, and without trusting a vendor’s weights and infrastructure with data legal spent a decade walling off? The honest answer today is that you mostly can’t, which is why serious enterprise rollouts are moving slower than the demos suggest.

Worth saying out loud: “local” is mostly a rebrand of what we used to call on-premise. The industry dropped that term because it carried twenty years of baggage — clunky deployments, long procurement cycles — but the underlying idea is the one enterprise security has been defending for a generation: keep the compute, the data, and the trust boundary somewhere you actually control. Local models drag that principle back into the AI conversation, where it quietly went missing the moment everything became an API call.

This is also where I spend most of my time at OnDB. Safe data access is the hard problem underneath agentic AI — not “can the model answer,” but “should this model, acting on behalf of this user, with these credentials, in this context, be allowed to see this row, this document, this secret at all?” You cannot answer that honestly if the model runs in someone else’s data center with an opaque retention policy. You can answer it if the model sits inside your segmentation boundary and the access layer enforces policy before anything reaches its context window. Local inference is one of the parameters that makes safe data access tractable, alongside identity, policy, and auditability.
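A minimal sketch of what "the access layer enforces policy before anything reaches the context window" could look like. Every name here (`Principal`, `Record`, `AccessLayer`, the label strings) is hypothetical and invented for illustration. This is not OnDB's API, and a real policy engine would be far richer than a label check.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Principal:
    """Who is asking: the human user plus the agent acting on their behalf."""
    user: str
    agent: str
    clearances: frozenset


@dataclass(frozen=True)
class Record:
    """A piece of private data tagged with a classification label."""
    record_id: str
    label: str
    text: str


@dataclass
class AccessLayer:
    """Filters records BEFORE they are assembled into model context."""
    audit_log: list = field(default_factory=list)

    def build_context(self, principal: Principal, records: list) -> str:
        allowed = []
        for record in records:
            decision = "allow" if record.label in principal.clearances else "deny"
            # Every decision is logged, including denials, so access is auditable.
            self.audit_log.append(
                (principal.user, principal.agent, record.record_id, decision)
            )
            if decision == "allow":
                allowed.append(record.text)
        return "\n".join(allowed)


records = [
    Record("kb-1", "support", "Password reset steps for customers."),
    Record("fin-7", "finance", "Q3 ledger adjustments."),
]
support_agent = Principal("alice", "support-agent", frozenset({"support"}))

layer = AccessLayer()
context = layer.build_context(support_agent, records)
print(context)
print(layer.audit_log)
```

The design point is ordering: the model never sees the finance record, so there is nothing for a prompt injection to exfiltrate, and the denial still leaves an audit entry.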

Local models do not solve this alone — you still need segmentation, identity, audit, sandboxing, and a data plane that understands what least privilege means for a non-human actor. But they make the posture achievable at all. For a lot of enterprises, “cloud or local” will collapse into “can our security team actually sign off on this?” — and the answer is going to be local far more often than the current narrative suggests.

Public data is a commodity; private data is the moat

There is a related shift on the data side, and it points the same direction.

For two years, “give the agent a web search API” has been treated as the universal answer to context. It is not. Public web data is a commodity — every agent hits the same endpoints, scrapes the same pages, retrieves the same Wikipedia paragraphs, and gets back the same flattened, low-signal context. If your agent’s differentiator is that it can Google things, you do not have a differentiator.

What will actually differentiate enterprise agents is the quality and provenance of the private data they can reach: internal documents, historical transactions, customer records, telemetry, domain-specific corpora that a company has spent decades building and no public crawler will ever see. Proprietary data providers — vetted, licensed, structured, unscrapable — are going to matter more than any public search API. Context enrichment from trusted sources is a bigger lever on agent quality than another 10B parameters on the model.

This is the other half of what I am building at OnDB. We programmatically generate skill manifests (skills.md) for trusted data providers, so an agent can discover, reason about, and safely query private sources the same way it discovers a tool. The skills.md layer turns “I have a database” and “I have an agent” into a composable contract: the agent knows what a provider exposes, under what policy, with what schema, and the provider knows who is asking and why. Across a directory of trusted providers, an agent’s effective intelligence stops being a function of the model and starts being a function of the data it is allowed to stand on.
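As a rough illustration of the "composable contract" idea, here is a toy manifest. The structure, field names, and rendered layout are all assumptions made up for this sketch. They are not the actual skills.md format and not OnDB's implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SkillManifest:
    """Hypothetical description of what a data provider exposes, and under what policy."""
    provider: str
    description: str
    fields: tuple             # schema: fields the provider exposes
    allowed_purposes: tuple   # policy: reasons a caller may query for

    def to_markdown(self) -> str:
        """Render the manifest as a small markdown document an agent could read."""
        lines = [f"# {self.provider}", "", self.description, "", "## Fields"]
        lines += [f"- {name}" for name in self.fields]
        lines += ["", "## Allowed purposes"]
        lines += [f"- {purpose}" for purpose in self.allowed_purposes]
        return "\n".join(lines)

    def permits(self, requested_fields, purpose: str) -> bool:
        """Provider-side check: is this request inside the published contract?"""
        return (purpose in self.allowed_purposes
                and set(requested_fields) <= set(self.fields))


manifest = SkillManifest(
    provider="example-claims-data",   # made-up provider name
    description="Historical claims records, licensed for internal analytics.",
    fields=("claim_id", "region", "amount"),
    allowed_purposes=("fraud-review", "pricing"),
)

print(manifest.to_markdown())
print(manifest.permits(["region", "amount"], "pricing"))
print(manifest.permits(["customer_ssn"], "pricing"))
```

The agent reads the rendered document to learn what it may ask for. The provider runs `permits` to check who is asking and why before returning anything.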

Local inference and trusted-data skills reinforce each other. Local gives you the security posture that makes private providers willing to be reached at all. Private data gives the local model something to say that no cloud-hosted generalist can match. The combination is where I think enterprise agentic AI is actually going.

What this means for agent builders

A lot of the defensive advice from the August post still applies. You still want multi-layer safety, structured I/O, and logic externalized from prompts into code. Those patterns were never about small models being bad — they were about building robust software, which is good advice regardless of model size.

But some of the sharper constraints have softened:

Context windows are no longer the 4K–8K straitjacket they were. You can run an agent with a sensible conversation history without fighting for every token.
Reasoning on a 26B Gemma 4 or Qwen3.5-35B-A3B is qualitatively different from a 7B dense model last summer. Chain-of-Thought is no longer an automatic failure mode at this size class.
Tool calling is reliable. Structured outputs, which I argued for as a robustness hack, are now a first-class feature of the training itself.
Fine-tuning is not a last resort. With Unsloth-class tooling, it is a normal step in the build loop.

The defensive patterns are still useful, but they are no longer the entire game. You can now build local agents that look like the cloud agents you were shipping a year ago, with the privacy, latency, and cost profile only local deployment gives you.

The durable direction

Five things have become clear to me over the last eight months, and they do not look like a fad:

The default size class for a useful local model is creeping upward, fast. A year ago it was 7B. Today it is 26B dense or 35B MoE. The hardware is racing up to meet it.
MoE is the right shape for local inference. For decode speed, active parameters, not total parameters, are the resource that matters; total parameters still set how much memory you need. Every serious open-weights lab has figured this out.
Hardware and software are finally being co-designed for inference. Apple with MLX, the broader llama.cpp ecosystem, the quantization researchers, the RL tooling teams — they are all pulling in the same direction, and the compounding effect is significant.
Security is the forcing function the industry has not priced in yet. Enterprise agentic AI will be gated by what CISOs can sign off on, not by what the frontier labs can demo. That path runs through segmentation, and segmentation runs through local.
Public data is a commodity; private data is the moat. Web search APIs will not differentiate agents. Trusted, licensed, proprietary data providers — exposed through programmatically generated skill manifests — will.

In August, the honest framing was: here is how to build useful things under difficult constraints. Today the constraints are still there, but receding, and the question is no longer “can I do this locally?” but “is there any reason not to?”

For a large and growing slice of workloads — privacy-sensitive, latency-sensitive, cost-sensitive, anything where you want the model inside your product rather than a vendor dependency — the answer is increasingly that there isn’t. Local is finally within reach.

Let’s Connect: I am building OnDB around the belief that safe proprietary data access is what will make agents smarter — not bigger models, not bigger context windows, but trusted, governed, private data the agent is actually allowed to stand on. If you are building in this space, I would love to compare notes. Reach out via email or on X (@msuiche).