<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Machine Learning on Matt Suiche</title><link>https://www.msuiche.com/categories/machine-learning/</link><description>Recent content in Machine Learning on Matt Suiche</description><generator>Hugo</generator><language>en-us</language><lastBuildDate>Sun, 05 Apr 2026 00:00:00 +0000</lastBuildDate><atom:link href="https://www.msuiche.com/categories/machine-learning/index.xml" rel="self" type="application/rss+xml"/><item><title>Local Models Within Reach: Everything That Changed in Eight Months</title><link>https://www.msuiche.com/posts/local-models-within-reach-everything-that-changed-in-eight-months/</link><pubDate>Sun, 05 Apr 2026 00:00:00 +0000</pubDate><guid>https://www.msuiche.com/posts/local-models-within-reach-everything-that-changed-in-eight-months/</guid><description>&lt;p&gt;Eight months ago I published &lt;a href="https://www.msuiche.com/posts/building-agents-for-small-language-models-a-deep-dive-into-lightweight-ai/"&gt;Building Agents for Small Language Models&lt;/a&gt;, a set of hard-won notes from shipping agents on 270M–32B parameter models. At the time, running useful local models meant embracing constraints: small context windows, CPU-only fallbacks, broken UTF-8 streams, and reasoning that fell apart past two steps.&lt;/p&gt;
&lt;p&gt;I stand by that post. But the ground has shifted fast. What was a set of careful workarounds in August 2025 is starting to look like the default architecture for a large class of workloads. Local models are no longer the constrained sibling of cloud APIs — for many agent use cases, they are the better answer. Here is what has changed.&lt;/p&gt;</description></item><item><title>RustBPE: High-Performance BPE Tokenizer Training in Rust</title><link>https://www.msuiche.com/posts/rustbpe-high-performance-bpe-tokenizer-training-in-rust/</link><pubDate>Wed, 15 Oct 2025 00:00:00 +0000</pubDate><guid>https://www.msuiche.com/posts/rustbpe-high-performance-bpe-tokenizer-training-in-rust/</guid><description>&lt;h2 id="introduction"&gt;Introduction &lt;a href="#introduction" class="anchor"&gt;🔗&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;Byte Pair Encoding (BPE) tokenization underpins most modern language models, yet efficient implementations of BPE training remain scarce. OpenAI&amp;rsquo;s &lt;code&gt;tiktoken&lt;/code&gt; handles inference well but does not support training, while HuggingFace&amp;rsquo;s &lt;code&gt;tokenizers&lt;/code&gt; supports training at the cost of considerable complexity and overhead. &lt;strong&gt;RustBPE&lt;/strong&gt; is a Rust implementation that provides BPE training with better performance.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;RustBPE was developed by Andrej Karpathy&lt;/strong&gt; as part of the &lt;a href="https://github.com/karpathy/nanochat/tree/master/rustbpe" target="_blank" rel="noopener"&gt;nanochat project&lt;/a&gt;. This analysis covers the RustBPE implementation, including its architecture, performance characteristics, and Python integration.&lt;/p&gt;
&lt;p&gt;For those interested in understanding BPE implementation from first principles, &lt;a href="https://sebastianraschka.com/blog/2025/bpe-from-scratch.html" target="_blank" rel="noopener"&gt;Sebastian Raschka provides an excellent deep dive into implementing BPE from scratch&lt;/a&gt; in his blog post, and the topic is also covered in his book &amp;ldquo;Build a Large Language Model (From Scratch)&amp;rdquo;. His work offers invaluable insight into the algorithmic foundations that underpin implementations like RustBPE.&lt;/p&gt;</description></item><item><title>Optimizing AlphaFold's Triangle Multiplicative Update: A First Look at GPU Performance Engineering</title><link>https://www.msuiche.com/posts/optimizing-alphafolds-triangle-multiplicative-update-a-first-look-at-gpu-performance-engineering/</link><pubDate>Tue, 30 Sep 2025 00:00:00 +0000</pubDate><guid>https://www.msuiche.com/posts/optimizing-alphafolds-triangle-multiplicative-update-a-first-look-at-gpu-performance-engineering/</guid><description>&lt;h2 id="background"&gt;Background &lt;a href="#background" class="anchor"&gt;🔗&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;I recently encountered the &lt;a href="https://www.gpumode.com/v2/leaderboard/496?tab=submission" target="_blank" rel="noopener"&gt;GPU MODE TriMul challenge&lt;/a&gt; while exploring GPU optimization. Coming from a systems engineering background with no prior PyTorch or Triton experience, I saw the challenge as an opportunity to learn GPU performance engineering through a practical problem.&lt;/p&gt;
&lt;p&gt;The Triangle Multiplicative Update (TriMul) is a core operation in AlphaFold2 and AlphaFold3—the protein structure prediction systems that earned the 2024 Nobel Prize in Chemistry. The operation&amp;rsquo;s O(n³) complexity creates severe performance bottlenecks in production, forcing AlphaFold3 to use batch size 1 during training despite having under 1B parameters. This makes the optimization problem both practically relevant and technically challenging.&lt;/p&gt;</description></item><item><title>The Hidden Math Bug That Makes AI Unpredictable</title><link>https://www.msuiche.com/posts/the-hidden-math-bug-that-makes-ai-unpredictable/</link><pubDate>Sun, 14 Sep 2025 00:00:00 +0200</pubDate><guid>https://www.msuiche.com/posts/the-hidden-math-bug-that-makes-ai-unpredictable/</guid><description>&lt;p&gt;This &lt;a href="https://x.com/awnihannun/status/1966953027451118012" target="_blank" rel="noopener"&gt;tweet from Awni Hannun&lt;/a&gt; demonstrates in one line of MLX code the nondeterminism phenomenon detailed in &lt;a href="https://thinkingmachines.ai/blog/defeating-nondeterminism-in-llm-inference/" target="_blank" rel="noopener"&gt;Thinking Machines&amp;rsquo; research&lt;/a&gt;. I found that tweet extremely helpful for understanding what the original blog post was about, so we will explore the PyTorch equivalent, which reveals a fundamental issue in AI systems.&lt;/p&gt;
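&lt;p&gt;As a minimal sketch of the underlying phenomenon (my own illustration, not code from the post or the tweet), plain Python floats already show that regrouping additions changes the result:&lt;/p&gt;

```python
# Floating-point addition is not associative: grouping changes the result.
a, b, c = 1e20, -1e20, 1.0

left = (a + b) + c   # (1e20 - 1e20) + 1.0  ==  0.0 + 1.0  ==  1.0
right = a + (b + c)  # -1e20 + 1.0 rounds back to -1e20, so the sum is 0.0

print(left, right)   # 1.0 0.0
```

&lt;p&gt;Parallel GPU reductions regroup additions differently depending on batch size and kernel configuration, which is why identical inputs can yield slightly different outputs from run to run.&lt;/p&gt;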
&lt;blockquote class="twitter-tweet"&gt;&lt;p lang="en" dir="ltr"&gt;Here&amp;#39;s a one-line code summary in MLX of the &lt;a href="https://twitter.com/thinkymachines?ref_src=twsrc%5Etfw"&gt;@thinkymachines&lt;/a&gt; blog post on non-determinism in LLM inference.&lt;br&gt;&lt;br&gt;I&amp;#39;d guess the difference is larger the lower the precision, as you get larger affects from non-associativity of FP math.&lt;br&gt;&lt;br&gt;Interestingly, that implies that training at low… &lt;a href="https://t.co/jYcDK9GiLn"&gt;pic.twitter.com/jYcDK9GiLn&lt;/a&gt;&lt;/p&gt;&lt;/blockquote&gt;</description></item><item><title>Building Agents for Small Language Models: A Deep Dive into Lightweight AI</title><link>https://www.msuiche.com/posts/building-agents-for-small-language-models-a-deep-dive-into-lightweight-ai/</link><pubDate>Wed, 27 Aug 2025 00:00:00 +0000</pubDate><guid>https://www.msuiche.com/posts/building-agents-for-small-language-models-a-deep-dive-into-lightweight-ai/</guid><description>&lt;p&gt;The landscape of AI agents has been dominated by large language models (LLMs) like GPT-4 and Claude, but a new frontier is opening up: lightweight, open-source, locally-deployable agents that can run on consumer hardware. This post shares internal notes and discoveries from my journey building agents for small language models (SLMs) – models ranging from 270M to 32B parameters that run efficiently on CPUs or modest GPUs. These are lessons learned from hands-on experimentation, debugging, and optimizing inference pipelines.&lt;/p&gt;</description></item></channel></rss>