<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Technical on Matt Suiche</title><link>https://www.msuiche.com/categories/technical/</link><description>Recent content in Technical on Matt Suiche</description><generator>Hugo</generator><language>en-us</language><lastBuildDate>Wed, 15 Oct 2025 02:00:00 +0000</lastBuildDate><atom:link href="https://www.msuiche.com/categories/technical/index.xml" rel="self" type="application/rss+xml"/><item><title>AMD GPU Support in Triton Gluon Framework</title><link>https://www.msuiche.com/posts/amd-gpu-support-in-triton-gluon-framework/</link><pubDate>Wed, 15 Oct 2025 02:00:00 +0000</pubDate><guid>https://www.msuiche.com/posts/amd-gpu-support-in-triton-gluon-framework/</guid><description>&lt;h2 id="introduction"&gt;Introduction &lt;a href="#introduction" class="anchor"&gt;🔗&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;This document analyzes the implementation of AMD GPU support in Triton&amp;rsquo;s Gluon framework, examining architecture-specific optimizations, performance characteristics, and how the implementation compares with the existing NVIDIA GPU support.&lt;/p&gt;
&lt;p&gt;For background on Gluon and its motivation as a lower-level alternative to Triton, see my previous post: &lt;a href="https://www.msuiche.com/posts/gluon-when-triton-isnt-low-level-enough/" target="_blank" rel="noopener"&gt;&amp;ldquo;Gluon: When Triton Isn&amp;rsquo;t Low-Level Enough&amp;rdquo;&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id="background-gpu-programming-architecture-landscape"&gt;Background: GPU Programming Architecture Landscape &lt;a href="#background-gpu-programming-architecture-landscape" class="anchor"&gt;🔗&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;The GPU programming ecosystem has evolved with distinct architectural approaches between NVIDIA and AMD, creating implementation challenges for cross-platform frameworks.&lt;/p&gt;</description></item><item><title>RustBPE: High-Performance BPE Tokenizer Training in Rust</title><link>https://www.msuiche.com/posts/rustbpe-high-performance-bpe-tokenizer-training-in-rust/</link><pubDate>Wed, 15 Oct 2025 00:00:00 +0000</pubDate><guid>https://www.msuiche.com/posts/rustbpe-high-performance-bpe-tokenizer-training-in-rust/</guid><description>&lt;h2 id="introduction"&gt;Introduction &lt;a href="#introduction" class="anchor"&gt;🔗&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;Byte Pair Encoding (BPE) tokenization is used in modern language models, but efficient training implementations are limited. OpenAI&amp;rsquo;s &lt;code&gt;tiktoken&lt;/code&gt; handles inference well, while HuggingFace&amp;rsquo;s &lt;code&gt;tokenizers&lt;/code&gt; supports training but has complexity and overhead. &lt;strong&gt;RustBPE&lt;/strong&gt; is a Rust implementation that provides training capabilities with better performance.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;RustBPE was developed by Andrej Karpathy&lt;/strong&gt; as part of the &lt;a href="https://github.com/karpathy/nanochat/tree/master/rustbpe" target="_blank" rel="noopener"&gt;nanochat project&lt;/a&gt;. This analysis covers the RustBPE implementation, including its architecture, performance characteristics, and Python integration.&lt;/p&gt;
&lt;p&gt;For those interested in understanding BPE from first principles, &lt;a href="https://sebastianraschka.com/blog/2025/bpe-from-scratch.html" target="_blank" rel="noopener"&gt;Sebastian Raschka provides an excellent deep-dive into implementing BPE from scratch&lt;/a&gt; in his blog post, and this is also covered in his book &amp;ldquo;Build a Large Language Model (From Scratch)&amp;rdquo;. His work offers invaluable insights into the algorithmic foundations that underpin implementations like RustBPE.&lt;/p&gt;</description></item><item><title>Optimizing AlphaFold's Triangle Multiplicative Update: A First Look at GPU Performance Engineering</title><link>https://www.msuiche.com/posts/optimizing-alphafolds-triangle-multiplicative-update-a-first-look-at-gpu-performance-engineering/</link><pubDate>Tue, 30 Sep 2025 00:00:00 +0000</pubDate><guid>https://www.msuiche.com/posts/optimizing-alphafolds-triangle-multiplicative-update-a-first-look-at-gpu-performance-engineering/</guid><description>&lt;h2 id="background"&gt;Background &lt;a href="#background" class="anchor"&gt;🔗&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;I recently encountered the &lt;a href="https://www.gpumode.com/v2/leaderboard/496?tab=submission" target="_blank" rel="noopener"&gt;GPU MODE TriMul challenge&lt;/a&gt; while exploring GPU optimization. Coming from a systems engineering background without prior PyTorch or Triton experience, this challenge provided an opportunity to learn GPU performance engineering through a practical problem.&lt;/p&gt;
&lt;p&gt;The Triangle Multiplicative Update (TriMul) is a core operation in AlphaFold2 and AlphaFold3—the protein structure prediction systems that earned the 2024 Nobel Prize in Chemistry. The operation&amp;rsquo;s O(n³) complexity creates severe performance bottlenecks in production, forcing AlphaFold3 to use batch size 1 during training despite having under 1B parameters. This makes the optimization problem both practically relevant and technically challenging.&lt;/p&gt;</description></item><item><title>Gluon: When Triton Isn't Low-Level Enough</title><link>https://www.msuiche.com/posts/gluon-when-triton-isnt-low-level-enough/</link><pubDate>Tue, 23 Sep 2025 00:00:00 +0000</pubDate><guid>https://www.msuiche.com/posts/gluon-when-triton-isnt-low-level-enough/</guid><description>&lt;h1 id="my-journey-from-pytorch-to-gluon"&gt;My Journey from PyTorch to Gluon &lt;a href="#my-journey-from-pytorch-to-gluon" class="anchor"&gt;🔗&lt;/a&gt;&lt;/h1&gt;&lt;p&gt;After spending the last month diving into PyTorch, learning Triton, understanding CUDA, and even peeking at PTX/SASS assembly, I&amp;rsquo;ve come to a surprising realization: I&amp;rsquo;ve yet to meet anyone who&amp;rsquo;s actually writing &lt;a href="https://siboehm.com/articles/22/CUDA-MMM" target="_blank" rel="noopener"&gt;raw CUDA code in production anymore&lt;/a&gt;. Everyone I&amp;rsquo;ve talked to – from ML engineers at startups to researchers at big tech companies – seems to have converged on Triton as their go-to solution for custom GPU kernels. And honestly? 
The &lt;a href="https://www.gpumode.com/v2/leaderboard/496?tab=rankings" target="_blank" rel="noopener"&gt;fused-kernel performance they&amp;rsquo;re getting is impressive enough&lt;/a&gt; that I understand why.&lt;/p&gt;</description></item><item><title>The Hidden Math Bug That Makes AI Unpredictable</title><link>https://www.msuiche.com/posts/the-hidden-math-bug-that-makes-ai-unpredictable/</link><pubDate>Sun, 14 Sep 2025 00:00:00 +0200</pubDate><guid>https://www.msuiche.com/posts/the-hidden-math-bug-that-makes-ai-unpredictable/</guid><description>&lt;p&gt;This &lt;a href="https://x.com/awnihannun/status/1966953027451118012" target="_blank" rel="noopener"&gt;tweet from Awni Hannun&lt;/a&gt; demonstrates in one line of MLX code the nondeterminism phenomenon detailed in &lt;a href="https://thinkingmachines.ai/blog/defeating-nondeterminism-in-llm-inference/" target="_blank" rel="noopener"&gt;Thinking Machines&amp;rsquo; research&lt;/a&gt;. We will explore the PyTorch equivalent, which reveals a fundamental issue in AI systems; I found that tweet extremely helpful for understanding what the original blog post was about.&lt;/p&gt;
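&lt;p&gt;The underlying effect needs no ML framework at all; plain Python doubles already show that the order of additions changes the result (a minimal illustrative sketch):&lt;/p&gt;

```python
# Floating-point addition is not associative, so the order in which a sum
# is reduced changes the result. GPU kernels split reductions across
# threads in orders that can vary run to run, which is the root of the
# nondeterminism discussed above.
a, b, c = 1e20, -1e20, 1.0

left  = (a + b) + c   # b cancels a exactly, then c survives: 1.0
right = a + (b + c)   # c is absorbed into b (smaller than its ulp): 0.0

print(left, right)    # 1.0 0.0
```

&lt;p&gt;The lower the precision, the larger the magnitude gap at which small terms get absorbed, which is why low-precision inference amplifies exactly this effect.&lt;/p&gt;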
&lt;blockquote class="twitter-tweet"&gt;&lt;p lang="en" dir="ltr"&gt;Here&amp;#39;s a one-line code summary in MLX of the &lt;a href="https://twitter.com/thinkymachines?ref_src=twsrc%5Etfw"&gt;@thinkymachines&lt;/a&gt; blog post on non-determinism in LLM inference.&lt;br&gt;&lt;br&gt;I&amp;#39;d guess the difference is larger the lower the precision, as you get larger affects from non-associativity of FP math.&lt;br&gt;&lt;br&gt;Interestingly, that implies that training at low… &lt;a href="https://t.co/jYcDK9GiLn"&gt;pic.twitter.com/jYcDK9GiLn&lt;/a&gt;&lt;/p&gt;&lt;/blockquote&gt;</description></item></channel></rss>