<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Performance on Matt Suiche</title><link>https://www.msuiche.com/categories/performance/</link><description>Recent content in Performance on Matt Suiche</description><generator>Hugo</generator><language>en-us</language><lastBuildDate>Wed, 15 Oct 2025 00:00:00 +0000</lastBuildDate><atom:link href="https://www.msuiche.com/categories/performance/index.xml" rel="self" type="application/rss+xml"/><item><title>RustBPE: High-Performance BPE Tokenizer Training in Rust</title><link>https://www.msuiche.com/posts/rustbpe-high-performance-bpe-tokenizer-training-in-rust/</link><pubDate>Wed, 15 Oct 2025 00:00:00 +0000</pubDate><guid>https://www.msuiche.com/posts/rustbpe-high-performance-bpe-tokenizer-training-in-rust/</guid><description>&lt;h2 id="introduction"&gt;Introduction &lt;a href="#introduction" class="anchor"&gt;🔗&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;Byte Pair Encoding (BPE) tokenization is used throughout modern language models, but efficient training implementations are scarce. OpenAI&amp;rsquo;s &lt;code&gt;tiktoken&lt;/code&gt; handles inference well, while HuggingFace&amp;rsquo;s &lt;code&gt;tokenizers&lt;/code&gt; supports training but carries significant complexity and overhead. &lt;strong&gt;RustBPE&lt;/strong&gt; is a Rust implementation that provides training support with better performance.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;RustBPE was developed by Andrej Karpathy&lt;/strong&gt; as part of the &lt;a href="https://github.com/karpathy/nanochat/tree/master/rustbpe" target="_blank" rel="noopener"&gt;nanochat project&lt;/a&gt;. This analysis covers the RustBPE implementation, including its architecture, performance characteristics, and Python integration.&lt;/p&gt;
&lt;p&gt;For those interested in understanding BPE implementation from first principles, &lt;a href="https://sebastianraschka.com/blog/2025/bpe-from-scratch.html" target="_blank" rel="noopener"&gt;Sebastian Raschka provides an excellent deep-dive into implementing BPE from scratch&lt;/a&gt; in his blog post, and the topic is also covered in his book &amp;ldquo;Build a Large Language Model (From Scratch)&amp;rdquo;. His work offers invaluable insights into the algorithmic foundations that underpin implementations like RustBPE.&lt;/p&gt;</description></item><item><title>Optimizing AlphaFold's Triangle Multiplicative Update: A First Look at GPU Performance Engineering</title><link>https://www.msuiche.com/posts/optimizing-alphafolds-triangle-multiplicative-update-a-first-look-at-gpu-performance-engineering/</link><pubDate>Tue, 30 Sep 2025 00:00:00 +0000</pubDate><guid>https://www.msuiche.com/posts/optimizing-alphafolds-triangle-multiplicative-update-a-first-look-at-gpu-performance-engineering/</guid><description>&lt;h2 id="background"&gt;Background &lt;a href="#background" class="anchor"&gt;🔗&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;I recently encountered the &lt;a href="https://www.gpumode.com/v2/leaderboard/496?tab=submission" target="_blank" rel="noopener"&gt;GPU MODE TriMul challenge&lt;/a&gt; while exploring GPU optimization. Coming from a systems engineering background without prior PyTorch or Triton experience, I saw the challenge as an opportunity to learn GPU performance engineering through a practical problem.&lt;/p&gt;
&lt;p&gt;The Triangle Multiplicative Update (TriMul) is a core operation in AlphaFold2 and AlphaFold3, the protein structure prediction systems that earned the 2024 Nobel Prize in Chemistry. The operation&amp;rsquo;s O(n³) complexity creates severe performance bottlenecks in production, forcing AlphaFold3 to use batch size 1 during training despite having under 1B parameters. This makes the optimization problem both practically relevant and technically challenging.&lt;/p&gt;</description></item></channel></rss>