Hello! My name is Matt Suiche. I am an independent researcher, advisor, and investor. I previously served as Head of Detection Engineering at Magnet Forensics, an organization passionately dedicated to justice and protecting the innocent, a mission I joined after the 2022 acquisition of my cybersecurity start-up, Comae Technologies.
My lifelong fascination with learning and understanding complex systems is what first led me to cybersecurity. My teenage years were spent immersed in reverse engineering, which ignited a profound curiosity about technology that continues to this day. I’ve since explored fields including operating systems architecture, programming languages, virtualization, modern web application development, and generative art, and I’ve delved into domains such as privacy, surveillance, forensics, blockchain, and community development.
Background 🔗

I recently encountered the GPU MODE TriMul challenge while exploring GPU optimization. Coming from a systems engineering background with no prior PyTorch or Triton experience, I saw the challenge as an opportunity to learn GPU performance engineering through a practical problem.
The Triangle Multiplicative Update (TriMul) is a core operation in AlphaFold2 and AlphaFold3, the protein structure prediction systems behind the 2024 Nobel Prize in Chemistry. The operation’s O(n³) complexity creates severe performance bottlenecks in production, forcing AlphaFold3 to train with batch size 1 despite having under 1B parameters.
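To make that cubic cost concrete, here is a minimal PyTorch sketch of the outgoing-edge triangle multiplicative update, loosely following AlphaFold2’s Algorithm 11. The module and parameter names (`TriMulOutgoing`, `c_z`, `c_hidden`) are my own illustrative choices rather than the challenge’s reference implementation, and the batch and mask handling of the real kernel is omitted; the einsum over the shared index k is where the O(n³) work lives.

```python
import torch
import torch.nn as nn


class TriMulOutgoing(nn.Module):
    """Illustrative sketch of the triangle multiplicative update ("outgoing" edges)."""

    def __init__(self, c_z: int = 128, c_hidden: int = 128):
        super().__init__()
        self.norm_in = nn.LayerNorm(c_z)
        self.a_proj = nn.Linear(c_z, c_hidden)
        self.a_gate = nn.Linear(c_z, c_hidden)
        self.b_proj = nn.Linear(c_z, c_hidden)
        self.b_gate = nn.Linear(c_z, c_hidden)
        self.norm_out = nn.LayerNorm(c_hidden)
        self.out_proj = nn.Linear(c_hidden, c_z)
        self.out_gate = nn.Linear(c_z, c_z)

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        # z: (n, n, c_z) pair representation
        z = self.norm_in(z)
        a = torch.sigmoid(self.a_gate(z)) * self.a_proj(z)  # (n, n, c_hidden)
        b = torch.sigmoid(self.b_gate(z)) * self.b_proj(z)  # (n, n, c_hidden)
        # The cubic step: for every pair (i, j), sum over the shared index k.
        # This einsum does O(n^3 * c_hidden) work and dominates the runtime.
        x = torch.einsum("ikc,jkc->ijc", a, b)               # (n, n, c_hidden)
        g = torch.sigmoid(self.out_gate(z))                  # (n, n, c_z)
        return g * self.out_proj(self.norm_out(x))


if __name__ == "__main__":
    n, c_z = 64, 128
    z = torch.randn(n, n, c_z)
    print(TriMulOutgoing(c_z)(z).shape)  # torch.Size([64, 64, 128])
```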
GPU constraints in production are creating infrastructure bottlenecks, and multi-GPU programming, particularly vendor-agnostic implementations, has become essential. In their GPU MODE presentation, AMD Research engineers Muhammad Awad, Muhammad Osama, and Brandon Potter introduced Iris, a Python library that enables fine-grained multi-GPU programming in Triton. As with my previous Gluon blog post, this post captures my understanding and interpretation of their work, serving as both technical documentation and a personal reference for this emerging multi-GPU programming paradigm.
My Journey from PyTorch to Gluon 🔗

After spending the last month diving into PyTorch, learning Triton, understanding CUDA, and even peeking at PTX/SASS assembly, I’ve come to a surprising realization: I’ve yet to meet anyone who’s actually writing raw CUDA code in production anymore. Everyone I’ve talked to – from ML engineers at startups to researchers at big tech companies – seems to have converged on Triton as their go-to solution for custom GPU kernels.