# GPU Programming on Matt Suiche

Recent content in the [GPU Programming category](https://www.msuiche.com/categories/gpu-programming/) on msuiche.com. Last updated Fri, 17 Oct 2025.

## [Porting CUDA FFT to Mojo: Achieving Bit-Exact Precision](https://www.msuiche.com/posts/porting-cuda-fft-to-mojo-achieving-bit-exact-precision/)

*Fri, 17 Oct 2025*

Porting a CUDA Fast Fourier Transform (FFT) implementation to Mojo for the [LeetGPU Fast Fourier Transform challenge](https://leetgpu.com/challenges/fast-fourier-transform) surfaced an unexpected problem: achieving bit-exact precision matching between CUDA's `sinf()`/`cosf()` functions and their Mojo equivalents. Getting there required PTX assembly analysis, cross-platform testing, and ultimately upgrading to Float64 precision for deterministic results.
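Where does the mismatch come from? A minimal NumPy sketch (my illustration, not the post's Mojo or CUDA code) shows how FFT twiddle factors computed through float32, as `sinf()`/`cosf()` effectively do, drift away from a float64 reference:

```python
import numpy as np

N = 262_144                              # largest transform size in the challenge
k = np.arange(N // 2)

# Angles quantized to float32 before sin/cos, roughly mimicking sinf()/cosf()
angle32 = (-2.0 * np.pi * k / N).astype(np.float32)
tw32 = np.cos(angle32) + 1j * np.sin(angle32)

# The same twiddle factors evaluated entirely in float64
angle64 = -2.0 * np.pi * k / N
tw64 = np.cos(angle64) + 1j * np.sin(angle64)

# Each factor is only off by about float32 epsilon, but a large FFT applies
# these factors repeatedly across its butterfly stages
print(np.abs(tw32 - tw64).max())
```

On top of this quantization, different `sinf()`/`cosf()` implementations can round the last bit differently across platforms, which is why the post ends up at PTX-level analysis and Float64 twiddles for deterministic results.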
&lt;h2 id="challenge-constraints"&gt;Challenge Constraints &lt;a href="#challenge-constraints" class="anchor"&gt;🔗&lt;/a&gt;&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;N range: $1 \leq N \leq 262,144$ (power-of-2 FFT sizes)&lt;/li&gt;
&lt;li&gt;Data type: All values are 32-bit floating point numbers&lt;/li&gt;
&lt;li&gt;Accuracy requirements: Absolute error $\leq 10^{-3}$, Relative error $\leq 10^{-3}$&lt;/li&gt;
&lt;li&gt;Array format: Input and output arrays have length $2N$ (interleaved real/imaginary)&lt;/li&gt;
&lt;/ul&gt;
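To make the array layout and the grading tolerances concrete, here is a small NumPy sketch (my own naming and illustration; the challenge's actual test harness is not shown in the excerpt):

```python
import numpy as np

N = 8
signal = np.exp(2j * np.pi * 3 * np.arange(N) / N)   # a pure complex test tone

# Challenge layout: a length-2N float32 array [re0, im0, re1, im1, ...]
interleaved = np.empty(2 * N, dtype=np.float32)
interleaved[0::2] = signal.real
interleaved[1::2] = signal.imag

# De-interleave back into a complex array and compute a float64 reference FFT
x = interleaved[0::2].astype(np.float64) + 1j * interleaved[1::2]
ref = np.fft.fft(x)

# Simulate a solver output within the stated tolerances; note np.allclose
# combines the two bounds (|a - b| <= atol + rtol * |b|) rather than testing
# the absolute and relative errors separately, so it only approximates them
candidate = ref + 5e-4
assert np.allclose(candidate, ref, atol=1e-3, rtol=1e-3)
```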
&lt;h2 id="initial-problem-accuracy-mismatch"&gt;Initial Problem: Accuracy Mismatch &lt;a href="#initial-problem-accuracy-mismatch" class="anchor"&gt;🔗&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;The initial Mojo FFT implementation failed correctness tests with a maximum absolute difference of 0.023 compared to the reference CUDA implementation. For a coding challenge requiring exact equality, this was unacceptable.&lt;/p&gt;</description></item><item><title>AMD GPU Support in Triton Gluon Framework</title><link>https://www.msuiche.com/posts/amd-gpu-support-in-triton-gluon-framework/</link><pubDate>Wed, 15 Oct 2025 02:00:00 +0000</pubDate><guid>https://www.msuiche.com/posts/amd-gpu-support-in-triton-gluon-framework/</guid><description>&lt;h2 id="introduction"&gt;Introduction &lt;a href="#introduction" class="anchor"&gt;🔗&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;This document analyzes AMD GPU support implementation in Triton&amp;rsquo;s Gluon framework, examining architecture-specific optimizations, performance characteristics, and implementation details relative to NVIDIA GPU support.&lt;/p&gt;
## [AMD GPU Support in Triton Gluon Framework](https://www.msuiche.com/posts/amd-gpu-support-in-triton-gluon-framework/)

*Wed, 15 Oct 2025*

### Introduction

This post analyzes the AMD GPU support implementation in Triton's Gluon framework, examining architecture-specific optimizations, performance characteristics, and implementation details relative to NVIDIA GPU support.

For background on Gluon and its motivation as a lower-level alternative to Triton, see my previous post, ["Gluon: When Triton Isn't Low-Level Enough"](https://www.msuiche.com/posts/gluon-when-triton-isnt-low-level-enough/).
&lt;h2 id="background-gpu-programming-architecture-landscape"&gt;Background: GPU Programming Architecture Landscape &lt;a href="#background-gpu-programming-architecture-landscape" class="anchor"&gt;🔗&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;The GPU programming ecosystem has evolved with distinct architectural approaches between NVIDIA and AMD, creating implementation challenges for cross-platform frameworks.&lt;/p&gt;</description></item><item><title>Gluon: When Triton Isn't Low-Level Enough</title><link>https://www.msuiche.com/posts/gluon-when-triton-isnt-low-level-enough/</link><pubDate>Tue, 23 Sep 2025 00:00:00 +0000</pubDate><guid>https://www.msuiche.com/posts/gluon-when-triton-isnt-low-level-enough/</guid><description>&lt;h1 id="my-journey-from-pytorch-to-gluon"&gt;My Journey from PyTorch to Gluon &lt;a href="#my-journey-from-pytorch-to-gluon" class="anchor"&gt;🔗&lt;/a&gt;&lt;/h1&gt;&lt;p&gt;After spending the last month diving into PyTorch, learning Triton, understanding CUDA, and even peeking at PTX/SASS assembly, I&amp;rsquo;ve come to a surprising realization: I&amp;rsquo;ve yet to meet anyone who&amp;rsquo;s actually writing &lt;a href="https://siboehm.com/articles/22/CUDA-MMM" target="_blank" rel="noopener"&gt;raw CUDA code in production anymore&lt;/a&gt;. Everyone I&amp;rsquo;ve talked to – from ML engineers at startups to researchers at big tech companies – seems to have converged on Triton as their go-to solution for custom GPU kernels. And honestly? The &lt;a href="https://www.gpumode.com/v2/leaderboard/496?tab=rankings" target="_blank" rel="noopener"&gt;fused kernels performance they&amp;rsquo;re getting is impressive enough&lt;/a&gt; that I understand why.&lt;/p&gt;</description></item></channel></rss>