Hello! My name is Matt Suiche. I am the founder of OnDB Inc., a data infrastructure startup for the agentic economy. I recently discussed cyberwar in the age of AI, Iran’s cyber capabilities, and how AI is reshaping hacking on Bloomberg’s Odd Lots and the National Security Lab podcast.
Previously, I co-founded CloudVolumes (acquired by VMware in 2014) and Comae Technologies (acquired by Magnet Forensics in 2022), where I later served as Head of Detection Engineering. I also founded the cybersecurity community project OPCDE.
My path into technology started in reverse engineering as a teenager, and has since spanned memory forensics, operating systems, virtualization, blockchain, and now AI infrastructure.
Latest
The internet was built with a missing piece. In 1994, when the HTTP specification reserved status code 402 for “Payment Required,” the architects knew money would eventually flow as freely as data. Three decades later, that vision is finally materializing—not because humans demanded it, but because AI agents need it.
The 402 Awakening
HTTP 402 sat dormant for years, a placeholder for a future nobody could quite figure out. Credit cards required human intervention. PayPal needed accounts. Stripe demanded integration. None of these worked for a world where software talks to software at millisecond intervals.
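To make the status code concrete, here is a minimal sketch of a handler that answers with 402 until a request carries proof of payment. The header name `X-Payment-Token` and the `is_valid_payment` callback are hypothetical, invented for this illustration. They do not describe any particular payment protocol.

```python
from http import HTTPStatus

# Hypothetical header name, invented for this sketch (not part of any standard).
PAYMENT_HEADER = "X-Payment-Token"


def handle_request(headers, is_valid_payment):
    """Return (status, response_headers, body) for a paid resource.

    `headers` is a dict of request headers. `is_valid_payment` is a
    caller-supplied function that decides whether a token is acceptable.
    """
    token = headers.get(PAYMENT_HEADER)
    if token is None or not is_valid_payment(token):
        # No usable proof of payment: answer 402 Payment Required.
        return (
            HTTPStatus.PAYMENT_REQUIRED,
            {"Content-Type": "text/plain"},
            b"Payment required\n",
        )
    # Payment accepted: serve the resource.
    return (
        HTTPStatus.OK,
        {"Content-Type": "text/plain"},
        b"Here is the paid resource\n",
    )
```

A software client can branch on the numeric status alone, which is what makes the code usable without a human in the loop.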
Porting a CUDA Fast Fourier Transform (FFT) implementation to Mojo for the LeetGPU Fast Fourier Transform challenge raised an unexpected problem: getting the Mojo equivalents of CUDA’s sinf()/cosf() functions to produce bit-exact matching results. Solving it required PTX assembly analysis, cross-platform testing, and ultimately an upgrade to Float64 precision for deterministic results.
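As a rough illustration of why single precision is fragile here, the sketch below computes FFT twiddle factors twice: once in Python's native 64-bit floats, and once with every intermediate rounded to 32 bits. This is not a reproduction of the CUDA/Mojo mismatch. The helper names are made up for the example, and rounding a double-precision `math.sin` result to float32 is only an approximation of what a real `sinf()` returns.

```python
import math
import struct


def to_f32(x):
    """Round a Python float (binary64) to the nearest binary32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def twiddle_f64(k, n):
    """Twiddle factor exp(-2*pi*i*k/n) computed in double precision."""
    theta = -2.0 * math.pi * k / n
    return math.cos(theta), math.sin(theta)


def twiddle_f32(k, n):
    """Same twiddle factor with each intermediate rounded to float32."""
    two_pi = to_f32(2.0 * math.pi)
    theta = to_f32(to_f32(-two_pi * k) / n)
    return to_f32(math.cos(theta)), to_f32(math.sin(theta))


def max_twiddle_error(n):
    """Largest absolute difference between the two variants for size n."""
    worst = 0.0
    for k in range(n // 2):
        c64, s64 = twiddle_f64(k, n)
        c32, s32 = twiddle_f32(k, n)
        worst = max(worst, abs(c64 - c32), abs(s64 - s32))
    return worst
```

The per-factor differences are tiny, but two float32 implementations that round differently at any step will not agree bit for bit, which is the kind of mismatch that moving to Float64 sidesteps.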
Challenge Constraints
N range: $1 \leq N \leq 262,144$ (power-of-2 FFT sizes)
Data type: All values are 32-bit floating point numbers
Accuracy requirements: Absolute error $\leq 10^{-3}$, Relative error $\leq 10^{-3}$
Array format: Input and output arrays have length $2N$ (interleaved real/imaginary)
Initial Problem: Accuracy Mismatch
The initial Mojo FFT implementation failed correctness tests with a maximum absolute difference of 0.023 compared to the reference CUDA implementation. That is more than twenty times the allowed absolute error of $10^{-3}$, well outside what the challenge accepts.
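The array format and accuracy requirements can be sketched in a few lines. The excerpt does not say how the challenge combines the absolute and relative tolerances, so the check below assumes the common allclose-style rule $|a - e| \leq \text{atol} + \text{rtol} \cdot |e|$. The function names are hypothetical.

```python
ATOL = 1e-3  # absolute error bound from the challenge constraints
RTOL = 1e-3  # relative error bound from the challenge constraints


def interleave(values):
    """Flatten complex numbers into [re0, im0, re1, im1, ...] (length 2N)."""
    flat = []
    for v in values:
        flat.append(float(v.real))
        flat.append(float(v.imag))
    return flat


def within_tolerance(actual, expected, atol=ATOL, rtol=RTOL):
    """Assumed allclose-style check: |a - e| <= atol + rtol * |e| per element."""
    if len(actual) != len(expected):
        return False
    return all(abs(a - e) <= atol + rtol * abs(e) for a, e in zip(actual, expected))


def max_abs_diff(actual, expected):
    """Largest element-wise absolute difference between two flat arrays."""
    return max(abs(a - e) for a, e in zip(actual, expected))
```

Under this assumed rule, an element that is off by 0.023 against a reference value near 1.0 fails, since the allowed slack there is only about 0.002.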
Introduction
This document analyzes how AMD GPU support is implemented in Triton’s Gluon framework, examining architecture-specific optimizations, performance characteristics, and implementation details, and comparing them with the NVIDIA GPU support.
For background on Gluon and its motivation as a lower-level alternative to Triton, see my previous post: “Gluon: When Triton Isn’t Low-Level Enough”.
Background: GPU Programming Architecture Landscape
The GPU programming ecosystem has evolved with distinct architectural approaches between NVIDIA and AMD, creating implementation challenges for cross-platform frameworks.