My Journey from PyTorch to Gluon
After spending the last month diving into PyTorch, learning Triton, understanding CUDA, and even peeking at PTX/SASS assembly, I’ve come to a surprising realization: I’ve yet to meet anyone who’s actually writing raw CUDA code in production anymore. Everyone I’ve talked to, from ML engineers at startups to researchers at big tech companies, seems to have converged on Triton as their go-to solution for custom GPU kernels. And honestly? The performance they’re getting from fused kernels is impressive enough that I understand why.
But just when I thought I had the GPU programming landscape figured out, I stumbled upon something the Triton team has been quietly pushing: Gluon. I’ll try to explain why this is really interesting.
The CUDA Paradox
Here’s what puzzled me: CUDA has been around since 2007. It’s mature, well-documented, and theoretically gives you complete control over the GPU. Yet in my journey through various ML communities, Discord servers, and conference talks, I noticed something odd: almost nobody is writing CUDA kernels anymore.
The few who claimed to write “CUDA” were actually using libraries like cuBLAS, cuDNN, or Thrust. The actual kernel writers? They’d all migrated to Triton. When I asked why, the answer was always the same: “Triton gets me 80-90% of peak performance with 10% of the effort.”
Enter Triton: The Sweet Spot
My own experience with Triton confirmed this. After reading books like Programming Massively Parallel Processors, working through CUDA tutorials, and sinking time into warp synchronization primitives, Triton felt like a breath of fresh air:
import triton
import triton.language as tl

@triton.jit
def simple_addition_kernel(x_ptr, y_ptr, output_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(0)
    block_start = pid * BLOCK_SIZE
    offsets = block_start + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(output_ptr + offsets, x + y, mask=mask)
Straightforward, and it just works. The compiler handles all the complexity I was drowning in with CUDA: mallocs, memory coalescing, shared memory banking, instruction scheduling. Beautiful.
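For completeness, here’s a minimal host-side sketch of how I launch a kernel like this (the add helper, tensor sizes, and BLOCK_SIZE are my own choices, not from any tutorial):

import torch
import triton

def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    # Allocate the output and launch one program instance per BLOCK_SIZE chunk.
    output = torch.empty_like(x)
    n_elements = output.numel()
    BLOCK_SIZE = 1024
    grid = (triton.cdiv(n_elements, BLOCK_SIZE),)
    simple_addition_kernel[grid](x, y, output, n_elements, BLOCK_SIZE=BLOCK_SIZE)
    return output

x = torch.randn(1_000_000, device="cuda")
y = torch.randn(1_000_000, device="cuda")
assert torch.allclose(add(x, y), x + y)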
The Unsloth team has open-sourced their Triton kernels, which make a great reference for learning how more complex implementations are put together.
The Performance Ceiling
But here’s where my story takes a turn. While benchmarking various Triton kernels, I kept hitting walls. Then, after reading this blog post on Flash Attention, I discovered what OpenAI and others have been working on: Gluon.
Gluon: Triton’s Lower-Level Sibling
At first, I thought Gluon was just “Triton but harder.” I was wrong. It’s more like “Triton with the training wheels off.” Here’s what I found very interesting:
The Same Infrastructure, Different Philosophy
Gluon uses the same compiler infrastructure as Triton: same frontend, same backend. But it deliberately skips the optimization middle layer. Why would anyone want that?
Well, it turns out that sometimes the compiler’s optimizations are… suboptimal. And when you’re trying to squeeze out that last 20-40% of performance, you need control.
My First Gluon Kernel: A Humbling Experience
Here’s my first attempt at converting a simple Triton kernel to Gluon:
# What I wrote in Triton (simple and clean)
@triton.jit
def triton_memcpy(src, dst, N, BLOCK: tl.constexpr):
    pid = tl.program_id(0)
    offs = pid * BLOCK + tl.arange(0, BLOCK)
    mask = offs < N
    tl.store(dst + offs, tl.load(src + offs, mask=mask), mask=mask)

# What I had to write in Gluon (explicit everything)
from triton.experimental import gluon
from triton.experimental.gluon import language as gl

@gluon.jit
def gluon_memcpy(src, dst, N, BLOCK: gl.constexpr):
    pid = gl.program_id(0)
    # I have to specify the layout?!
    layout: gl.constexpr = gl.BlockedLayout(
        size_per_thread=[1],
        threads_per_warp=[32],
        warps_per_cta=[4],
        order=[0],
    )
    offs = pid * BLOCK + gl.arange(0, BLOCK, layout=layout)
    mask = offs < N
    gl.store(dst + offs, gl.load(src + offs, mask=mask), mask=mask)
At first, I was frustrated. Why do I need to specify layouts? Isn’t that what the compiler is for?
The “Aha!” Moment
Then I read the benchmarks from the tutorial examples. The naive Triton memcpy from the intro tutorial: 666 GB/s. The Gluon version with a carefully chosen layout: 6,600 GB/s on a GB200.
That’s not a typo. Nearly 10x improvement, straight from the tutorial benchmarks.
What’s Actually Happening?
Here’s what I learned after reading further:
- Layouts matter. A lot. The way data is distributed across threads, warps, and thread blocks can make or break your memory bandwidth utilization (see the sketch after this list).
- The compiler can’t read your mind. Triton makes educated guesses about optimal layouts, but it doesn’t know if you’re optimizing for a memory-bound or compute-bound kernel, whether you care more about latency or throughput, or what specific access patterns your algorithm needs.
- Modern GPUs are weird. Features like the Tensor Memory Accelerator (TMA) on Hopper GPUs or swizzled shared memory layouts can provide massive speedups, but only if you use them correctly.
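To make that first point about layouts concrete, here is a toy, plain-Python sketch (my own illustration, not Gluon code) of how the parameters of a 1D BlockedLayout, like the one in the Gluon memcpy above, decide which warp and lane own each element:

# Plain-Python model of a 1D blocked layout with size_per_thread=[1],
# threads_per_warp=[32], warps_per_cta=[4], order=[0]. This only illustrates
# the mapping, not how Gluon computes it internally.
def owner_of_element(i, size_per_thread=1, threads_per_warp=32, warps_per_cta=4):
    elems_per_tile = size_per_thread * threads_per_warp * warps_per_cta  # 128 here
    pos = i % elems_per_tile          # position within one tile of the layout
    thread = pos // size_per_thread   # which of the 128 threads holds it
    return thread // threads_per_warp, thread % threads_per_warp  # (warp, lane)

print(owner_of_element(0))    # (0, 0)
print(owner_of_element(31))   # (0, 31) -> 32 consecutive float32s hit one warp: coalesced
print(owner_of_element(32))   # (1, 0)

# With size_per_thread=4 instead, each lane would own 4 adjacent elements,
# which changes how loads get vectorized and how the warp's accesses coalesce.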
Real-World Gluon: Where It Shines
After playing with Gluon for a few weeks, here’s where I’ve found it actually makes a difference:
1. Memory-Bound Operations with Weird Access Patterns
The tutorials have a fascinating example of handling non-contiguous tensor operations. In the 02-layouts.py tutorial, they demonstrate copying a strided tensor (every other row of an 8 GB tensor) to make it contiguous:
# From the 02-layouts.py tutorial - handling non-contiguous memory patterns.
# This example shows how Gluon can efficiently handle strided tensors
# that PyTorch's .contiguous() struggles with.
import torch

# Setup: 8 GB tensor, taking every other row (non-contiguous view)
xnumel = 32 * 1024
ynumel = 64 * 1024
input = torch.randn((xnumel, ynumel), device="cuda")
input = input[::2]  # Take a view over every other row - now non-contiguous!
output = torch.empty_like(input)

# The tutorial compares three approaches:

# 1. Gluon 2D memcpy with a row-major layout
layout = gl.BlockedLayout([1, 1], [1, 32], [1, 4], [1, 0])
# Result: 6.258 TB/s

# 2. PyTorch's built-in contiguous() method
# Result: 2.946 TB/s (over 2x slower!)

# 3. Gluon 2D memcpy with the "transposed trick" - a column-major layout
layout = gl.BlockedLayout([1, 1], [32, 1], [4, 1], [0, 1])
# Result: 6.398 TB/s (best performance)
The tutorial explains that this performance difference comes from Gluon’s ability to choose the optimal layout for the specific memory access pattern. The “transposed trick” leverages better GPU scheduling and cache locality. Meanwhile, PyTorch’s generic contiguous() can’t specialize for this pattern, so it ends up more than 2x slower.
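If you want to sanity-check the PyTorch side of that comparison on your own GPU, here is a rough harness I would use (my own code, not the tutorial’s; the numbers you get will depend entirely on your hardware):

import torch
import triton

xnumel, ynumel = 32 * 1024, 64 * 1024
full = torch.randn((xnumel, ynumel), device="cuda")
strided = full[::2]  # non-contiguous view: every other row

# Median kernel time in milliseconds for PyTorch's generic copy.
ms = triton.testing.do_bench(lambda: strided.contiguous())

# Bytes moved: read the strided view once, write the contiguous copy once.
bytes_moved = 2 * strided.numel() * strided.element_size()
print(f"PyTorch contiguous(): {bytes_moved / (ms * 1e-3) / 1e12:.3f} TB/s")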
2. Fused Operations That Triton Can’t Figure Out
The tutorials demonstrate how Gluon enables complex fused operations. For instance, the layout conversion examples show how you can keep everything in registers and shared memory:
# From the 02-layouts.py tutorial - showing layout conversion for optimal memory access
@gluon.jit
def memcpy_2d_inout_kernel(in_ptr, out_ptr,
                           xnumel, ynumel, xstride_in, ystride_in, xstride_out, ystride_out,
                           layout_in: gl.constexpr, layout_out: gl.constexpr,
                           XBLOCK: gl.constexpr, YBLOCK: gl.constexpr):
    # ... setup code ...

    # Load with one layout optimized for the input tensor
    value = gl.load(in_ptr + in_offsets, mask=mask_in)

    # Convert to a different layout optimized for the output tensor.
    # This conversion happens in registers/shared memory, avoiding round-trips!
    value = gl.convert_layout(value, layout_out)

    # Store with the output-optimized layout
    gl.store(out_ptr + out_offsets, value, mask=mask_out)

# The tutorial shows this achieves 4.814 TB/s even with the layout conversion overhead,
# compared to 0.978-1.674 TB/s when using mismatched layouts.
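The tutorial’s setup code is elided above; my rough reconstruction of what that setup has to do (this is my own sketch, and the real tutorial code may differ) is to build per-axis indices in sub-layouts of layout_in and combine them with the strides:

# My own sketch of the elided setup, not the tutorial's actual code.
pid_x = gl.program_id(0)
pid_y = gl.program_id(1)
# 1D index vectors live in slices of the 2D layout, then broadcast to 2D.
xindex = pid_x * XBLOCK + gl.arange(0, XBLOCK, layout=gl.SliceLayout(1, layout_in))
yindex = pid_y * YBLOCK + gl.arange(0, YBLOCK, layout=gl.SliceLayout(0, layout_in))
in_offsets = xindex[:, None] * xstride_in + yindex[None, :] * ystride_in
mask_in = (xindex < xnumel)[:, None] & (yindex < ynumel)[None, :]
# ...and the same again with layout_out / xstride_out / ystride_out for the store side.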
3. Actually Using Modern GPU Features
This was the big one for me. Triton abstracts away features like async copies, warp specialization, and persistent kernels. The tutorials show exactly how to use these in Gluon:
# From the 03-async-copy.py tutorial - demonstrating async copy operations.
# This shows how to overlap memory transfers with computation on Ampere+ GPUs.
# (`cp` here is the namespace for the Ampere cp.async operations.)
@gluon.jit
def memcpy_1d_cpasync_kernel(in_ptr, out_ptr, xnumel, XBLOCK: gl.constexpr):
    pid = gl.program_id(0)
    layout: gl.constexpr = gl.BlockedLayout([1], [32], [4], [0])
    offsets = pid * XBLOCK + gl.arange(0, XBLOCK, layout=layout)
    mask = offsets < xnumel

    # Allocate shared memory with a specific layout to avoid bank conflicts
    smem_layout: gl.constexpr = gl.SwizzledSharedLayout(vec=1, per_phase=1, max_phase=1, order=[0])
    smem = gl.allocate_shared_memory(gl.float32, [XBLOCK], layout=smem_layout)

    # Issue the async copy - this starts in the background!
    cp.async_copy_global_to_shared(smem, in_ptr + offsets, mask=mask)
    cp.commit_group()

    # In a real kernel, you could do other work here while the copy happens.
    # The tutorial mentions this is key for hiding memory latency.

    # Wait until the async copy completes (0 = wait for all outstanding groups)
    cp.wait_group(0)

    # Now retrieve the data from shared memory
    value = smem.load(layout)
    gl.store(out_ptr + offsets, value, mask=mask)

# The tutorial notes this requires Ampere (compute capability 8.0) or newer
# and demonstrates the foundation for software pipelining in later tutorials.
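One thing the tutorial snippets don’t show is the host side. In my experience with the tutorials, launching a gluon.jit kernel looks just like launching a Triton one; here’s a minimal sketch (tensor size and XBLOCK are my own choices, and it needs an Ampere-or-newer GPU):

import torch
import triton

n = 1 << 24
inp = torch.randn(n, device="cuda", dtype=torch.float32)
out = torch.empty_like(inp)

# Same grid / meta-parameter launch syntax as a triton.jit kernel.
XBLOCK = 1024
grid = (triton.cdiv(n, XBLOCK),)
memcpy_1d_cpasync_kernel[grid](inp, out, n, XBLOCK=XBLOCK)

assert torch.equal(inp, out)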
The Learning Curve (Or: Why This Isn’t For Everyone)
Let me be honest: Gluon is hard. Really hard. Here’s what you need to understand to be productive:
- GPU Memory Hierarchy: Not just “global vs shared” but cache lines, sectors, banking, and swizzling (there’s a toy sketch of swizzling after this list).
- Warp Execution Model: How warps actually execute, divergence, synchronization primitives.
- Layout Theory: This is almost a field unto itself. The Gluon tutorials spend more time on layouts than anything else.
- Hardware-Specific Features: Each GPU generation has its own quirks and features. A kernel optimized for an A100 might be terrible on an H100.
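If “swizzling” in that first bullet sounds abstract, the core idea fits in a few lines. Here’s a toy model (my own illustration, not how Gluon’s SwizzledSharedLayout actually computes addresses) of why XOR-ing the column with the row spreads a column access across all 32 shared-memory banks:

# Toy bank-conflict model: shared memory has 32 four-byte banks, and a 32x32
# float32 tile maps every element of a column into the same bank unless we
# permute ("swizzle") the columns per row.
BANKS = 32

def bank(row, col, swizzle=False):
    if swizzle:
        col = col ^ (row % BANKS)  # XOR swizzle: a different column permutation per row
    return col % BANKS

# Column 0, read with one thread per row (e.g. a transposed access pattern):
print(len({bank(r, 0) for r in range(32)}))                # 1  -> 32-way bank conflict
print(len({bank(r, 0, swizzle=True) for r in range(32)}))  # 32 -> conflict-free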
My Current Take: The Right Tool for the Right Job
After this journey, here’s how I think about the GPU programming stack:
flowchart TD
A[PyTorch Operations] -->|"Need custom kernel?"| B{Performance Critical?}
B -->|"No"| C[Triton]
B -->|"Yes"| D{Is Triton fast enough?}
D -->|"Yes"| C
D -->|"No"| E{Do you have GPU expertise?}
E -->|"No"| F[Optimize Triton / Hire Expert]
E -->|"Yes"| G[Gluon]
style A fill:#ffe4b5
style C fill:#98fb98
style G fill:#ff6b6b
style F fill:#ffd700
Should you learn Gluon?
My main takeaway is that it’s not only about performance: some things are literally impossible in Triton but trivial in Gluon (like certain persistent kernel patterns).
Here’s my honest advice:
Learn Gluon if:
- You’re already comfortable with Triton and hitting performance limits
- You enjoy low-level optimization puzzles
- Your workload genuinely needs that last 20% of performance
- You’re curious about how GPUs actually work under the hood
Skip Gluon if:
- You’re still learning GPU programming (stick with Triton)
- Your kernels are already fast enough
- You value development velocity over peak performance
- You need portability across different GPU vendors
The Cross-Architecture Reality Check
Here’s another thing that’s been on my mind: the GPU landscape isn’t just NVIDIA anymore. With AMD’s MI300 series gaining traction and Intel’s attempts with Arc/Ponte Vecchio, writing architecture-specific code is becoming increasingly problematic. This is where Triton’s abstraction layer suddenly makes even more sense.
I recently read a fascinating Black Hat talk, “How to Secure Unique Ecosystem Shipping 1 Billion+ Cores”, by Adam Zabrocki and Marko Mitic from NVIDIA (slides here). Beyond the security implications, what struck me was their discussion of NVIDIA’s preparation for their RISC-V ecosystem with NVRISC-V. The GPU ecosystem is changing a lot, and it will keep changing.
This makes me think that hand-tuned, architecture-specific optimization (whether in CUDA, ROCm, or even Gluon) might end up being like security architecture work: critical for a very small audience, but not something most developers will ever touch. The future probably belongs to portable abstractions like Triton, with escape hatches like Gluon for when you absolutely need them.
What’s Next?
The Triton team seems committed to pushing Gluon forward. From what I gathered at the recent community meetup:
- Better tooling is coming (current debugging tools are… spartan)
- More examples and documentation are in the works
- There’s talk of Gluon-Triton interop for hybrid kernels
- The upcoming Triton Developer Conference will have significant Gluon content
- Cross-architecture support remains a key focus for Triton (though Gluon will likely remain NVIDIA-specific)
My Takeaway
Gluon represents something interesting in the GPU programming world: an acknowledgment that sometimes, abstractions need escape hatches. It’s not trying to replace Triton any more than Triton is trying to replace PyTorch. It’s another tool in the toolbox, and for the right problems, it’s incredibly powerful.
Will I write all my kernels in Gluon? Absolutely not. Will I reach for it when Triton isn’t cutting it? Probably, though it’s very unlikely I’ll ever find myself in that scenario.
The GPU programming landscape is more nuanced than I initially thought. It’s not just “CUDA or bust” anymore. We have a whole spectrum of tools, each with its sweet spot. And honestly? That’s exactly what we need as we push the boundaries of what’s possible with modern AI and HPC workloads.
P.S. - If you’re interested in learning more, the Gluon tutorials are actually quite good, though prepare to read them multiple times.