GPU production constraints are creating infrastructure bottlenecks, and multi-GPU programming, particularly vendor-agnostic implementations, has become essential. In their GPU Mode presentation, AMD Research engineers Muhammad Awad, Muhammad Osama, and Brandon Potter introduced Iris—a Python library that enables fine-grained multi-GPU programming in Triton. As with my previous Gluon blog post, this post captures my understanding and interpretation of their work, serving as both technical documentation and a personal reference for this emerging multi-GPU programming paradigm.
Technical Problem 🔗
Current multi-GPU programming relies on the bulk synchronous parallel (BSP) model through libraries like NCCL. This model enforces sequential phases:
- Complete all computation
- Synchronize on host
- Execute communication kernel
- Synchronize again
- Resume computation
This pattern wastes GPU cycles and requires CPU intervention for every communication phase. The fundamental limitation: GPUs cannot initiate communication directly. Every data transfer requires host orchestration, creating unnecessary synchronization points and preventing fine-grained overlap.
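Concretely, a host-orchestrated step with PyTorch and an NCCL collective looks roughly like this sketch; compute_step and next_input are placeholder functions, not library APIs.
import torch
import torch.distributed as dist

def compute_step(x):
    return x * 2          # stand-in for the real compute kernel

def next_input(out):
    return out            # stand-in for preparing the next iteration's input

def bsp_loop(x, num_steps):
    for _ in range(num_steps):
        out = compute_step(x)        # 1. complete all computation
        torch.cuda.synchronize()     # 2. synchronize on host
        dist.all_reduce(out)         # 3. execute communication kernel
        torch.cuda.synchronize()     # 4. synchronize again
        x = next_input(out)          # 5. resume computation
    return x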
The AMD team built Iris to enable GPU-initiated communication, allowing kernels to directly orchestrate multi-GPU operations without host intervention.
Iris Architecture: GPU-Initiated Communication 🔗
Iris fundamentally changes the multi-GPU programming model by enabling device-side communication primitives. GPUs can directly initiate loads, stores, and atomic operations to remote GPUs without CPU involvement. This eliminates the host-device synchronization bottleneck.
Address Space Remapping 🔗
Key Insight: The __translate function in Iris implements linear address remapping between GPU address spaces, similar to how the Linux kernel’s virt_to_phys() and __pa() macros perform simple offset-based translation. Since no complex page-table walk is needed, Iris uses direct offset arithmetic—each GPU’s heap starts at a different base address, and translation simply means computing the offset from one base and applying it to another.
For readers familiar with Linux/Windows kernel memory management, this is conceptually similar to the Linux kernel’s direct-map address translation (pure offset arithmetic, no CR3 page-table walk):
// Linux kernel macros for simple address translation
#define __pa(x) ((unsigned long)(x) - PAGE_OFFSET)
#define __va(x) ((void *)((unsigned long)(x) + PAGE_OFFSET))
Iris implements the same concept for multi-GPU systems:
GPU_0_ptr = base_0 + offset
GPU_1_ptr = base_1 + offset // Same offset, different base
The translation is deterministic and requires only:
- Array of heap base addresses (one per GPU)
- Simple arithmetic (subtract source base, add destination base)
- No page tables, no TLB, no page faults
- Direct memory access via XGMI/PCIe interconnect
flowchart LR
subgraph Linux["Linux Kernel __pa()"]
KV[Kernel Virtual: 0xFFFF888012345678]
PO[PAGE_OFFSET: 0xFFFF888000000000]
PA[Physical: 0x12345678]
KV -->|"addr - PAGE_OFFSET"| PA
end
subgraph Iris["Iris __translate()"]
G0[GPU0 Ptr: 0x7F0000012345]
B0[Base0: 0x7F0000000000]
OFF[Offset: 0x12345]
B1[Base1: 0x7F8000000000]
G1[GPU1 Ptr: 0x7F8000012345]
G0 -->|"ptr - base0"| OFF
OFF -->|"base1 + offset"| G1
end
Linux -.->|"Same concept:<br/>Linear offset translation"| Iris
The Symmetric Heap Implementation 🔗
Iris implements a symmetric heap—a Partitioned Global Address Space (PGAS) that provides unified memory addressing across GPUs. The key insight: any symmetric variable can be located on any GPU using just two offsets:
- Heap Base Offset: Where each GPU’s heap starts in its virtual address space
- Variable Offset: Where the variable sits within the symmetric heap (identical across all GPUs)
The initialization process:
- Heap Allocation: Each GPU allocates a local heap at a different base address
- All-Gather Exchange: All GPUs share their heap base addresses
- Symmetric Allocation: Variables allocated at same offset (e.g., TENSOR_X at 0x448) on all heaps
- Translation Table: Each GPU maintains array of all heap bases for address translation
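A minimal host-side sketch of these four steps, assuming an initialized torch.distributed process group and using hypothetical helper names (this is an illustration of the mechanism, not the actual Iris initialization code):
import torch
import torch.distributed as dist

def init_symmetric_heap(heap_size, device):
    # 1. Heap allocation: one contiguous buffer on this GPU
    heap = torch.empty(heap_size, dtype=torch.uint8, device=device)
    local_base = torch.tensor([heap.data_ptr()], dtype=torch.int64, device=device)

    # 2. All-gather exchange: every rank learns every heap base address
    world_size = dist.get_world_size()
    gathered = [torch.zeros_like(local_base) for _ in range(world_size)]
    dist.all_gather(gathered, local_base)

    # 4. Translation table: heap_bases[i] is the heap base of rank i
    heap_bases = torch.cat(gathered)
    return heap, heap_bases

def symmetric_alloc(bump_offset, nbytes, alignment=256):
    # 3. Symmetric allocation: every rank advances the same bump pointer,
    #    so a variable sits at the same offset inside every heap
    offset = (bump_offset + alignment - 1) // alignment * alignment
    return offset, offset + nbytes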
The core translation function demonstrates the elegance:
@triton.jit
def __translate(ptr, from_rank, to_rank, heap_bases):
    from_base = tl.load(heap_bases + from_rank)
    to_base = tl.load(heap_bases + to_rank)
    # Convert pointer to integer for arithmetic
    ptr_int = tl.cast(ptr, tl.uint64)
    # Calculate offset in source GPU's heap
    offset = ptr_int - from_base
    # Apply offset to destination GPU's heap base
    to_base_byte = tl.cast(to_base, tl.pointer_type(tl.int8))
    translated_ptr_byte = to_base_byte + offset
    # Cast back to original pointer type
    translated_ptr = tl.cast(translated_ptr_byte, ptr.dtype)
    return translated_ptr
Translation Example: Accessing TENSOR_X Across GPUs 🔗
Using the actual addresses from the symmetric heap:
# TENSOR_X is at offset 0x448 on all GPUs
# GPU 0 wants to access TENSOR_X on GPU 1
# Step 1: GPU 0's view
tensor_x_gpu0 = 0xFFFCABC0 + 0x448 # = 0xFFFCB008
# Step 2: Calculate offset from GPU 0's heap base
offset = 0xFFFCB008 - 0xFFFCABC0 # = 0x448
# Step 3: Apply offset to GPU 1's heap base
tensor_x_gpu1 = 0xFFFC0420 + 0x448 # = 0xFFFC0868
# Result: GPU 0 can directly access GPU 1's TENSOR_X at 0xFFFC0868
flowchart LR
subgraph Translation["Address Translation for TENSOR_X"]
GPU0_ADDR[GPU0 Address: 0xFFFCB008]
GPU0_BASE[GPU0 Base: 0xFFFCABC0]
OFFSET[Offset: 0x448]
GPU1_BASE[GPU1 Base: 0xFFFC0420]
GPU1_ADDR[GPU1 Address: 0xFFFC0868]
GPU0_ADDR -->|"subtract base"| OFFSET
GPU0_BASE --> OFFSET
OFFSET -->|"add to new base"| GPU1_ADDR
GPU1_BASE --> GPU1_ADDR
end
Result[Direct Memory Access via XGMI]
GPU1_ADDR --> Result
Why Linear Address Translation Works 🔗
The brilliance of Iris’s approach is its simplicity. Rather than implementing complex virtual memory systems, Iris recognizes that GPUs already have:
- Flat memory model: Each GPU sees a contiguous address space
- Hardware coherence: XGMI/NVLink maintains cache coherency
- Direct addressing: GPUs can access any address in their space
By using simple offset-based translation (like Linux’s __pa()), Iris achieves:
- Zero abstraction overhead: Just pointer arithmetic
- Predictable performance: No TLB misses or page faults
- Hardware efficiency: Leverages existing GPU memory controllers
- Symmetric design: Every GPU uses identical heap layout
This is fundamentally different from traditional distributed memory systems that require:
- Complex routing tables
- Multiple indirection levels
- Software-managed coherence
- Message serialization/deserialization
Iris proves that multi-GPU memory management doesn’t need complexity—it needs the right primitive: linear address remapping.
Performance validation shows:
- XGMI bandwidth: 96.3% of theoretical maximum
- HBM bandwidth: 93.3% for local access
- Translation overhead: <4% compared to direct access
flowchart TB
subgraph BSP["Bulk Synchronous (Traditional)"]
A[Compute] --> B[Barrier]
B --> C[Communicate]
C --> D[Barrier]
D --> E[Compute]
end
subgraph Iris["Fine-Grained (Iris)"]
F[Unified Kernel]
F --> G[Compute + Store to Remote]
G --> H[Continue Compute]
end
Implementation Patterns 🔗
Iris supports four execution patterns for computation-communication overlap:
Pattern 1: Bulk Synchronous 🔗
Traditional sequential execution. Baseline for comparison.
- Launch compute kernel
- Barrier
- Launch communication kernel
- Barrier
Pattern 2: Producer-Consumer 🔗
Partition compute units between computation and communication.
- Assign N CUs for compute
- Assign M CUs for communication
- Use atomics for synchronization
- Achieves up to 2.5x speedup
Pattern 3: Sequential Fusion 🔗
Single kernel performs computation then communication.
- No intermediate memory access
- Higher register pressure
- 1.2-1.5x speedup for small tiles
Pattern 4: Work Group Specialization 🔗
Single kernel with internal branching based on block ID.
if block_id < compute_blocks:
    do_computation()
else:
    do_communication()
- Best of producer-consumer without multiple kernels
- 1.6x average speedup
Memory Model and Synchronization Primitives 🔗
Device-Side Atomics 🔗
Iris provides GPU-native atomic operations that work across the memory hierarchy. These aren’t host-controlled barriers—they’re fine-grained synchronization primitives that GPUs execute directly.
Supported atomics:
- atomic_cas (compare-and-swap)
- atomic_add, atomic_xchg
- atomic_min/max
- atomic_and, atomic_or, atomic_xor
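For example, a kernel can bump a progress counter that lives on another rank’s heap with a single device-side call; the pointer and rank names here are illustrative, and the signature mirrors the atomic_add example later in this post.
# Inside a @triton.jit kernel: add 1 to a counter on dest_rank's symmetric heap
iris.atomic_add(progress_ptr, 1, cur_rank, dest_rank, heap_bases,
                sem="relaxed", scope="sys")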
Memory Ordering Semantics 🔗
Iris implements the full memory model with acquire-release semantics:
# Producer GPU
data = compute_tile()
iris.store(data, to_rank=1, heap_bases=bases)
iris.atomic_cas(flag, 0, 1, sem="release", scope="system")

# Consumer GPU
while iris.atomic_cas(flag, 1, 1, sem="acquire", scope="system") != 1:
    pass
data = iris.load(from_rank=0, heap_bases=bases)
The release semantic guarantees that the producer’s stores are visible before the flag update becomes visible; the acquire semantic prevents the consumer’s loads from being reordered ahead of the successful flag check.
Scope Hierarchy 🔗
Iris exposes the full GPU memory scope hierarchy:
- Wavefront: Synchronization within a single wavefront
- Workgroup: Between threads in the same workgroup
- GPU: Across all workgroups on the same GPU
- System: Cross-GPU synchronization via XGMI
- World (planned): Cross-node via RDMA
This explicit scope control enables developers to choose the minimal synchronization overhead for their use case.
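For example, the same atomic_cas primitive can target different scopes depending on who must observe the flag; local_lock and remote_flag are illustrative names.
# Publish a flag to workgroups on the same GPU: "gpu" scope is enough
iris.atomic_cas(local_lock, 0, 1, sem="release", scope="gpu")
# Publish a flag to another GPU over XGMI: "system" scope is required
iris.atomic_cas(remote_flag, 0, 1, sem="release", scope="system")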
Advanced Cache Management 🔗
Iris exposes cache placement controls typically available only in assembly. This level of control is critical for multi-GPU performance—remote data often has different reuse patterns than local data.
# Write-through for non-temporal data
iris.store(data, cache_modifier="wt")
# Cache in L2 for reused data
iris.load(data, cache_modifier="ca:L2")
Cache modifier options:
- Write-through (wt): Bypass L1/L2 for non-temporal data
- Cache at L2 (ca:L2): Keep data in L2, bypass L1
- Non-cached (nc): Direct to memory, no caching
- Write-back (wb): Normal caching behavior
Example optimization for tall-skinny matrix multiplication:
# Small matrix fits in L2 - cache it
B = iris.load(B_ptr, cache_modifier="ca:L2")
# Large matrix streams through - don't pollute cache
A = iris.load(A_ptr, cache_modifier="wt")
This prevents cache thrashing and can improve performance by 15-20% for memory-bound kernels.
Performance Analysis 🔗
Benchmark results on AMD MI300X:
- Point-to-point bandwidth: 93-96% of theoretical
- GEMM+AllScatter: 1.2-2.5x speedup vs NCCL
- Flash Decode: 1.4x speedup with fused kernels
- Overhead: ~4% vs raw assembly
Key factors:
- Pattern selection depends on tile size
- Optimal CU partitioning varies by workload
- Register pressure limits fusion benefits
Real-World Code Examples 🔗
Basic Producer-Consumer Pattern 🔗
From the Iris examples, here’s a complete producer-consumer implementation:
@triton.jit
def producer_kernel(source, target, flag, size,
                    producer_rank: tl.constexpr,
                    consumer_rank: tl.constexpr,
                    BLOCK_SIZE: tl.constexpr,
                    heap_bases_ptr):
    pid = tl.program_id(0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < size
    # Load from local memory
    values = iris.load(source + offsets, producer_rank,
                       producer_rank, heap_bases_ptr, mask=mask)
    # Store to remote GPU
    iris.store(target + offsets, values, producer_rank,
               consumer_rank, heap_bases_ptr, mask=mask)
    # Signal completion with flag
    tl.store(flag + pid, 1)

@triton.jit
def consumer_kernel(buffer, flag, size,
                    consumer_rank: tl.constexpr,
                    BLOCK_SIZE: tl.constexpr,
                    heap_bases_ptr):
    pid = tl.program_id(0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    # Spin-wait for producer
    done = tl.load(flag + pid)
    while done == 0:
        done = tl.load(flag + pid)
    # Load data from local memory (written by producer)
    values = iris.load(buffer + offsets, consumer_rank,
                       consumer_rank, heap_bases_ptr,
                       mask=offsets < size)
Atomic Operations for Synchronization 🔗
Iris provides GPU-native atomic operations that work across GPUs:
@triton.jit
def atomic_add_kernel(source, result, size,
                      source_rank: tl.constexpr,
                      dest_rank: tl.constexpr,
                      BLOCK_SIZE: tl.constexpr,
                      heap_bases_ptr):
    pid = tl.program_id(0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < size
    # Atomic add across GPUs
    result = iris.atomic_add(source + offsets, 1,
                             source_rank, dest_rank,
                             heap_bases_ptr, mask=mask,
                             sem="relaxed", scope="sys")
Work Group Specialization for GEMM 🔗
The most sophisticated pattern: a single kernel splits its workgroups between computation and communication (condensed from the Iris GEMM all-scatter example):
@triton.jit
def gemm_all_scatter_wg_specialization(A, B, C, locks,
                                       GEMM_SMS: tl.constexpr,
                                       NUM_SMS: tl.constexpr,
                                       heap_bases, cur_rank):
    pid = tl.program_id(0)
    # Workgroup specialization
    if pid < GEMM_SMS:
        # Compute path
        for tile_id in range(pid, total_tiles, GEMM_SMS):
            # Perform GEMM computation
            acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)
            for k in range(0, K, BLOCK_K):
                a = tl.load(A_ptr)
                b = tl.load(B_ptr)
                acc += tl.dot(a, b)
            # Store with write-through cache modifier
            tl.store(c_global + offset, acc, cache_modifier=".wt")
            tl.debug_barrier()
            tl.store(locks + tile_id, 1, cache_modifier=".wt")
    else:
        # Communication path
        COMM_SMS = NUM_SMS - GEMM_SMS
        for tile_id in range(total_tiles):
            # Wait for compute to complete
            while iris.atomic_cas(locks + tile_id, 1, 1,
                                  sem="acquire", scope="gpu") != 1:
                pass
            # Scatter to all ranks
            for rank in range(world_size):
                if rank != cur_rank:
                    iris.store(remote_ptr, local_data,
                               cur_rank, rank, heap_bases)
Practical Setup 🔗
import iris
import torch
import torch.distributed as dist
# Initialize distributed environment
dist.init_process_group(backend="nccl")
# Create Iris instance with custom heap size
ir = iris.Iris(heap_size=1 << 33)  # 8 GiB heap
# Allocate tensors on the symmetric heap managed by the Iris instance
data = ir.rand((1024, 1024), dtype=torch.float16)
num_tiles = 1024  # e.g. one flag per producer program (see the launch sketch below)
flags = ir.zeros(num_tiles, dtype=torch.int32)
# Get heap bases for kernel launches
heap_bases = ir.get_heap_bases()
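With this setup, launching the producer kernel sketched earlier could look like the following; the grid sizing and rank choices are illustrative, and passing the same symmetric tensor as source and target means the producer reads its local copy and writes rank 1’s copy.
import triton

BLOCK_SIZE = 1024
size = data.numel()
grid = (triton.cdiv(size, BLOCK_SIZE),)  # 1024 programs, matching num_tiles flags

# data and flags live on the symmetric heap, so the kernel can translate their
# addresses from rank 0 to rank 1 through heap_bases (illustrative launch)
producer_kernel[grid](data, data, flags, size,
                      producer_rank=0, consumer_rank=1,
                      BLOCK_SIZE=BLOCK_SIZE,
                      heap_bases_ptr=heap_bases)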
Pattern Selection Guidelines 🔗
Based on the Iris benchmarks across different GEMM shapes:
| Tile Size | Best Pattern | Expected Speedup | CU Split |
|---|---|---|---|
| <16KB | Sequential Fusion | 1.2-1.5x | N/A |
| 16-64KB | Work Group Spec | 1.6-1.8x | 80/20 |
| >64KB | Producer-Consumer | 1.8-2.5x | 70/30 |
| Variable | Dynamic Queue | 1.5-2.0x | Dynamic |
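These heuristics can be folded into a naive selection helper; the thresholds and splits below are simply the values quoted in the table.
def select_pattern(tile_bytes):
    # Naive mapping from per-tile footprint to execution pattern,
    # mirroring the table above (thresholds are not tuned further)
    if tile_bytes < 16 * 1024:
        return "sequential_fusion", None        # 1.2-1.5x, no CU split
    if tile_bytes <= 64 * 1024:
        return "workgroup_specialization", 0.8  # 80/20 compute/comm split
    return "producer_consumer", 0.7             # 70/30 compute/comm split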
Optimization Checklist 🔗
- Profile computation-to-communication ratio
- Experiment with CU partitioning (start with 70/30 split)
- Use write-through for producer data
- Place consumer data in L2 cache
- Align tile sizes to cache lines
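As a starting point for the CU-partitioning experiment above, the 70/30 split on an MI300X (304 CUs) works out to roughly the following.
NUM_SMS = 304                  # CU count on MI300X; adjust for your device
GEMM_SMS = int(NUM_SMS * 0.7)  # ~212 workgroups dedicated to compute
COMM_SMS = NUM_SMS - GEMM_SMS  # ~92 workgroups dedicated to communication
grid = (NUM_SMS,)              # one persistent workgroup per CU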
Current Limitations and Solutions 🔗
| Limitation | Current State | Solution in Development |
|---|---|---|
| Single-node only | XGMI/NVLink only | RDMA support for multi-node |
| Manual heap allocation | Explicit iris.alloc() | Automatic heap management |
| Pattern selection | Manual choice | Analytical model for auto-selection |
| CU partitioning | Manual tuning | Work-queue dynamic scheduling |
| Triton-specific | Python/Triton only | C++ API planned |
Available Examples in Iris 🔗
The Iris repository includes 14 complete examples demonstrating various patterns:
- Basic Operations (00-05): Load, store, atomic operations
- Message Passing (06): Producer-consumer with flags
- GEMM Patterns (07-12):
  - All-scatter with different strategies
  - Atomic-based all-reduce
  - One-shot all-reduce
  - Work group specialization
  - Producer-consumer variants
  - Bulk synchronous baseline
- Flash Decode (13): Attention mechanism with multi-GPU
Each example includes benchmarking code and validation against reference implementations.
Roadmap 🔗
The AMD team is working on:
- Multi-node support via RDMA (“world” scope)
- Automatic pattern selection via analytical models
- Integration with vLLM and inference frameworks
- Reusable collective operation library
- Cross-vendor abstraction layer
- C++ API for non-Python environments
GPU-Initiated Communication: The Paradigm Shift 🔗
The key innovation in Iris is GPU-initiated communication. Traditional frameworks require the CPU to orchestrate every multi-GPU operation. Iris inverts this model:
| Traditional (Host-Initiated) | Iris (GPU-Initiated) |
|---|---|
| CPU launches compute kernel | GPU executes unified kernel |
| CPU waits for completion | GPU computes tile |
| CPU launches comm kernel | GPU directly stores to remote |
| CPU synchronizes GPUs | GPU sets atomic flag |
| CPU launches next kernel | Remote GPU polls and consumes |
This eliminates thousands of CPU-GPU round trips per second in typical workloads.
Vendor-Agnostic Future: A Personal Perspective 🔗
I strongly believe in a vendor-agnostic GPU future. While Iris currently focuses on AMD hardware and Gluon (as discussed in my previous post) targets NVIDIA’s Blackwell architecture, the underlying principles are universal. The core abstractions that make these frameworks powerful aren’t vendor-specific:
- Linear address translation: Simple offset arithmetic works everywhere
- Device-side atomics: Every modern GPU has compare-and-swap
- Direct memory access: PCIe, XGMI, NVLink all provide coherent interconnects
- Flat memory models: GPUs fundamentally see contiguous address spaces
Looking ahead to 2026, I expect we’ll see convergence. Iris’s straightforward implementation and Gluon’s low-level control demonstrate that efficient multi-GPU programming doesn’t require vendor lock-in—it requires the right primitives. As GPU availability becomes increasingly unpredictable and new vendors enter the market (Huawei? RISC-V GPUs?), frameworks that abstract vendor differences while maintaining performance will become critical infrastructure.
Conclusion 🔗
Iris represents a fundamental rethinking of multi-GPU programming. Three core innovations make this possible:
- GPU-Initiated Communication: Eliminates CPU bottlenecks by allowing GPUs to directly orchestrate multi-GPU operations
- Symmetric Heap: Provides zero-copy remote access through elegant address translation
- Device-Side Synchronization: Enables fine-grained producer-consumer patterns without kernel boundaries
The implementation achieves 96% of theoretical bandwidth—proof that current multi-GPU frameworks are over-engineered. As hardware vendors proliferate and GPU availability becomes unpredictable, Iris’s primitives-first approach offers a path to vendor-agnostic multi-GPU programming.
The open-source implementation (GitHub) provides both a production-ready tool and a reference architecture for next-generation multi-GPU frameworks. The simplicity of the core translation function—three lines that enable cross-GPU memory access—demonstrates that the right abstractions matter more than code volume.