GPU production constraints are creating infrastructure bottlenecks, and multi-GPU programming, particularly vendor-agnostic implementations, has become essential. In their GPU Mode presentation, AMD Research engineers Muhammad Awad, Muhammad Osama, and Brandon Potter introduced Iris—a Python library that enables fine-grained multi-GPU programming in Triton. As with my previous Gluon blog post, this post captures my understanding and interpretation of their work, serving as both technical documentation and a personal reference for this emerging multi-GPU programming paradigm.
Technical Problem 🔗
Current multi-GPU programming relies on the bulk synchronous parallel (BSP) model through libraries like NCCL. This model enforces sequential phases:
- Complete all computation
- Synchronize on host
- Execute communication kernel
- Synchronize again
- Resume computation
This pattern wastes GPU cycles and requires CPU intervention for every communication phase. The fundamental limitation: GPUs cannot initiate communication directly. Every data transfer requires host orchestration, creating unnecessary synchronization points and preventing fine-grained overlap.
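Concretely, a host-orchestrated step with PyTorch and an NCCL collective looks roughly like this sketch; compute_step and next_input are placeholder functions, not library APIs.
import torch
import torch.distributed as dist

def compute_step(x):
    return x * 2          # stand-in for the real compute kernel

def next_input(out):
    return out            # stand-in for preparing the next iteration's input

def bsp_loop(x, num_steps):
    for _ in range(num_steps):
        out = compute_step(x)        # 1. complete all computation
        torch.cuda.synchronize()     # 2. synchronize on host
        dist.all_reduce(out)         # 3. execute communication kernel
        torch.cuda.synchronize()     # 4. synchronize again
        x = next_input(out)          # 5. resume computation
    return x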
The AMD team built Iris to enable GPU-initiated communication, allowing kernels to directly orchestrate multi-GPU operations without host intervention.
Iris Architecture: GPU-Initiated Communication 🔗
Iris fundamentally changes the multi-GPU programming model by enabling device-side communication primitives. GPUs can directly initiate loads, stores, and atomic operations to remote GPUs without CPU involvement. This eliminates the host-device synchronization bottleneck.
Address Space Remapping 🔗
Key Insight: The __translate function in Iris implements linear address remapping between GPU address spaces, similar to how the Linux kernel’s virt_to_phys() and __pa() macros perform simple offset-based translation. Since no complex page-table walk is needed, Iris uses direct offset arithmetic—each GPU’s heap starts at a different base address, and translation simply means computing the offset from one base and applying it to another.
For readers familiar with Linux/Windows kernel memory management, this is conceptually similar to the Linux kernel’s direct-map address translation (pure offset arithmetic, no CR3 page-table walk):
// Linux kernel macros for simple address translation
#define __pa(x) ((unsigned long)(x) - PAGE_OFFSET)
#define __va(x) ((void *)((unsigned long)(x) + PAGE_OFFSET))
Iris implements the same concept for multi-GPU systems:
GPU_0_ptr = base_0 + offset
GPU_1_ptr = base_1 + offset // Same offset, different base
The translation is deterministic and requires only:
- Array of heap base addresses (one per GPU)
- Simple arithmetic (subtract source base, add destination base)
- No page tables, no TLB, no page faults
- Direct memory access via XGMI/PCIe interconnect
flowchart LR
subgraph Linux["Linux Kernel __pa()"]
KV[Kernel Virtual: 0xFFFF888012345678]
PO[PAGE_OFFSET: 0xFFFF888000000000]
PA[Physical: 0x12345678]
KV -->|"addr - PAGE_OFFSET"| PA
end
subgraph Iris["Iris __translate()"]
G0[GPU0 Ptr: 0x7F0000012345]
B0[Base0: 0x7F0000000000]
OFF[Offset: 0x12345]
B1[Base1: 0x7F8000000000]
G1[GPU1 Ptr: 0x7F8000012345]
G0 -->|"ptr - base0"| OFF
OFF -->|"base1 + offset"| G1
end
Linux -.->|"Same concept:<br/>Linear offset translation"| Iris
The Symmetric Heap Implementation 🔗
Iris implements a symmetric heap—a Partitioned Global Address Space (PGAS) that provides unified memory addressing across GPUs. The key insight: any symmetric variable can be located on any GPU using just two offsets:
- Heap Base Offset: Where each GPU’s heap starts in its virtual address space
- Variable Offset: Where the variable sits within the symmetric heap (identical across all GPUs)
The initialization process:
- Heap Allocation: Each GPU allocates a local heap at a different base address
- All-Gather Exchange: All GPUs share their heap base addresses
- Symmetric Allocation: Variables allocated at same offset (e.g., TENSOR_X at 0x448) on all heaps
- Translation Table: Each GPU maintains array of all heap bases for address translation
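A minimal host-side sketch of these four steps, assuming an initialized torch.distributed process group and using hypothetical helper names (this is an illustration of the mechanism, not the actual Iris initialization code):
import torch
import torch.distributed as dist

def init_symmetric_heap(heap_size, device):
    # 1. Heap allocation: one contiguous buffer on this GPU
    heap = torch.empty(heap_size, dtype=torch.uint8, device=device)
    local_base = torch.tensor([heap.data_ptr()], dtype=torch.int64, device=device)

    # 2. All-gather exchange: every rank learns every heap base address
    world_size = dist.get_world_size()
    gathered = [torch.zeros_like(local_base) for _ in range(world_size)]
    dist.all_gather(gathered, local_base)

    # 4. Translation table: heap_bases[i] is the heap base of rank i
    heap_bases = torch.cat(gathered)
    return heap, heap_bases

def symmetric_alloc(bump_offset, nbytes, alignment=256):
    # 3. Symmetric allocation: every rank advances the same bump pointer,
    #    so a variable sits at the same offset inside every heap
    offset = (bump_offset + alignment - 1) // alignment * alignment
    return offset, offset + nbytes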
The core translation function demonstrates the elegance:
@triton.jit
def __translate(ptr, from_rank, to_rank, heap_bases):
    from_base = tl.load(heap_bases + from_rank)
    to_base = tl.load(heap_bases + to_rank)
    # Convert pointer to integer for arithmetic
    ptr_int = tl.cast(ptr, tl.uint64)
    # Calculate offset in source GPU's heap
    offset = ptr_int - from_base
    # Apply offset to destination GPU's heap base
    to_base_byte = tl.cast(to_base, tl.pointer_type(tl.int8))
    translated_ptr_byte = to_base_byte + offset
    # Cast back to original pointer type
    translated_ptr = tl.cast(translated_ptr_byte, ptr.dtype)
    return translated_ptr
Translation Example: Accessing TENSOR_X Across GPUs 🔗
Using the actual addresses from the symmetric heap:
# TENSOR_X is at offset 0x448 on all GPUs
# GPU 0 wants to access TENSOR_X on GPU 1
# Step 1: GPU 0's view
tensor_x_gpu0 = 0xFFFCABC0 + 0x448 # = 0xFFFCB008
# Step 2: Calculate offset from GPU 0's heap base
offset = 0xFFFCB008 - 0xFFFCABC0 # = 0x448
# Step 3: Apply offset to GPU 1's heap base
tensor_x_gpu1 = 0xFFFC0420 + 0x448 # = 0xFFFC0868
# Result: GPU 0 can directly access GPU 1's TENSOR_X at 0xFFFC0868
flowchart LR
subgraph Translation["Address Translation for TENSOR_X"]
GPU0_ADDR[GPU0 Address: 0xFFFCB008]
GPU0_BASE[GPU0 Base: 0xFFFCABC0]
OFFSET[Offset: 0x448]
GPU1_BASE[GPU1 Base: 0xFFFC0420]
GPU1_ADDR[GPU1 Address: 0xFFFC0868]
GPU0_ADDR -->|"subtract base"| OFFSET
GPU0_BASE --> OFFSET
OFFSET -->|"add to new base"| GPU1_ADDR
GPU1_BASE --> GPU1_ADDR
end
Result[Direct Memory Access via XGMI]
GPU1_ADDR --> Result
Why Linear Address Translation Works 🔗
The brilliance of Iris’s approach is its simplicity. Rather than implementing complex virtual memory systems, Iris recognizes that GPUs already have:
- Flat memory model: Each GPU sees a contiguous address space
- Hardware coherence: XGMI/NVLink maintains cache coherency
- Direct addressing: GPUs can access any address in their space
By using simple offset-based translation (like Linux’s __pa()), Iris achieves:
- Zero abstraction overhead: Just pointer arithmetic
- Predictable performance: No TLB misses or page faults
- Hardware efficiency: Leverages existing GPU memory controllers
- Symmetric design: Every GPU uses identical heap layout
This is fundamentally different from traditional distributed memory systems that require:
- Complex routing tables
- Multiple indirection levels
- Software-managed coherence
- Message serialization/deserialization
Iris proves that multi-GPU memory management doesn’t need complexity—it needs the right primitive: linear address remapping.
Performance validation shows:
- XGMI bandwidth: 96.3% of theoretical maximum
- HBM bandwidth: 93.3% for local access
- Translation overhead: <4% compared to direct access
flowchart TB
subgraph BSP["Bulk Synchronous (Traditional)"]
A[Compute] --> B[Barrier]
B --> C[Communicate]
C --> D[Barrier]
D --> E[Compute]
end
subgraph Iris["Fine-Grained (Iris)"]
F[Unified Kernel]
F --> G[Compute + Store to Remote]
G --> H[Continue Compute]
end
Implementation Patterns 🔗
Iris supports four execution patterns for computation-communication overlap:
Pattern 1: Bulk Synchronous 🔗
Traditional sequential execution. Baseline for comparison.
- Launch compute kernel
- Barrier
- Launch communication kernel
- Barrier
Pattern 2: Producer-Consumer 🔗
Partition compute units between computation and communication.
- Assign N CUs for compute
- Assign M CUs for communication
- Use atomics for synchronization
- Achieves up to 2.5x speedup
Pattern 3: Sequential Fusion 🔗
Single kernel performs computation then communication.
- No intermediate memory access
- Higher register pressure
- 1.2-1.5x speedup for small tiles
Pattern 4: Work Group Specialization 🔗
Single kernel with internal branching based on block ID.
if block_id < compute_blocks:
    do_computation()
else:
    do_communication()
- Best of producer-consumer without multiple kernels
- 1.6x average speedup
Memory Model and Synchronization Primitives 🔗
Device-Side Atomics 🔗
Iris provides GPU-native atomic operations that work across the memory hierarchy. These aren’t host-controlled barriers—they’re fine-grained synchronization primitives that GPUs execute directly.
Supported atomics:
- atomic_cas (compare-and-swap)
- atomic_add, atomic_xchg
- atomic_min/max
- atomic_and, atomic_or, atomic_xor
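For example, a kernel can bump a progress counter that lives on another rank’s heap with a single device-side call; the pointer and rank names here are illustrative, and the signature mirrors the atomic_add example later in this post.
# Inside a @triton.jit kernel: add 1 to a counter on dest_rank's symmetric heap
iris.atomic_add(progress_ptr, 1, cur_rank, dest_rank, heap_bases,
                sem="relaxed", scope="sys")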
Memory Ordering Semantics 🔗
Iris implements the full memory model with acquire-release semantics:
# Producer GPU
data = compute_tile()
iris.store(data, to_rank=1, heap_bases=bases)
iris.atomic_cas(flag, 0, 1, sem="release", scope="system")

# Consumer GPU
while iris.atomic_cas(flag, 1, 1, sem="acquire", scope="system") != 1:
    pass
data = iris.load(from_rank=0, heap_bases=bases)
The release semantic guarantees that the producer’s stores are visible before the flag update becomes visible; the acquire semantic prevents the consumer’s loads from being reordered ahead of the successful flag check.
Scope Hierarchy 🔗
Iris exposes the full GPU memory scope hierarchy:
- Wavefront: Synchronization within a single wavefront
- Workgroup: Between threads in the same workgroup
- GPU: Across all workgroups on the same GPU
- System: Cross-GPU synchronization via XGMI
- World (planned): Cross-node via RDMA
This explicit scope control enables developers to choose the minimal synchronization overhead for their use case.
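For example, the same atomic_cas primitive can target different scopes depending on who must observe the flag; local_lock and remote_flag are illustrative names.
# Publish a flag to workgroups on the same GPU: "gpu" scope is enough
iris.atomic_cas(local_lock, 0, 1, sem="release", scope="gpu")
# Publish a flag to another GPU over XGMI: "system" scope is required
iris.atomic_cas(remote_flag, 0, 1, sem="release", scope="system")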
Advanced Cache Management 🔗
Iris exposes cache placement controls typically available only in assembly. This level of control is critical for multi-GPU performance—remote data often has different reuse patterns than local data.
# Write-through for non-temporal data
iris.store(data, cache_modifier="wt")
# Cache in L2 for reused data
iris.load(data, cache_modifier="ca:L2")
Cache modifier options:
- Write-through (wt): Bypass L1/L2 for non-temporal data
- Cache at L2 (ca:L2): Keep data in L2, bypass L1
- Non-cached (nc): Direct to memory, no caching
- Write-back (wb): Normal caching behavior
Example optimization for tall-skinny matrix multiplication:
# Small matrix fits in L2 - cache it
B = iris.load(B_ptr, cache_modifier="ca:L2")
# Large matrix streams through - don't pollute cache
A = iris.load(A_ptr, cache_modifier="wt")
This prevents cache thrashing and can improve performance by 15-20% for memory-bound kernels.
Performance Analysis 🔗
Benchmark results on AMD MI300X:
- Point-to-point bandwidth: 93-96% of theoretical
- GEMM+AllScatter: 1.2-2.5x speedup vs NCCL
- Flash Decode: 1.4x speedup with fused kernels
- Overhead: ~4% vs raw assembly
Key factors:
- Pattern selection depends on tile size
- Optimal CU partitioning varies by workload
- Register pressure limits fusion benefits
Real-World Code Examples 🔗
Basic Producer-Consumer Pattern 🔗
From the Iris examples, here’s a complete producer-consumer implementation:
@triton.jit
def producer_kernel(source, target, flag, size,
                    producer_rank: tl.constexpr,
                    consumer_rank: tl.constexpr,
                    BLOCK_SIZE: tl.constexpr,
                    heap_bases_ptr):
    pid = tl.program_id(0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < size
    # Load from local memory
    values = iris.load(source + offsets, producer_rank,
                       producer_rank, heap_bases_ptr, mask=mask)
    # Store to remote GPU
    iris.store(target + offsets, values, producer_rank,
               consumer_rank, heap_bases_ptr, mask=mask)
    # Signal completion with flag
    tl.store(flag + pid, 1)

@triton.jit
def consumer_kernel(buffer, flag, size,
                    consumer_rank: tl.constexpr,
                    BLOCK_SIZE: tl.constexpr,
                    heap_bases_ptr):
    pid = tl.program_id(0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    # Spin-wait for producer
    done = tl.load(flag + pid)
    while done == 0:
        done = tl.load(flag + pid)
    # Load data from local memory (written by producer)
    values = iris.load(buffer + offsets, consumer_rank,
                       consumer_rank, heap_bases_ptr,
                       mask=offsets < size)
Atomic Operations for Synchronization 🔗
Iris provides GPU-native atomic operations that work across GPUs:
@triton.jit
def atomic_add_kernel(source, result, size,
                      source_rank: tl.constexpr,
                      dest_rank: tl.constexpr,
                      BLOCK_SIZE: tl.constexpr,
                      heap_bases_ptr):
    pid = tl.program_id(0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < size
    # Atomic add across GPUs
    result = iris.atomic_add(source + offsets, 1,
                             source_rank, dest_rank,
                             heap_bases_ptr, mask=mask,
                             sem="relaxed", scope="sys")
Work Group Specialization for GEMM 🔗
The most sophisticated pattern: a single kernel splits its workgroups between computation and communication (condensed from the Iris GEMM all-scatter example):
@triton.jit
def gemm_all_scatter_wg_specialization(A, B, C, locks,
                                       GEMM_SMS: tl.constexpr,
                                       NUM_SMS: tl.constexpr,
                                       heap_bases, cur_rank):
    pid = tl.program_id(0)
    # Workgroup specialization
    if pid < GEMM_SMS:
        # Compute path
        for tile_id in range(pid, total_tiles, GEMM_SMS):
            # Perform GEMM computation
            acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)
            for k in range(0, K, BLOCK_K):
                a = tl.load(A_ptr)
                b = tl.load(B_ptr)
                acc += tl.dot(a, b)
            # Store with write-through cache modifier
            tl.store(c_global + offset, acc, cache_modifier=".wt")
            tl.debug_barrier()
            tl.store(locks + tile_id, 1, cache_modifier=".wt")
    else:
        # Communication path
        COMM_SMS = NUM_SMS - GEMM_SMS
        for tile_id in range(total_tiles):
            # Wait for compute to complete
            while iris.atomic_cas(locks + tile_id, 1, 1,
                                  sem="acquire", scope="gpu") != 1:
                pass
            # Scatter to all ranks
            for rank in range(world_size):
                if rank != cur_rank:
                    iris.store(remote_ptr, local_data,
                               cur_rank, rank, heap_bases)
Practical Setup 🔗
import iris
import torch
import torch.distributed as dist
# Initialize distributed environment
dist.init_process_group(backend="nccl")
# Create Iris instance with custom heap size
ir = iris.Iris(heap_size=1 << 33)  # 8 GiB heap
# Allocate tensors on the symmetric heap managed by the Iris instance
data = ir.rand((1024, 1024), dtype=torch.float16)
num_tiles = 1024  # e.g. one flag per producer program (see the launch sketch below)
flags = ir.zeros(num_tiles, dtype=torch.int32)
# Get heap bases for kernel launches
heap_bases = ir.get_heap_bases()
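With this setup, launching the producer kernel sketched earlier could look like the following; the grid sizing and rank choices are illustrative, and passing the same symmetric tensor as source and target means the producer reads its local copy and writes rank 1’s copy.
import triton

BLOCK_SIZE = 1024
size = data.numel()
grid = (triton.cdiv(size, BLOCK_SIZE),)  # 1024 programs, matching num_tiles flags

# data and flags live on the symmetric heap, so the kernel can translate their
# addresses from rank 0 to rank 1 through heap_bases (illustrative launch)
producer_kernel[grid](data, data, flags, size,
                      producer_rank=0, consumer_rank=1,
                      BLOCK_SIZE=BLOCK_SIZE,
                      heap_bases_ptr=heap_bases)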
Pattern Selection Guidelines 🔗
Based on the Iris benchmarks across different GEMM shapes:
| Tile Size | Best Pattern | Expected Speedup | CU Split |
|---|---|---|---|
| <16KB | Sequential Fusion | 1.2-1.5x | N/A |
| 16-64KB | Work Group Spec | 1.6-1.8x | 80/20 |
| >64KB | Producer-Consumer | 1.8-2.5x | 70/30 |
| Variable | Dynamic Queue | 1.5-2.0x | Dynamic |
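These heuristics can be folded into a naive selection helper; the thresholds and splits below are simply the values quoted in the table.
def select_pattern(tile_bytes):
    # Naive mapping from per-tile footprint to execution pattern,
    # mirroring the table above (thresholds are not tuned further)
    if tile_bytes < 16 * 1024:
        return "sequential_fusion", None        # 1.2-1.5x, no CU split
    if tile_bytes <= 64 * 1024:
        return "workgroup_specialization", 0.8  # 80/20 compute/comm split
    return "producer_consumer", 0.7             # 70/30 compute/comm split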
Optimization Checklist 🔗
- Profile computation-to-communication ratio
- Experiment with CU partitioning (start with 70/30 split)
- Use write-through for producer data
- Place consumer data in L2 cache
- Align tile sizes to cache lines
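As a starting point for the CU-partitioning experiment above, the 70/30 split on an MI300X (304 CUs) works out to roughly the following.
NUM_SMS = 304                  # CU count on MI300X; adjust for your device
GEMM_SMS = int(NUM_SMS * 0.7)  # ~212 workgroups dedicated to compute
COMM_SMS = NUM_SMS - GEMM_SMS  # ~92 workgroups dedicated to communication
grid = (NUM_SMS,)              # one persistent workgroup per CU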
Current Limitations and Solutions 🔗
| Limitation | Current State | Solution in Development |
|---|---|---|
| Single-node only | XGMI/NVLink only | RDMA support for multi-node |
| Manual heap allocation | Explicit iris.alloc() | Automatic heap management |
| Pattern selection | Manual choice | Analytical model for auto-selection |
| CU partitioning | Manual tuning | Work-queue dynamic scheduling |
| Triton-specific | Python/Triton only | C++ API planned |
Available Examples in Iris 🔗
The Iris repository includes 14 complete examples demonstrating various patterns:
- Basic Operations (00-05): Load, store, atomic operations
- Message Passing (06): Producer-consumer with flags
- GEMM Patterns (07-12):
  - All-scatter with different strategies
  - Atomic-based all-reduce
  - One-shot all-reduce
  - Work group specialization
  - Producer-consumer variants
  - Bulk synchronous baseline
- Flash Decode (13): Attention mechanism with multi-GPU
Each example includes benchmarking code and validation against reference implementations.
Roadmap 🔗
The AMD team is working on:
- Multi-node support via RDMA (“world” scope)
- Automatic pattern selection via analytical models
- Integration with vLLM and inference frameworks
- Reusable collective operation library
- Cross-vendor abstraction layer
- C++ API for non-Python environments
GPU-Initiated Communication: The Paradigm Shift 🔗
The key innovation in Iris is GPU-initiated communication. Traditional frameworks require the CPU to orchestrate every multi-GPU operation. Iris inverts this model:
| Traditional (Host-Initiated) | Iris (GPU-Initiated) |
|---|---|
| CPU launches compute kernel | GPU executes unified kernel |
| CPU waits for completion | GPU computes tile |
| CPU launches comm kernel | GPU directly stores to remote |
| CPU synchronizes GPUs | GPU sets atomic flag |
| CPU launches next kernel | Remote GPU polls and consumes |
This eliminates thousands of CPU-GPU round trips per second in typical workloads.
Vendor-Agnostic Future: A Personal Perspective 🔗
I strongly believe in a vendor-agnostic GPU future. While Iris currently focuses on AMD hardware and Gluon (as discussed in my previous post) targets NVIDIA’s Blackwell architecture, the underlying principles are universal. The core abstractions that make these frameworks powerful aren’t vendor-specific:
- Linear address translation: Simple offset arithmetic works everywhere
- Device-side atomics: Every modern GPU has compare-and-swap
- Direct memory access: PCIe, XGMI, NVLink all provide coherent interconnects
- Flat memory models: GPUs fundamentally see contiguous address spaces
Looking ahead to 2026, I expect we’ll see convergence. Iris’s straightforward implementation and Gluon’s low-level control demonstrate that efficient multi-GPU programming doesn’t require vendor lock-in—it requires the right primitives. As GPU availability becomes increasingly unpredictable and new vendors enter the market (Huawei? RISC-V GPUs?), frameworks that abstract vendor differences while maintaining performance will become critical infrastructure.
Conclusion 🔗
Iris represents a fundamental rethinking of multi-GPU programming. Three core innovations make this possible:
- GPU-Initiated Communication: Eliminates CPU bottlenecks by allowing GPUs to directly orchestrate multi-GPU operations
- Symmetric Heap: Provides zero-copy remote access through elegant address translation
- Device-Side Synchronization: Enables fine-grained producer-consumer patterns without kernel boundaries
The implementation achieves 96% of theoretical bandwidth—proof that current multi-GPU frameworks are over-engineered. As hardware vendors proliferate and GPU availability becomes unpredictable, Iris’s primitives-first approach offers a path to vendor-agnostic multi-GPU programming.
The open-source implementation (GitHub) provides both a production-ready tool and a reference architecture for next-generation multi-GPU frameworks. The simplicity of the core translation function—three lines that enable cross-GPU memory access—demonstrates that the right abstractions matter more than code volume.