The Hidden Math Bug That Makes AI Unpredictable

· 964 words · 5 minute read

This tweet from Awni Hannun demonstrates, in one line of MLX code, the nondeterminism phenomenon detailed in Thinking Machines' research. I found that tweet extremely helpful for understanding what the original blog post was about, so in this post we explore the PyTorch equivalent, which reveals a fundamental issue in AI systems.

The One-Line Experiment 🔗

torch.all(A @ B == torch.stack([a @ B for a in A]))

Here’s a simple test that exposes why your AI model gives different answers to the same question. You can also refer to the notebook I wrote on modal.com if you want to try it yourself.

import torch

torch.manual_seed(42)
A = torch.randn(128, 256, dtype=torch.bfloat16, device='cuda')
B = torch.randn(256, 512, dtype=torch.bfloat16, device='cuda')

batched = A @ B
sequential = torch.stack([a @ B for a in A])

print("Are they equal?", torch.all(batched == sequential).item())
print("Max difference:", (batched - sequential).abs().max().item())
print("Mean difference:", (batched - sequential).abs().mean().item())
# Are they equal? False
# Max difference: 0.001953125
# Mean difference: 6.007030606269836e-08

Same mathematical operation. Different results. This isn’t a bug—it’s how modern ML frameworks work.

The Root Cause 🔗

When GPUs process matrix multiplications, they optimize differently based on batch size:

  • Batched operation (A @ B): Dispatches a single matrix-matrix kernel with its own tiling and parallel reduction strategy
  • Sequential operation ([a @ B for a in A]): Runs a separate vector-matrix product for each row of A, taking a different code path

Different computation orders lead to different floating-point rounding errors. Think of it like adding numbers in different orders:

  • (0.1 + 1e20) - 1e20 = 0
  • 0.1 + (1e20 - 1e20) = 0.1

Same math, different results. This is referred to as floating-point non-associativity in the original blog post.
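You can check this directly in plain Python, with ordinary float64 arithmetic and no GPU involved:

x, big = 0.1, 1e20

# Adding 0.1 to 1e20 first rounds it away entirely; cancelling 1e20 first keeps it.
print((x + big) - big)   # 0.0
print(x + (big - big))   # 0.1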

Why Your ChatGPT Responses Vary 🔗

Ever wonder why ChatGPT gives slightly different answers to identical prompts, even with temperature set to 0?

It’s not randomness in the model—it’s batch size variability:

  1. Morning (low traffic): Your query processes in a small batch
  2. Peak hours (high traffic): Your query joins a large batch
  3. Different batch size = Different computation path = Different result

In tests with a 235B parameter model (Qwen/Qwen3-235B-A22B-Instruct-2507), Thinking Machines researchers found 80 unique outputs from 1000 identical requests. The responses matched for 102 tokens, then diverged purely due to computational differences.
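You can reproduce a miniature version of this batch-size effect locally. The sketch below compares the same input row computed on its own versus inside a larger batch; the exact outcome depends on your GPU, CUDA, and PyTorch versions, and on some setups the two results will happen to match.

import torch

torch.manual_seed(0)
x = torch.randn(1, 4096, dtype=torch.bfloat16, device='cuda')       # one "request"
rest = torch.randn(31, 4096, dtype=torch.bfloat16, device='cuda')   # other traffic
W = torch.randn(4096, 4096, dtype=torch.bfloat16, device='cuda')

alone = x @ W                              # processed as a batch of 1
in_batch = (torch.cat([x, rest]) @ W)[:1]  # same row, processed inside a batch of 32

print("Identical?", torch.equal(alone, in_batch))
print("Max difference:", (alone - in_batch).abs().max().item())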

This might explain why some users report that Claude Code feels less reliable during daytime hours: when server load is higher, batch sizes change, leading to subtly different model behavior.

The Precision Cascade 🔗

The problem worsens as we push for efficiency with lower precision:

graph TD
    Float64["float64<br/>53-bit mantissa"] --> Float32["float32<br/>23-bit mantissa"]
    Float32 --> Float16["float16<br/>10-bit mantissa"]
    Float16 --> BFloat16["bfloat16<br/>7-bit mantissa"]
    BFloat16 --> FP8["FP8 (E4M3/E5M2)<br/>3-2 bit mantissa"]
    FP8 --> NVFP4["NVFP4<br/>1-bit mantissa (E2M1)<br/>(June 2024)"]
    NVFP4 --> Int4["int4<br/>No decimals"]
    
    Float64 -.-> E64["Error: ~10⁻¹⁶"]
    Float32 -.-> E32["Error: ~10⁻⁷"]
    Float16 -.-> E16["Error: ~10⁻³"]
    BFloat16 -.-> EB16["Error: ~10⁻²"]
    FP8 -.-> E8["Error: ~10⁻¹"]
    NVFP4 -.-> E4["Error: ~10⁰"]
    Int4 -.-> EI4["Error: ~10¹"]
    
    style Float64 fill:#e8f5e9
    style Float32 fill:#fff9c4
    style Float16 fill:#ffe0b2
    style BFloat16 fill:#ffccbc
    style FP8 fill:#ffb3a0
    style NVFP4 fill:#ff9980
    style Int4 fill:#ff8066

The race to the bottom continues with FP8 and the newly introduced NVFP4 format from NVIDIA (June 2024), pushing precision boundaries even further. Modern AI models increasingly use these ultra-low precision formats for speed, trading mathematical accuracy for performance. Each operation compounds these errors:

Total Error ≈ Operations × Precision Error × Batch Variance

With billions of operations per inference, small differences cascade into completely different outputs.
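One rough way to see the cascade is to rerun the earlier batched-versus-sequential experiment at several precisions. This is only a sketch: the exact gaps depend on your hardware and library versions, and at float32 the two paths may even agree bit-for-bit on some setups, while the lower-precision formats typically drift much further apart.

import torch

torch.manual_seed(42)
for dtype in (torch.float32, torch.bfloat16, torch.float16):
    A = torch.randn(128, 256, dtype=dtype, device='cuda')
    B = torch.randn(256, 512, dtype=dtype, device='cuda')
    # Same product computed via the fast batched kernel and via a per-row loop.
    diff = (A @ B - torch.stack([a @ B for a in A])).abs().max().item()
    print(f"{dtype}: max batched-vs-sequential difference = {diff}")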

Real-World Impact 🔗

This nondeterminism affects every AI system:

Research Reproducibility 🔗

  • Same code, same data, different results
  • Papers become impossible to verify
  • Scientific method breaks down

Production Systems 🔗

  • A/B tests give misleading results (load affects outcomes)
  • Model behavior changes with traffic patterns
  • Debugging becomes a nightmare

Training vs Inference 🔗

  • Models trained with one batch size
  • Deployed with variable batch sizes
  • Performance degrades unpredictably

Personal Experience: The Optimization Trap 🔗

I find this problem particularly fascinating because I’ve been experiencing it firsthand while tackling GPU MODE performance optimization challenges like trimul. Especially when switching from PyTorch to Triton to optimize specific parts of the computation, I’ve struggled countless times to maintain correctness without breaking the benchmark tests.

Every optimization introduces subtle numerical differences. You think you’ve made the code faster, but suddenly your tests fail—not because your algorithm is wrong, but because your optimized version computes the same math in a different order. The benchmark expects bit-for-bit identical results, but floating-point arithmetic doesn’t cooperate.

It’s a constant battle: do you optimize for speed and accept numerical drift, or maintain perfect reproducibility at the cost of performance? In competitive optimization challenges, you need both—and that’s where things get really interesting.
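In practice, the usual compromise in kernel testing is to compare against a reference implementation with a tolerance instead of exact equality. This is a generic sketch (not how any particular GPU MODE benchmark actually works), and the tolerances shown are arbitrary placeholders you would tune per dtype:

import torch

def check_kernel(optimized_out: torch.Tensor, reference_out: torch.Tensor) -> None:
    # Passes as long as results agree within the given tolerances,
    # so a reordered-but-correct floating-point computation is not rejected.
    torch.testing.assert_close(optimized_out, reference_out, rtol=1e-2, atol=1e-3)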

The Dilemma 🔗

This isn’t an implementation bug we can patch. It’s a fundamental trade-off, sketched below, between:

  • Mathematical correctness: Batch-invariant kernels that guarantee identical results
  • Computational efficiency: Current optimized kernels that maximize throughput
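As a crude illustration of that trade-off (this is not the batch-invariant kernel work from the Thinking Machines post, just a sketch of the cost of pinning the computation path), you can time the optimized batched kernel against a fixed per-row path that never changes with batch size. Numbers will vary by GPU:

import time
import torch

torch.manual_seed(0)
A = torch.randn(512, 4096, dtype=torch.bfloat16, device='cuda')
B = torch.randn(4096, 4096, dtype=torch.bfloat16, device='cuda')

def timed(fn):
    fn()                          # warm-up so compilation/caching isn't measured
    torch.cuda.synchronize()
    start = time.perf_counter()
    fn()
    torch.cuda.synchronize()
    return time.perf_counter() - start

fast = timed(lambda: A @ B)                             # optimized batched kernel
slow = timed(lambda: torch.stack([a @ B for a in A]))   # fixed per-row path

print(f"Batched: {fast * 1e3:.2f} ms, per-row: {slow * 1e3:.2f} ms")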

What This Means 🔗

We’re building AGI on foundations that violate basic mathematical properties. Every transformer model, every ChatGPT response, every AI decision inherits this nondeterminism.

The next time an AI gives you a different answer to the same question, remember: it’s not thinking differently. It’s just adding numbers in a different order.

And in the world of floating-point arithmetic, that makes all the difference.


Inspired by Thinking Machines' research on defeating nondeterminism in LLM inference, which revealed that batch size dependency—not concurrency or randomness—is the primary source of AI unpredictability.