# The Hidden Math Bug That Makes AI Unpredictable

URL: https://www.msuiche.com/posts/the-hidden-math-bug-that-makes-ai-unpredictable/
Date: 2025-09-14
Author: Matt Suiche
Tags: determinism, floating-point, neural-networks, pytorch, mlx

---

This [tweet from Awni Hannun](https://x.com/awnihannun/status/1966953027451118012) demonstrates, in one line of MLX code, the nondeterminism phenomenon detailed in [Thinking Machines' research](https://thinkingmachines.ai/blog/defeating-nondeterminism-in-llm-inference/). We will explore the PyTorch equivalent, which reveals a fundamental issue in AI systems. I found the tweet extremely helpful for understanding what the original blog post was about.

Here's a one-line code summary in MLX of the @thinkymachines blog post on non-determinism in LLM inference.

I'd guess the difference is larger the lower the precision, as you get larger affects from non-associativity of FP math.

Interestingly, that implies that training at low… pic.twitter.com/jYcDK9GiLn
— Awni Hannun (@awnihannun) September 13, 2025

## The One-Line Experiment

```python
torch.all(A @ B == torch.stack([a @ B for a in A]))
```

Here's a simple test that exposes why your AI model gives different answers to the same question. You can also refer to the [Modal notebook I wrote if you want to test it](https://modal.com/notebooks/msuiche/_/nb-SrLWDcHNB6eso2hcmcle9f).

```python
import torch

torch.manual_seed(42)

# Random bfloat16 operands on the GPU
A = torch.randn(128, 256, dtype=torch.bfloat16, device='cuda')
B = torch.randn(256, 512, dtype=torch.bfloat16, device='cuda')

# One batched matmul vs. one vector-matrix product per row of A
batched = A @ B
sequential = torch.stack([a @ B for a in A])

print("Are they equal?", torch.all(batched == sequential).item())
print("Max difference:", (batched - sequential).abs().max().item())
print("Mean difference:", (batched - sequential).abs().mean().item())

# Are they equal? False
# Max difference: 0.001953125
# Mean difference: 6.007030606269836e-08
```

Same mathematical operation. Different results. This isn't a bug; it's how modern ML frameworks work.

## The Root Cause

When GPUs process matrix multiplications, they optimize differently based on batch size:

- **Batched operation** (`A @ B`): Uses parallel reduction algorithms over the whole matrix
- **Sequential operation** (`[a @ B for a in A]`): Processes each row of `A` individually, as a separate vector-matrix product

Different computation orders lead to different floating-point rounding errors. Think of it like adding numbers in different orders:

- `(0.1 + 1e20) - 1e20 = 0`
- `0.1 + (1e20 - 1e20) = 0.1`

Same math, different results. This is referred to as `floating-point non-associativity` in the original blog post.

## Why Your ChatGPT Responses Vary

Ever wonder why ChatGPT gives slightly different answers to identical prompts, even with temperature set to 0? According to the Thinking Machines post, the main culprit isn't randomness in the model; it's batch size variability:

1. **Morning (low traffic)**: Your query is processed in a small batch
2. **Peak hours (high traffic)**: Your query joins a large batch
3. **Different batch size** = Different computation path = Different result

In tests with a 235B parameter model (`Qwen/Qwen3-235B-A22B-Instruct-2507`), Thinking Machines researchers found **80 unique outputs** from 1,000 identical requests at temperature 0. The responses matched for the first 102 tokens, then diverged purely because of numerical differences.

This might explain why [some users report that Claude Code feels less reliable during daytime hours](https://x.com/TheAhmadOsman/status/1961326485672772040): when server load is higher, batch sizes change, which can lead to subtly different model behavior:

some of anthropic rugpulls so far:

> 5 years retention of all conversations and code, all data will be used for training

> 1.58-bit quantized models during daytime

> plus not getting opus 4 in claude code

> max plans limits cut in half 6 weeks ago, no comms

> weekly limits… pic.twitter.com/Ss18u4x5Dz
— Ahmad (@TheAhmadOsman) August 29, 2025
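
The batch-size argument in this section can be imitated on a CPU with plain Python floats. The sketch below is only an analogy: `chunk_size` is a hypothetical stand-in for whatever reduction strategy a GPU kernel picks for a given batch size, and real kernels choose tiling and split reductions in far more involved ways. It reuses the `0.1` and `1e20` values from the associativity example and shows the same four numbers summing to different totals depending on how they are grouped.

```python
def naive_sum(values):
    """Add left to right with a plain loop (no compensated summation)."""
    total = 0.0
    for v in values:
        total += v
    return total


def chunked_sum(values, chunk_size):
    """Sum each fixed-size chunk first, then add the partial sums together."""
    partials = [
        naive_sum(values[i:i + chunk_size])
        for i in range(0, len(values), chunk_size)
    ]
    return naive_sum(partials)


# The exact mathematical sum of these four values is 0.2.
values = [0.1, 1e20, -1e20, 0.1]

# chunk_size is an illustrative knob, not a real GPU parameter.
results = {size: chunked_sum(values, size) for size in (1, 2, 4)}
for size, total in results.items():
    print(f"chunk_size={size}: {total}")
```

With a chunk size of 2, each `0.1` is absorbed by a huge neighbor before the two large values cancel, so the total is `0.0`. With chunk sizes of 1 or 4 the large values cancel first and one `0.1` survives. None of the groupings recovers the exact answer of `0.2`.
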

## The Precision Cascade

The problem worsens as we push for efficiency with lower precision:

```mermaid
graph TD
    Float64["float64<br/>52-bit mantissa"] --> Float32["float32<br/>23-bit mantissa"]
    Float32 --> Float16["float16<br/>10-bit mantissa"]
    Float16 --> BFloat16["bfloat16<br/>7-bit mantissa"]
    BFloat16 --> FP8["FP8 (E4M3/E5M2)<br/>3- or 2-bit mantissa"]
    FP8 --> NVFP4["NVFP4<br/>1-bit mantissa<br/>(June 2025)"]
    NVFP4 --> Int4["int4<br/>No decimals"]

    Float64 -.-> E64["Error: ~10⁻¹⁶"]
    Float32 -.-> E32["Error: ~10⁻⁷"]
    Float16 -.-> E16["Error: ~10⁻³"]
    BFloat16 -.-> EB16["Error: ~10⁻²"]
    FP8 -.-> E8["Error: ~10⁻¹"]
    NVFP4 -.-> E4["Error: ~10⁰"]
    Int4 -.-> EI4["Error: ~10¹"]

    style Float64 fill:#e8f5e9
    style Float32 fill:#fff9c4
    style Float16 fill:#ffe0b2
    style BFloat16 fill:#ffccbc
    style FP8 fill:#ffb3a0
    style NVFP4 fill:#ff9980
    style Int4 fill:#ff8066
```

The race to the bottom continues with FP8 and the newly introduced [NVFP4 format from NVIDIA (June 2025)](https://developer.nvidia.com/blog/introducing-nvfp4-for-efficient-and-accurate-low-precision-inference/), pushing precision boundaries even further. Modern AI models increasingly use these ultra-low precision formats for speed, trading mathematical accuracy for performance.

Each operation compounds these errors. As a rough mental model rather than a formal bound:

**Total Error ≈ Operations × Precision Error × Batch Variance**

With billions of operations per inference, a small difference can eventually flip a single token choice, and from there the outputs diverge completely.

## Real-World Impact

This nondeterminism affects every AI system:

### Research Reproducibility

- Same code, same data, different results
- Papers become impossible to verify
- Scientific method breaks down

### Production Systems

- A/B tests give misleading results (load affects outcomes)
- Model behavior changes with traffic patterns
- Debugging becomes a nightmare

### Training vs Inference

- Models trained with one batch size
- Deployed with variable batch sizes
- Performance degrades unpredictably

## Personal Experience: The Optimization Trap

I find this problem particularly fascinating because I've been experiencing it firsthand while tackling [GPU MODE performance optimization challenges like trimul](https://www.gpumode.com/v2/leaderboard/496?tab=rankings). When switching from PyTorch to Triton to optimize specific parts of the computation, I've struggled countless times to maintain correctness without breaking the benchmark tests.

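
The tension here, where a reordered computation is numerically close to a reference but not bit-identical, can be reproduced in a few lines of plain Python. This is a minimal sketch under stated assumptions: `reference_sum` and `reordered_sum` are hypothetical stand-ins for a PyTorch baseline and a hand-optimized kernel, and the `rel_tol` value is an arbitrary choice for this float64 example, not the tolerance any particular benchmark uses.

```python
import math


def reference_sum(values):
    """Reference implementation: add left to right."""
    total = 0.0
    for v in values:
        total += v
    return total


def reordered_sum(values):
    """Hypothetical 'optimized' version: same math, added right to left."""
    total = 0.0
    for v in reversed(values):
        total += v
    return total


values = [0.1, 0.2, 0.3]
ref = reference_sum(values)
opt = reordered_sum(values)

# Exact comparison, as a bit-for-bit test would do.
bit_identical = ref == opt

# Tolerance-based comparison; rel_tol=1e-9 is an assumed threshold.
within_tolerance = math.isclose(ref, opt, rel_tol=1e-9)

print(ref, opt)
print("bit-identical:", bit_identical)
print("within tolerance:", within_tolerance)
```

The exact check fails because the two orders round differently in the last bit, while the tolerance check passes. Whether a tolerance is acceptable depends on the test suite; a checker that demands exact equality will reject the reordered version even though the algorithm is right.
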
Every optimization introduces subtle numerical differences. You think you've made the code faster, but suddenly your tests fail, not because your algorithm is wrong, but because your optimized version computes the same math in a different order. The benchmark expects bit-for-bit identical results, but floating-point arithmetic doesn't cooperate.

It's a constant battle: do you optimize for speed and accept numerical drift, or maintain perfect reproducibility at the cost of performance? In competitive optimization challenges, you need both, and that's where things get really interesting.

## The Dilemma

This isn't an implementation bug we can patch. It's a fundamental trade-off between:

- **Mathematical correctness**: Batch-invariant kernels that guarantee identical results
- **Computational efficiency**: Current optimized kernels that maximize throughput

## What This Means

We're building AGI on foundations that violate basic mathematical properties such as the associativity of addition. Every transformer model, every ChatGPT response, every AI decision inherits this nondeterminism.

The next time an AI gives you a different answer to the same question, remember: it's not thinking differently. It's just adding numbers in a different order. And in the world of floating-point arithmetic, that makes all the difference.

---

*Inspired by [Thinking Machines' research on defeating nondeterminism in LLM inference](https://thinkingmachines.ai/blog/defeating-nondeterminism-in-llm-inference/), which revealed that batch size dependency, rather than concurrency or randomness, is the primary source of AI unpredictability.*