RustBPE: High-Performance BPE Tokenizer Training in Rust

ยท 1452 words ยท 7 minute read

Introduction ๐Ÿ”—

Byte Pair Encoding (BPE) tokenization is used in modern language models, but efficient training implementations are limited. OpenAI’s tiktoken handles inference well, while HuggingFace’s tokenizers supports training but has complexity and overhead. RustBPE is a Rust implementation that provides training capabilities with better performance.

RustBPE was developed by Andrej Karpathy as part of the nanochat project. This analysis covers the RustBPE implementation, including its architecture, performance characteristics, and Python integration.

For those interested in understanding BPE implementation from first principles, Sebastian Raschka provides an excellent deep-dive into implementing BPE from scratch in his blogpost, and this is also covered in his book “Build a Large Language Model (From Scratch)”. His work offers invaluable insights into the algorithmic foundations that underpin implementations like RustBPE.

How BPE Works ๐Ÿ”—

Byte Pair Encoding (BPE) builds vocabulary by iteratively merging the most frequent character pairs:

graph TD
    A[Input Text] --> B[Tokenize to Characters]
    B --> C[Count All Adjacent Pairs]
    C --> D[Find Most Frequent Pair]
    D --> E[Merge Pair into New Token]
    E --> F{Reached Target Vocab Size?}
    F -->|No| C
    F -->|Yes| G[Final Vocabulary]
    
    style A fill:#e1f5ff
    style G fill:#ccffcc
    
    subgraph "Example: 'low lower lowest'"
        H["l o w<br/>l o w e r<br/>l o w e s t"] --> I["Count pairs: 'lo':2, 'ow':3, 'we':2, 'er':1, 'es':1, 'st':1"]
        I --> J["Merge 'ow' โ†’ 'low'"]
        J --> K["l ow<br/>l ow er<br/>l ow est"]
        K --> L["New tokens: 'low'"]
    end

The algorithm continues until reaching the desired vocabulary size, building tokens from characters up to full words.

The Problem Space ๐Ÿ”—

Existing Solutions ๐Ÿ”—

  1. tiktoken: Inference-only, no training capabilities
  2. HuggingFace tokenizers: Full-featured but complex with overhead
  3. minbpe: Pure Python, simple but inefficient for large datasets

The RustBPE Approach ๐Ÿ”—

  • Training: Rust implementation with parallel processing
  • Inference: Export to tiktoken format
  • Design: Minimal codebase without unnecessary complexity

RustBPE Architecture ๐Ÿ”—

graph TD
    A[Text Iterator] --> B[Buffer Collection]
    B --> C[Parallel Regex Processing]
    C --> D[Parallel Pair Counting]
    D --> E[Priority Queue with Merges]
    E --> F[Delta-Based Updates]
    F --> G{More Merges Needed?}
    G -->|Yes| D
    G -->|No| H[Export to Tiktoken Format]
    
    I[Rayon Thread Pool] --> C
    I --> D
    
    J[Compact Data Structures] --> D
    J --> E
    J --> F
    
    K[Lazy Heap Refresh] --> E
    
    style A fill:#e1f5ff
    style H fill:#ccffcc
    style I fill:#fff4cc
    style J fill:#ffe4e1
    style K fill:#f0f8ff
    
    subgraph "Performance Optimizations"
        L[โ€ข Parallel processing<br/>โ€ข Delta updates<br/>โ€ข Compact strings<br/>โ€ข Lazy refresh<br/>โ€ข Cache-efficient layout]
    end

RustBPE optimizes each stage with parallel processing, efficient data structures, and incremental updates to achieve better performance than naive implementations.

Architecture Overview ๐Ÿ”—

Core Components ๐Ÿ”—

pub struct Tokenizer {
    /// Maps pairs of token IDs to their merged token ID
    pub merges: StdHashMap<Pair, u32>,
    /// The regex pattern used for text splitting
    pub pattern: String,
    /// Compiled regex for efficiency
    compiled_pattern: Regex,
}

Key Data Structures ๐Ÿ”—

  1. Word: Represents a tokenized sequence with efficient pair iteration
  2. MergeJob: Priority queue entry for merge operations with frequency tracking
  3. Pair: Type alias for (u32, u32) representing token ID pairs

Algorithm Implementation ๐Ÿ”—

1. Text Preprocessing ๐Ÿ”—

RustBPE uses GPT-4 style regex pattern for text splitting:

const GPT4_PATTERN: &str = r"'(?i:[sdmt]|ll|ve|re)|[^\r\n\p{L}\p{N}]?+\p{L}+|\p{N}{1,3}| ?[^\s\p{L}\p{N}]++[\r\n]*|\s*[\r\n]|\s+(?!\S)|\s+";

This pattern handles:

  • Contractions (don’t, won’t, etc.)
  • Words and numbers
  • Whitespace and punctuation
  • Special characters

2. Parallel Training Pipeline ๐Ÿ”—

The training process is designed for maximum parallelization:

Streaming Iterator Pattern ๐Ÿ”—

pub fn train_from_iterator(
    &mut self,
    py: pyo3::Python<'_>,
    iterator: &pyo3::Bound<'_, pyo3::PyAny>,
    vocab_size: u32,
    buffer_size: usize,
    pattern: Option<String>,
) -> PyResult<()>

Optimizations:

  • Buffer-based processing to reduce GIL contention
  • Parallel regex matching with Rayon
  • Incremental pair counting with position tracking

Parallel Pair Counting ๐Ÿ”—

fn count_pairs_parallel(
    words: &[Word],
    counts: &[i32],
) -> (AHashMap<Pair, i32>, AHashMap<Pair, AHashSet<usize>>)

This function:

  • Processes words in parallel using Rayon’s par_iter()
  • Maintains local pair counts and position tracking
  • Reduces results efficiently using parallel reduction

3. Incremental Merge Algorithm ๐Ÿ”—

The core training loop uses several sophisticated optimizations:

Priority Queue with Lazy Refresh ๐Ÿ”—

let mut heap = OctonaryHeap::with_capacity(pair_counts.len());
  • Uses an 8-ary heap for better cache locality
  • Implements lazy refresh to avoid expensive heap rebuilds
  • Maintains deterministic merge order with tie-breaking

Delta-Based Updates ๐Ÿ”—

Instead of recomputing all pair counts after each merge, RustBPE tracks deltas:

fn merge_pair(&mut self, pair: Pair, new_id: u32) -> Vec<(Pair, i32)> {
    // Returns local pair-count deltas for THIS word only:
    //   -1 for removed pairs, +1 for newly created pairs
}

This approach:

  • Avoids HashMap operations in hot loops
  • Reduces memory allocations
  • Enables efficient parallel updates

4. Memory Efficiency ๐Ÿ”—

Compact String Usage ๐Ÿ”—

use compact_str::CompactString;
  • Stores small strings inline without heap allocation
  • Reduces memory overhead
  • Improves cache performance

Position Tracking Optimization ๐Ÿ”—

struct MergeJob {
    pair: Pair,
    count: u64,
    pos: AHashSet<usize>, // Only tracks affected word indices
}

Instead of tracking all occurrences, only stores indices of words containing each pair.

Performance Characteristics ๐Ÿ”—

Performance ๐Ÿ”—

Based on benchmarks, RustBPE provides:

  1. Training Speed: Faster than Python implementations
  2. Memory Usage: Lower memory footprint with efficient data structures
  3. Scalability: Parallel scaling with dataset size

Performance Features:

  1. Parallel Processing: Multi-core utilization
  2. Cache Efficiency: Optimized data structures
  3. Minimal Allocations: Reduced memory management overhead

Integration with Python Ecosystem ๐Ÿ”—

PyO3 Bindings ๐Ÿ”—

RustBPE uses PyO3 for seamless Python integration:

#[pymodule]
fn rustbpe(m: &Bound<'_, PyModule>) -> PyResult<()> {
    pyo3_log::init();
    m.add_class::<Tokenizer>()?;
    Ok(())
}

Tiktoken Export ๐Ÿ”—

The trained tokenizer can be exported to tiktoken format:

pub fn get_mergeable_ranks(&self) -> Vec<(Vec<u8>, u32)> {
    // Builds vocabulary incrementally from merges
    // Returns format compatible with tiktoken
}

This enables:

  • Fast inference with tiktoken
  • Compatibility with OpenAI tooling
  • Deployment in production environments

Usage Examples ๐Ÿ”—

Basic Training ๐Ÿ”—

import rustbpe

# Create tokenizer
tokenizer = rustbpe.Tokenizer()

# Train from text iterator
tokenizer.train_from_iterator(
    text_iterator, 
    vocab_size=2048,
    pattern=None  # Uses default GPT-4 pattern
)

Export to Tiktoken ๐Ÿ”—

# Get pattern and mergeable ranks
pattern = tokenizer.get_pattern()
mergeable_ranks_list = tokenizer.get_mergeable_ranks()
mergeable_ranks = {bytes(k): v for k, v in mergeable_ranks_list}

# Create tiktoken encoding
import tiktoken
enc = tiktoken.Encoding(
    name="rustbpe",
    pat_str=pattern,
    mergeable_ranks=mergeable_ranks,
    special_tokens={},
)

Encoding Text ๐Ÿ”—

# Encode with rustbpe
rustbpe_ids = tokenizer.encode("Hello, world!")

# Encode with tiktoken (faster for inference)
tiktoken_ids = enc.encode("Hello, world!")

Advanced Features ๐Ÿ”—

Custom Patterns ๐Ÿ”—

# Use custom regex pattern
custom_pattern = r"[^\s]+"  # Simple whitespace splitting
tokenizer.train_from_iterator(
    text_iterator,
    vocab_size=1024,
    pattern=custom_pattern
)

Streaming Training ๐Ÿ”—

def text_generator():
    with open("large_corpus.txt", "r") as f:
        for line in f:
            yield line.strip()

# Train on large datasets without loading everything into memory
tokenizer.train_from_iterator(
    text_generator(),
    vocab_size=4096,
    buffer_size=8192  # Tune based on memory constraints
)

Implementation Details ๐Ÿ”—

Dependencies ๐Ÿ”—

  • rayon: Data parallelism
  • dary_heap: Priority queues
  • fancy-regex: Unicode regex support
  • ahash: Fast hash implementations
  • compact_str: Memory-efficient string storage
  • pyo3: Python bindings

Memory Layout ๐Ÿ”—

Memory Layout:

  1. Sequential Access: Words stored contiguously for prefetching
  2. Small Types: u32 for token IDs to balance range and memory
  3. Inline Storage: Small strings stored inline when possible

Error Handling ๐Ÿ”—

Error Handling:

// Validate vocab size
assert!(vocab_size >= 256, "vocab_size must be at least 256");

// Regex compilation with error handling
self.compiled_pattern = Regex::new(&pattern_str)
    .map_err(|e| pyo3::exceptions::PyValueError::new_err(
        format!("Invalid regex pattern: {}", e)
    ))?;

Comparison with minbpe ๐Ÿ”—

Overview ๐Ÿ”—

minbpe is Andrej Karpathy’s educational implementation of BPE tokenization in pure Python. It serves as an excellent reference implementation and learning tool, but was designed primarily for clarity and educational purposes rather than production performance.

Architecture Differences ๐Ÿ”—

minbpe Design Philosophy ๐Ÿ”—

# minbpe prioritizes clarity and simplicity
def get_stats(ids, counts=None):
    counts = {} if counts is None else counts
    for pair in zip(ids, ids[1:]):
        counts[pair] = counts.get(pair, 0) + 1
    return counts

def merge(ids, pair, idx):
    newids = []
    i = 0
    while i < len(ids):
        if ids[i] == pair[0] and i < len(ids) - 1 and ids[i+1] == pair[1]:
            newids.append(idx)
            i += 2
        else:
            newids.append(ids[i])
            i += 1
    return newids

RustBPE Design Philosophy ๐Ÿ”—

// RustBPE prioritizes performance and efficiency
fn merge_pair(&mut self, pair: Pair, new_id: u32) -> Vec<(Pair, i32)> {
    // Returns local pair-count deltas for THIS word only:
    //   -1 for removed pairs, +1 for newly created pairs
    // NOTE: this version deliberately avoids a HashMap in the hot loop.
}

Key Differences ๐Ÿ”—

Aspect minbpe RustBPE
Language Pure Python Rust with Python bindings
Parallelism Single-threaded Multi-threaded (Rayon)
Memory Usage Higher (Python objects) Lower (compact structures)
Training Speed ~25 seconds for small dataset Significantly faster
Inference Python implementation Exports to tiktoken

Algorithm Differences:

  • minbpe: Naive pair counting, simple merging, educational focus
  • RustBPE: Incremental updates, delta tracking, parallel processing, production-optimized

Use Cases:

  • minbpe: Education, prototyping, small datasets, debugging
  • RustBPE: Production training, large datasets, performance-critical applications

The implementations serve different purposes: minbpe for learning BPE concepts, RustBPE for production deployment.

Conclusion ๐Ÿ”—

RustBPE by Andrej Karpathy provides:

  • Performance: Parallel processing and optimized algorithms faster than Python implementations
  • Simplicity: Clean implementation without unnecessary complexity
  • Compatibility: Integration with Python ecosystem and tiktoken format
  • Production-Ready: Complete implementation with error handling

For projects requiring efficient tokenizer training without HuggingFace’s complexity, RustBPE provides a solution that balances simplicity and performance. The implementation demonstrates how algorithm design and systems programming can create fast, maintainable ML infrastructure tools.


Analysis of Andrej Karpathy’s RustBPE implementation in the nanochat project.