07 — Performance
Speed Benchmarks
All times measured on an Intel Core i7-1165G7 at 2.8 GHz (single thread).
| Operation | ML-KEM-512 | ML-KEM-768 | ML-KEM-1024 | RSA-2048 | ECDH P-256 |
|---|---|---|---|---|---|
| KeyGen | 45 µs | 78 µs | 120 µs | 5,200 µs | 52 µs |
| Encapsulate | 65 µs | 95 µs | 145 µs | 150 µs* | 98 µs |
| Decapsulate | 72 µs | 108 µs | 168 µs | 4,800 µs | 52 µs |
| Total handshake | 182 µs | 281 µs | 433 µs | 10,150 µs | 202 µs |
*RSA "encapsulate" = encryption with public key
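The source does not say how these timings were collected. As a rough sketch, per-operation figures like the ones above could be gathered with a harness such as the following. `median_time_us` is a hypothetical helper, and the SHA3 call is only a stand-in workload, not an ML-KEM operation; in practice you would pass in the KeyGen, Encapsulate, or Decapsulate call of whichever library you are measuring.

```python
import hashlib
import statistics
import time


def median_time_us(operation, iterations: int = 1000, warmup: int = 100) -> float:
    """Return the median wall-clock time of `operation()` in microseconds."""
    for _ in range(warmup):  # warm caches and the interpreter before measuring
        operation()
    samples = []
    for _ in range(iterations):
        start = time.perf_counter_ns()
        operation()
        samples.append((time.perf_counter_ns() - start) / 1000)  # ns -> µs
    return statistics.median(samples)


def stand_in_workload() -> bytes:
    """Placeholder for a KEM operation; this is NOT ML-KEM."""
    return hashlib.sha3_256(b"\x00" * 1184).digest()


if __name__ == "__main__":
    print(f"stand-in workload: {median_time_us(stand_in_workload):.2f} µs")
```

The median is used rather than the mean so that occasional scheduler interruptions do not skew the result.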
What This Means
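- ML-KEM-768 completes a handshake ~36× faster than RSA-2048 (281 µs vs 10,150 µs); across the three levels the speedup ranges from ~23× (ML-KEM-1024) to ~56× (ML-KEM-512)
- ML-KEM is comparable to ECDH P-256 in total handshake time: ML-KEM-512 is slightly faster (182 µs vs 202 µs), while ML-KEM-1024 is roughly 2× slower (433 µs)
- The speed difference between ML-KEM levels is small in absolute terms: ML-KEM-1024 adds about 250 µs per handshake over ML-KEM-512, so higher security is cheap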
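The ratios behind these points follow directly from the speed table. A minimal sketch of the arithmetic, with the timings copied from the table (the function names are illustrative only):

```python
# Per-operation timings in microseconds, copied from the speed table:
# (KeyGen, Encapsulate, Decapsulate).
TIMINGS_US = {
    "ML-KEM-512": (45, 65, 72),
    "ML-KEM-768": (78, 95, 108),
    "ML-KEM-1024": (120, 145, 168),
    "RSA-2048": (5200, 150, 4800),
    "ECDH P-256": (52, 98, 52),
}


def total_handshake_us(scheme: str) -> int:
    """Sum KeyGen + Encapsulate + Decapsulate for one scheme."""
    return sum(TIMINGS_US[scheme])


def speedup(baseline: str, candidate: str) -> float:
    """How many times faster `candidate` completes a handshake than `baseline`."""
    return total_handshake_us(baseline) / total_handshake_us(candidate)


if __name__ == "__main__":
    for level in ("ML-KEM-512", "ML-KEM-768", "ML-KEM-1024"):
        print(
            f"{level}: {speedup('RSA-2048', level):.1f}x vs RSA-2048, "
            f"{speedup('ECDH P-256', level):.2f}x vs ECDH P-256"
        )
```

A value below 1 in the ECDH column means that ML-KEM level is slower than ECDH P-256.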
Size Benchmarks
| Item | ML-KEM-512 | ML-KEM-768 | ML-KEM-1024 | RSA-2048 | ECDH P-256 |
|---|---|---|---|---|---|
| Public key | 800 B | 1,184 B | 1,568 B | 256 B | 32 B |
| Ciphertext/Key share | 768 B | 1,088 B | 1,568 B | 256 B | 32 B |
| Total handshake data | 1,568 B | 2,272 B | 3,136 B | 512 B | 64 B |
Impact on Protocols
| Protocol | Typical payload | Handshake overhead | ML-KEM-768 impact |
|---|---|---|---|
| TLS 1.3 | ~2–4 KB | +2.3 KB | Acceptable |
| HTTP/2 | ~500 B–20 KB | +2.3 KB | Negligible for most requests |
| DNS over TLS | ~300 B | +2.3 KB | Significant; consider ML-KEM-512 |
| IoT (MQTT) | ~50 B | +2.3 KB | Large; consider ML-KEM-512 or FN-DSA for auth |
| VPN (WireGuard) | Variable | +2.3 KB per handshake | Acceptable; handshake is rare |
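To make the "impact" column more concrete, the handshake overhead can be expressed as a multiple of the typical payload. This is a small sketch using the sizes from the size table and the payload figures from the table above; `overhead_ratio` is a hypothetical helper, and the ratio is only a rough indicator, not a measurement of real protocol behaviour.

```python
# Sizes in bytes, copied from the size table: (public key, ciphertext).
SIZES = {
    "ML-KEM-512": (800, 768),
    "ML-KEM-768": (1184, 1088),
    "ML-KEM-1024": (1568, 1568),
}


def handshake_bytes(scheme: str) -> int:
    """Total key-exchange data: one public key plus one ciphertext."""
    public_key, ciphertext = SIZES[scheme]
    return public_key + ciphertext


def overhead_ratio(scheme: str, payload_bytes: int) -> float:
    """Handshake data expressed as a multiple of the application payload."""
    return handshake_bytes(scheme) / payload_bytes


if __name__ == "__main__":
    # Payload sizes taken from the "Typical payload" column.
    for protocol, payload in (("DNS over TLS", 300), ("IoT (MQTT)", 50)):
        for scheme in ("ML-KEM-512", "ML-KEM-768"):
            ratio = overhead_ratio(scheme, payload)
            print(f"{protocol}, {scheme}: {ratio:.1f}x the payload")
```

For a ~300 B DNS-over-TLS payload, ML-KEM-768 adds about 7.6× the payload size, while ML-KEM-512 adds about 5.2×, which is why the smaller parameter set is suggested for small-payload protocols.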
Throughput: Many Connections
| Scenario | ECDH P-256 | ML-KEM-768 | ML-KEM-1024 |
|---|---|---|---|
| Handshakes/second (single core) | ~5,000 | ~3,500 | ~2,300 |
| Handshakes/second (8 cores) | ~40,000 | ~28,000 | ~18,000 |
| Added latency (99th percentile) | +0.2 ms | +0.3 ms | +0.5 ms |
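The handshake rates in this table are consistent with simply inverting the total handshake times from the speed benchmarks. The sketch below reproduces that estimate; it assumes ideal linear scaling across cores and ignores network and protocol overhead, so treat the results as upper bounds rather than measurements.

```python
# Total handshake time in microseconds, from the speed benchmarks table.
TOTAL_HANDSHAKE_US = {
    "ECDH P-256": 202,
    "ML-KEM-768": 281,
    "ML-KEM-1024": 433,
}


def handshakes_per_second(total_us: float, cores: int = 1) -> float:
    """Estimate throughput, assuming each core runs handshakes back to back."""
    return cores * 1_000_000 / total_us


if __name__ == "__main__":
    for scheme, total_us in TOTAL_HANDSHAKE_US.items():
        single = handshakes_per_second(total_us)
        eight = handshakes_per_second(total_us, cores=8)
        print(f"{scheme}: ~{single:,.0f}/s on 1 core, ~{eight:,.0f}/s on 8 cores")
```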
Optimisations
AVX2 Vectorisation
ML-KEM's polynomial operations (NTT, base multiplication, sampling) vectorise well:
- AVX2 (2013+ CPUs): ~2× speedup over scalar
- AVX-512 (2017+ CPUs): ~3× speedup
ARM NEON
Mobile and embedded processors (ARM Cortex-A53 through Apple M-series):
- NEON vector instructions: ~1.5× speedup
- Apple M1/M2/M3: ML-KEM-768 handshake in ~120 µs
Constant-Time Implementations
This is security-critical: ML-KEM must run in constant time to prevent timing attacks. In particular:
- Rejection sampling loops must not leak iteration counts
- Polynomial comparison must use branch-free bitwise operations, with no early exit
- Reference implementations (pqm4, liboqs) include constant-time variants
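The "no early exit" comparison pattern can be sketched as follows. This is an illustration of the idea only: the values being compared are assumed here to be serialised as byte strings, `ct_equal_sketch` is a hypothetical name, and pure Python gives no real constant-time guarantee. For actual use in Python, the standard library's `hmac.compare_digest` is the appropriate primitive; in C, the libraries listed above provide their own constant-time routines.

```python
import hmac


def ct_equal_sketch(a: bytes, b: bytes) -> bool:
    """Illustrative pattern: accumulate differences, never return early."""
    if len(a) != len(b):  # lengths are assumed to be public, not secret
        return False
    diff = 0
    for x, y in zip(a, b):
        diff |= x ^ y  # stays 0 only if every byte pair matches
    return diff == 0


def ct_equal(a: bytes, b: bytes) -> bool:
    """Preferred in Python: the stdlib comparison designed to resist timing analysis."""
    return hmac.compare_digest(a, b)


if __name__ == "__main__":
    print(ct_equal_sketch(b"\x01\x02\x03", b"\x01\x02\x03"))
    print(ct_equal(b"\x01\x02\x03", b"\x01\x02\x04"))
```

The loop always visits every byte, so its running time does not depend on where the first mismatch occurs, unlike a comparison that returns as soon as two bytes differ.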
Memory Usage
| Operation | ML-KEM-512 | ML-KEM-768 | ML-KEM-1024 |
|---|---|---|---|
| Stack (KeyGen) | ~6 KB | ~8 KB | ~10 KB |
| Stack (Encaps) | ~5 KB | ~7 KB | ~9 KB |
| Stack (Decaps) | ~6 KB | ~9 KB | ~12 KB |
| Heap | 0 KB | 0 KB | 0 KB |
ML-KEM is stack-only — no dynamic allocation needed. Ideal for embedded systems and kernel crypto.
Resources
- pqm4 benchmarks — ARM Cortex-M4 embedded benchmarks
- liboqs — Optimised C implementations with AVX2/NEON
- SUPERCOP — Standardised crypto benchmarking framework