365 Architect

07 — Performance

Speed Benchmarks

All times measured on an Intel Core i7-1165G7 at 2.8 GHz (single thread).

| Operation | ML-KEM-512 | ML-KEM-768 | ML-KEM-1024 | RSA-2048 | ECDH P-256 |
|---|---|---|---|---|---|
| KeyGen | 45 µs | 78 µs | 120 µs | 5,200 µs | 52 µs |
| Encapsulate | 65 µs | 95 µs | 145 µs | 150 µs* | 98 µs |
| Decapsulate | 72 µs | 108 µs | 168 µs | 4,800 µs | 52 µs |
| Total handshake | 182 µs | 281 µs | 433 µs | 10,150 µs | 202 µs |

*RSA "encapsulate" = encryption with public key

What This Means

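  • ML-KEM-768 completes a full handshake ~36× faster than RSA-2048 (281 µs vs 10,150 µs); across the three parameter sets the speedup ranges from ~23× to ~56×
  • ML-KEM is comparable to ECDH P-256 in total handshake time: ML-KEM-512 is slightly faster (182 µs vs 202 µs) and ML-KEM-768 is about 1.4× slower
  • Moving up a security level costs little in absolute terms: ML-KEM-1024 adds only ~250 µs over ML-KEM-512, so security is cheap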
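
The ratios above follow directly from the benchmark table. A minimal Python sketch that recomputes the handshake totals and the speedups; the dictionary layout and function names are illustrative, and only the timings come from the table:

```python
# Timings in microseconds, copied from the Speed Benchmarks table.
# Tuple order: (KeyGen, Encapsulate, Decapsulate).
TIMINGS_US = {
    "ML-KEM-512": (45, 65, 72),
    "ML-KEM-768": (78, 95, 108),
    "ML-KEM-1024": (120, 145, 168),
    "RSA-2048": (5200, 150, 4800),
    "ECDH P-256": (52, 98, 52),
}


def handshake_total_us(scheme: str) -> int:
    """Sum of the three operations, matching the 'Total handshake' row."""
    return sum(TIMINGS_US[scheme])


def speedup(scheme: str, baseline: str) -> float:
    """How many times faster `scheme` completes a handshake than `baseline`."""
    return handshake_total_us(baseline) / handshake_total_us(scheme)


if __name__ == "__main__":
    for name in TIMINGS_US:
        print(
            f"{name:12s} total {handshake_total_us(name):6d} µs  "
            f"vs RSA-2048: {speedup(name, 'RSA-2048'):5.1f}x  "
            f"vs ECDH P-256: {speedup(name, 'ECDH P-256'):4.2f}x"
        )
```

With the table's numbers this gives roughly 56×, 36×, and 23× over RSA-2048 for ML-KEM-512, ML-KEM-768, and ML-KEM-1024 respectively.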

Size Benchmarks

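| Item | ML-KEM-512 | ML-KEM-768 | ML-KEM-1024 | RSA-2048 | ECDH P-256 |
|---|---|---|---|---|---|
| Public key | 800 B | 1,184 B | 1,568 B | 256 B | 32 B† |
| Ciphertext/Key share | 768 B | 1,088 B | 1,568 B | 256 B | 32 B† |
| Total handshake data | 1,568 B | 2,272 B | 3,136 B | 512 B | 64 B |

†Size of the x-coordinate alone; an uncompressed P-256 point, as sent in a TLS key share, is 65 B.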
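
The "Total handshake data" row is the public key plus the ciphertext (or key share). A short Python sketch, with illustrative names and the sizes copied from the table as given, that recomputes the totals and expresses them relative to a classical baseline:

```python
# Sizes in bytes, copied from the Size Benchmarks table.
# Tuple order: (public key, ciphertext or key share).
SIZES_B = {
    "ML-KEM-512": (800, 768),
    "ML-KEM-768": (1184, 1088),
    "ML-KEM-1024": (1568, 1568),
    "RSA-2048": (256, 256),
    "ECDH P-256": (32, 32),
}


def handshake_bytes(scheme: str) -> int:
    """Bytes exchanged per handshake: public key plus ciphertext/key share."""
    return sum(SIZES_B[scheme])


def size_ratio(scheme: str, baseline: str = "ECDH P-256") -> float:
    """How many times more handshake data `scheme` sends than `baseline`."""
    return handshake_bytes(scheme) / handshake_bytes(baseline)


if __name__ == "__main__":
    for name in SIZES_B:
        print(f"{name:12s} {handshake_bytes(name):5d} B  "
              f"{size_ratio(name):5.1f}x ECDH P-256")
```

On these figures ML-KEM-768 sends 35.5× the bytes of the ECDH P-256 column and about 4.4× those of RSA-2048, which is the size cost paid for the speed advantage.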

Impact on Protocols

| Protocol | Typical payload | Handshake overhead | ML-KEM-768 impact |
|---|---|---|---|
| TLS 1.3 | ~2–4 KB | +2.3 KB | Acceptable |
| HTTP/2 | ~500 B–20 KB | +2.3 KB | Negligible for most requests |
| DNS over TLS | ~300 B | +2.3 KB | Significant; consider ML-KEM-512 |
| IoT (MQTT) | ~50 B | +2.3 KB | Large; consider ML-KEM-512 or FN-DSA for auth |
| VPN (WireGuard) | Variable | +2.3 KB per handshake | Acceptable; handshake is rare |
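
One way to read the impact column is as the ratio of added handshake bytes to the typical payload. The sketch below makes that explicit. It is an illustration rather than the method behind the table: the names are hypothetical, "KB" is read as 1,000 bytes, and the payload ranges are taken at their stated bounds.

```python
# Added bytes per handshake (public key + ciphertext), from the size table.
KEM_OVERHEAD_B = {"ML-KEM-512": 1568, "ML-KEM-768": 2272}

# Typical payloads from the protocol table as (low, high) bounds in bytes.
# Assumption: 1 KB = 1,000 bytes.
TYPICAL_PAYLOAD_B = {
    "TLS 1.3": (2000, 4000),
    "HTTP/2": (500, 20000),
    "DNS over TLS": (300, 300),
    "IoT (MQTT)": (50, 50),
}


def overhead_ratio(protocol: str, kem: str) -> tuple[float, float]:
    """Return (best case, worst case) of added bytes divided by payload bytes."""
    low, high = TYPICAL_PAYLOAD_B[protocol]
    extra = KEM_OVERHEAD_B[kem]
    return extra / high, extra / low


if __name__ == "__main__":
    for protocol in TYPICAL_PAYLOAD_B:
        for kem in KEM_OVERHEAD_B:
            best, worst = overhead_ratio(protocol, kem)
            print(f"{protocol:13s} {kem:11s} {best:6.2f}x to {worst:6.2f}x payload")
```

For a 50-byte MQTT message the ML-KEM-768 handshake is about 45 times the payload, and switching to ML-KEM-512 still leaves it at about 31 times, which is why the handshake should be amortised over many messages on constrained links.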

Throughput: Many Connections

| Scenario | ECDH P-256 | ML-KEM-768 | ML-KEM-1024 |
|---|---|---|---|
| Handshakes/second (single core) | ~5,000 | ~3,500 | ~2,300 |
| Handshakes/second (8 cores) | ~40,000 | ~28,000 | ~18,000 |
| Latency at 99th percentile | +0.2 ms | +0.3 ms | +0.5 ms |
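
The throughput figures are close to the reciprocal of the "Total handshake" times from the speed benchmarks. Whether they were derived that way is an assumption; the sketch below simply shows the arithmetic, assuming one core performs all three operations per handshake and that throughput scales linearly with core count.

```python
# Total handshake time in microseconds, from the Speed Benchmarks table.
HANDSHAKE_US = {"ECDH P-256": 202, "ML-KEM-768": 281, "ML-KEM-1024": 433}


def handshakes_per_second(scheme: str, cores: int = 1) -> float:
    """Upper-bound estimate: 1 / latency per core, scaled linearly by cores."""
    return cores * 1_000_000 / HANDSHAKE_US[scheme]


if __name__ == "__main__":
    for name in HANDSHAKE_US:
        print(f"{name:12s} 1 core: {handshakes_per_second(name):7.0f}/s   "
              f"8 cores: {handshakes_per_second(name, cores=8):7.0f}/s")
```

Real servers rarely scale perfectly with cores and spend time on work other than key exchange, so these values are best read as a ceiling.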

Optimisations

AVX2 Vectorisation

ML-KEM's polynomial operations (NTT, base-case multiplication, sampling) vectorise well:

  • AVX2 (2013+ CPUs): ~2× speedup over scalar
  • AVX-512 (2017+ CPUs): ~3× speedup
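
To see why the NTT suits SIMD, it helps to look at its inner loop. The following is a plain-Python sketch of the layered butterfly structure used by ML-KEM's NTT, with q = 3329 and the root of unity 17 taken from FIPS 203. It is written for readability only: it is neither optimised nor constant-time, and the function names are illustrative.

```python
Q = 3329    # ML-KEM modulus
ROOT = 17   # primitive 256th root of unity modulo Q
N = 256     # number of polynomial coefficients


def bitrev7(i: int) -> int:
    """Reverse the 7-bit binary representation of i."""
    return int(f"{i:07b}"[::-1], 2)


# Precomputed twiddle factors in bit-reversed order.
ZETAS = [pow(ROOT, bitrev7(i), Q) for i in range(128)]


def ntt(f):
    """Forward NTT over seven layers of butterflies."""
    f = list(f)
    i, length = 1, 128
    while length >= 2:
        for start in range(0, N, 2 * length):
            zeta = ZETAS[i]
            i += 1
            # Each j touches only f[j] and f[j + length], so the iterations
            # are independent and a SIMD unit can process several at once.
            for j in range(start, start + length):
                t = zeta * f[j + length] % Q
                f[j + length] = (f[j] - t) % Q
                f[j] = (f[j] + t) % Q
        length //= 2
    return f


def inv_ntt(f_hat):
    """Inverse NTT: the same layers in reverse, then scaling by 128^-1."""
    f = list(f_hat)
    i, length = 127, 2
    while length <= 128:
        for start in range(0, N, 2 * length):
            zeta = ZETAS[i]
            i -= 1
            for j in range(start, start + length):
                t = f[j]
                f[j] = (t + f[j + length]) % Q
                f[j + length] = zeta * (f[j + length] - t) % Q
        length *= 2
    return [x * 3303 % Q for x in f]  # 3303 is the inverse of 128 modulo Q


if __name__ == "__main__":
    poly = [(7 * k + 3) % Q for k in range(N)]
    assert inv_ntt(ntt(poly)) == poly
    print("round trip ok")
```

A vectorised implementation applies the same butterfly to a whole register of 16-bit coefficients per instruction, which is the source of the speedups listed above.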

ARM NEON

Mobile and embedded processors (ARM Cortex-A53 through Apple M-series):

  • NEON vector instructions: ~1.5× speedup over scalar code
  • Apple M1/M2/M3: ML-KEM-768 handshake in ~120 µs

Constant-Time Implementations

This part is security-critical. ML-KEM must run in constant time with respect to secret data to prevent timing attacks:

  • Rejection-sampling loops must not leak iteration counts
  • Polynomial comparison must use bitwise operations, not an early-exit loop
  • Widely used implementations (pqm4, liboqs) include constant-time variants
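
The difference between an early-exit comparison and a bitwise one can be sketched in a few lines. Python is used here only to show the pattern: the interpreter gives no timing guarantees, so production code implements this in C or assembly. The function names are illustrative; `hmac.compare_digest` is the standard library's own comparison intended for this purpose.

```python
import hmac


def leaky_equal(a: bytes, b: bytes) -> bool:
    """Early-exit comparison. Running time reveals where the first mismatch is."""
    if len(a) != len(b):
        return False
    for x, y in zip(a, b):
        if x != y:
            return False  # returns sooner the earlier the inputs differ
    return True


def bitwise_equal(a: bytes, b: bytes) -> bool:
    """Accumulate differences with XOR and OR so every byte is always visited."""
    if len(a) != len(b):  # lengths are treated as public here
        return False
    diff = 0
    for x, y in zip(a, b):
        diff |= x ^ y  # stays 0 only if all byte pairs match
    return diff == 0


if __name__ == "__main__":
    good = bytes(range(32))
    bad = bytes([255]) + good[1:]
    print(bitwise_equal(good, good), bitwise_equal(good, bad))
    print(hmac.compare_digest(good, good), hmac.compare_digest(good, bad))
```

Both functions return the same answers; only the second performs the same amount of work regardless of where the inputs differ.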

Memory Usage

| Operation | ML-KEM-512 | ML-KEM-768 | ML-KEM-1024 |
|---|---|---|---|
| Stack (KeyGen) | ~6 KB | ~8 KB | ~10 KB |
| Stack (Encaps) | ~5 KB | ~7 KB | ~9 KB |
| Stack (Decaps) | ~6 KB | ~9 KB | ~12 KB |
| Heap | 0 | 0 | 0 |

ML-KEM needs no dynamic allocation: all working memory lives on the stack, which makes it well suited to embedded systems and kernel crypto.

Resources

  • pqm4 benchmarks — ARM Cortex-M4 embedded benchmarks
  • liboqs — Optimised C implementations with AVX2/NEON
  • SUPERCOP — Standardised crypto benchmarking framework