04 — Fast Fourier Transform
The Problem: Polynomial Multiplication
In lattice cryptography, we multiply polynomials constantly:
Where and have coefficients (e.g., or ).
Naive Multiplication
For each coefficient of the result:
Complexity: O(n²) operations
- n=512: ~262,000 multiplications
- n=1,024: ~1,048,576 multiplications
The NTT Speed-Up
The Number Theoretic Transform (NTT) reduces this to O(n log n):
- n=512: ~4,608 operations (57× faster)
- n=1,024: ~10,240 operations (102× faster)
How NTT Works
The Analogy: Change of Basis
Imagine adding two vectors:
- In standard basis: Add each pair of coordinates
- In a "smart" basis: If the vectors are "diagonal," just add the diagonal elements
NTT is a "smart basis" for polynomials where:
- Multiplication becomes pointwise (element-by-element)
- Addition stays element-by-element
- Convolution (circular) becomes pointwise multiplication
The Math
The NTT evaluates the polynomial at special points — the n-th roots of unity modulo q:
Where ω is a primitive n-th root of unity (a number such that ωn ≡ 1 mod q, and ωk ≠ 1 for k < n).
Key Properties
- Forward NTT: Convert polynomial coefficients to "frequency" domain
- Pointwise multiplication: Multiply corresponding elements
- Inverse NTT: Convert back to coefficient domain
Where ∘ is pointwise multiplication (Hadamard product).
The "Butterfly" Operation
The FFT/NTT algorithm recursively splits the computation:
Input: [a0, a1, a2, a3, a4, a5, a6, a7]
Step 1: Split even/odd indices
Even: [a0, a2, a4, a6]
Odd: [a1, a3, a5, a7]
Step 2: Recursively NTT each half
NTT([a0, a2, a4, a6]) -> [E0, E1, E2, E3]
NTT([a1, a3, a5, a7]) -> [O0, O1, O2, O3]
Step 3: Combine with twiddle factors
Result[k] = E_k + ω^k · O_k
Result[k + n/2] = E_k - ω^k · O_k
This "butterfly" structure requires only O(n log n) operations.
NTT in FN-DSA
FN-DSA uses NTT for multiple operations:
| Operation | Without NTT | With NTT | Speedup |
|---|---|---|---|
| Key generation | O(n³) | O(n² log n) | ~100× |
| Trapdoor sampling | O(n³) | O(n² log n) | ~50× |
| Signature verification | O(n²) | O(n log n) | ~50× |
| Public key computation | O(n²) | O(n log n) | ~50× |
Why FN-DSA Is "Fast Fourier"
The name "FALCON" (Fast-Fourier Lattice-based Compact Signatures over NTRU) highlights the centrality of FFT/NTT:
- Fast-Fourier: The NTT is a finite-field version of the FFT
- Lattice-based: The mathematical foundation
- Compact signatures: Achieved through tight Gaussian sampling (enabled by NTT)
- NTRU: The specific lattice family
Concrete Speed Numbers
| Operation | FN-DSA-512 | FN-DSA-1024 | Platform |
|---|---|---|---|
| KeyGen | ~0.3 ms | ~0.7 ms | Intel Core i7 |
| Sign | ~0.5 ms | ~1.2 ms | Intel Core i7 |
| Verify | ~0.08 ms | ~0.15 ms | Intel Core i7 |
| KeyGen | ~2 ms | ~5 ms | ARM Cortex-A72 |
| Sign | ~3 ms | ~8 ms | ARM Cortex-A72 |
| Verify | ~0.5 ms | ~1 ms | ARM Cortex-A72 |
Verification is very fast — comparable to ECDSA and faster than ML-DSA.
NTT vs. Classical FFT
| Property | Classical FFT (complex numbers) | NTT (finite fields) |
|---|---|---|
| Domain | Complex numbers ℂ | Integers modulo prime q |
| Roots of unity | e^(2πi/n) | Primitive root mod q |
| Precision | Floating-point (rounding errors) | Exact integer arithmetic |
| Applicability | Signal processing, ML | Cryptography |
| Speed | Very fast (hardware optimised) | Fast (bitwise mod operations) |
Critical difference: NTT uses exact arithmetic — no rounding errors. This is essential for cryptography where every bit matters.
Implementation Challenges
Finding Good Primes
For NTT to work, the modulus q must support primitive n-th roots of unity:
- q ≡ 1 (mod n)
- q is prime
FN-DSA uses q = 12,289:
- 12,289 = 2^12 × 3 + 1 = 4,096 × 3 + 1
- Supports NTT up to n = 4,096 (more than enough for n=512 or 1,024)
- Fits in 16-bit integers (efficient on embedded processors)
Memory Layout
NTT requires in-place computation (no extra memory):
Input array (n elements)
|
v
Butterfly stages (log₂(n) stages)
|
v
Output array (same memory as input)
This is ideal for embedded systems with limited RAM.
Vectorisation
Modern CPUs support SIMD (Single Instruction, Multiple Data):
- AVX2 (Intel): 256-bit vectors → process 16 coefficients simultaneously
- NEON (ARM): 128-bit vectors → process 8 coefficients simultaneously
- AVX-512 (Intel): 512-bit vectors → process 32 coefficients simultaneously
Speedup: 8–16× over scalar implementations.
Resources
- Cooley & Tukey, "An algorithm for the machine calculation of complex Fourier series" (1965), Math. Comp. — Original FFT paper
- Nussbaumer, "Fast Fourier Transform and Convolution Algorithms" (1981), Springer
- Pöppelmann et al., "High-performance ideal lattice-based cryptography on 8-bit ATxmega microcontrollers" (2015), LATINCRYPT
- NIST FIPS 206, Section 6: Fast Fourier Transform in FN-DSA