Live Document

5 min read935 wordsv1.0.1.0Last Updated: 29 Jul 2026

04 — Fast Fourier Transform

The Problem: Polynomial Multiplication

In lattice cryptography, we multiply polynomials constantly:

$c(x) = a(x) \cdot b(x) \pmod{x^n + 1}$

Where $a(x)$ and $b(x)$ have $n$ coefficients (e.g., $n = 512$ or $1,024$ ).

Naive Multiplication

For each coefficient of the result:

$c_k = \sum_{i=0}^{k} a_i \cdot b_{k-i} \pmod q$

Complexity: O(n²) operations

n=512: ~262,000 multiplications
n=1,024: ~1,048,576 multiplications

The NTT Speed-Up

The Number Theoretic Transform (NTT) reduces this to O(n log n):

n=512: ~4,608 operations (57× faster)
n=1,024: ~10,240 operations (102× faster)

How NTT Works

The Analogy: Change of Basis

Imagine adding two vectors:

In standard basis: Add each pair of coordinates
In a "smart" basis: If the vectors are "diagonal," just add the diagonal elements

NTT is a "smart basis" for polynomials where:

Multiplication becomes pointwise (element-by-element)
Addition stays element-by-element
Convolution (circular) becomes pointwise multiplication

The Math

The NTT evaluates the polynomial at special points — the n-th roots of unity modulo q:

$\hat{a}_k = \sum_{j=0}^{n-1} a_j \cdot \omega^{j \cdot k} \pmod q$

Where ω is a primitive n-th root of unity (a number such that ω^{n ≡ 1 mod q, and ω}k ≠ 1 for k < n).

Key Properties

Forward NTT: Convert polynomial coefficients to "frequency" domain
Pointwise multiplication: Multiply corresponding elements
Inverse NTT: Convert back to coefficient domain

$c = \text{INTT}(\text{NTT}(a) \circ \text{NTT}(b))$

Where ∘ is pointwise multiplication (Hadamard product).

The "Butterfly" Operation

The FFT/NTT algorithm recursively splits the computation:

Input: [a0, a1, a2, a3, a4, a5, a6, a7]

Step 1: Split even/odd indices
  Even: [a0, a2, a4, a6]
  Odd:  [a1, a3, a5, a7]

Step 2: Recursively NTT each half
  NTT([a0, a2, a4, a6]) -> [E0, E1, E2, E3]
  NTT([a1, a3, a5, a7]) -> [O0, O1, O2, O3]

Step 3: Combine with twiddle factors
  Result[k] = E_k + ω^k · O_k
  Result[k + n/2] = E_k - ω^k · O_k

This "butterfly" structure requires only O(n log n) operations.

NTT in FN-DSA

FN-DSA uses NTT for multiple operations:

Operation	Without NTT	With NTT	Speedup
Key generation	O(n³)	O(n² log n)	~100×
Trapdoor sampling	O(n³)	O(n² log n)	~50×
Signature verification	O(n²)	O(n log n)	~50×
Public key computation	O(n²)	O(n log n)	~50×

Why FN-DSA Is "Fast Fourier"

The name "FALCON" (Fast-Fourier Lattice-based Compact Signatures over NTRU) highlights the centrality of FFT/NTT:

Fast-Fourier: The NTT is a finite-field version of the FFT
Lattice-based: The mathematical foundation
Compact signatures: Achieved through tight Gaussian sampling (enabled by NTT)
NTRU: The specific lattice family

Concrete Speed Numbers

Operation	FN-DSA-512	FN-DSA-1024	Platform
KeyGen	~0.3 ms	~0.7 ms	Intel Core i7
Sign	~0.5 ms	~1.2 ms	Intel Core i7
Verify	~0.08 ms	~0.15 ms	Intel Core i7
KeyGen	~2 ms	~5 ms	ARM Cortex-A72
Sign	~3 ms	~8 ms	ARM Cortex-A72
Verify	~0.5 ms	~1 ms	ARM Cortex-A72

Verification is very fast — comparable to ECDSA and faster than ML-DSA.

NTT vs. Classical FFT

Property	Classical FFT (complex numbers)	NTT (finite fields)
Domain	Complex numbers ℂ	Integers modulo prime q
Roots of unity	e^(2πi/n)	Primitive root mod q
Precision	Floating-point (rounding errors)	Exact integer arithmetic
Applicability	Signal processing, ML	Cryptography
Speed	Very fast (hardware optimised)	Fast (bitwise mod operations)

Critical difference: NTT uses exact arithmetic — no rounding errors. This is essential for cryptography where every bit matters.

Implementation Challenges

Finding Good Primes

For NTT to work, the modulus q must support primitive n-th roots of unity:

q ≡ 1 (mod n)
q is prime

FN-DSA uses q = 12,289:

12,289 = 2^12 × 3 + 1 = 4,096 × 3 + 1
Supports NTT up to n = 4,096 (more than enough for n=512 or 1,024)
Fits in 16-bit integers (efficient on embedded processors)

Memory Layout

NTT requires in-place computation (no extra memory):

Input array (n elements)
  |
  v
Butterfly stages (log₂(n) stages)
  |
  v
Output array (same memory as input)

This is ideal for embedded systems with limited RAM.

Vectorisation

Modern CPUs support SIMD (Single Instruction, Multiple Data):

AVX2 (Intel): 256-bit vectors → process 16 coefficients simultaneously
NEON (ARM): 128-bit vectors → process 8 coefficients simultaneously
AVX-512 (Intel): 512-bit vectors → process 32 coefficients simultaneously

Speedup: 8–16× over scalar implementations.

Resources

Cooley & Tukey, "An algorithm for the machine calculation of complex Fourier series" (1965), Math. Comp. — Original FFT paper
Nussbaumer, "Fast Fourier Transform and Convolution Algorithms" (1981), Springer
Pöppelmann et al., "High-performance ideal lattice-based cryptography on 8-bit ATxmega microcontrollers" (2015), LATINCRYPT
NIST FIPS 206, Section 6: Fast Fourier Transform in FN-DSA