365 Architect

04 — Fast Fourier Transform

The Problem: Polynomial Multiplication

In lattice cryptography, we multiply polynomials constantly:

c(x)=a(x)b(x)(modxn+1)\displaystyle c(x) = a(x) \cdot b(x) \pmod{x^n + 1}

Where a(x)a(x) and b(x)b(x) have nn coefficients (e.g., n=512n = 512 or 1,0241,024).

Naive Multiplication

For each coefficient of the result:

ck=i=0kaibki(modq)\displaystyle c_k = \sum_{i=0}^{k} a_i \cdot b_{k-i} \pmod q

Complexity: O(n²) operations

  • n=512: ~262,000 multiplications
  • n=1,024: ~1,048,576 multiplications

The NTT Speed-Up

The Number Theoretic Transform (NTT) reduces this to O(n log n):

  • n=512: ~4,608 operations (57× faster)
  • n=1,024: ~10,240 operations (102× faster)

How NTT Works

The Analogy: Change of Basis

Imagine adding two vectors:

  • In standard basis: Add each pair of coordinates
  • In a "smart" basis: If the vectors are "diagonal," just add the diagonal elements

NTT is a "smart basis" for polynomials where:

  • Multiplication becomes pointwise (element-by-element)
  • Addition stays element-by-element
  • Convolution (circular) becomes pointwise multiplication

The Math

The NTT evaluates the polynomial at special points — the n-th roots of unity modulo q:

a^k=j=0n1ajωjk(modq)\displaystyle \hat{a}_k = \sum_{j=0}^{n-1} a_j \cdot \omega^{j \cdot k} \pmod q

Where ω is a primitive n-th root of unity (a number such that ωn ≡ 1 mod q, and ωk ≠ 1 for k < n).

Key Properties

  1. Forward NTT: Convert polynomial coefficients to "frequency" domain
  2. Pointwise multiplication: Multiply corresponding elements
  3. Inverse NTT: Convert back to coefficient domain

c=INTT(NTT(a)NTT(b))\displaystyle c = \text{INTT}(\text{NTT}(a) \circ \text{NTT}(b))

Where is pointwise multiplication (Hadamard product).

The "Butterfly" Operation

The FFT/NTT algorithm recursively splits the computation:

Input: [a0, a1, a2, a3, a4, a5, a6, a7]

Step 1: Split even/odd indices
  Even: [a0, a2, a4, a6]
  Odd:  [a1, a3, a5, a7]

Step 2: Recursively NTT each half
  NTT([a0, a2, a4, a6]) -> [E0, E1, E2, E3]
  NTT([a1, a3, a5, a7]) -> [O0, O1, O2, O3]

Step 3: Combine with twiddle factors
  Result[k] = E_k + ω^k · O_k
  Result[k + n/2] = E_k - ω^k · O_k

This "butterfly" structure requires only O(n log n) operations.

NTT in FN-DSA

FN-DSA uses NTT for multiple operations:

Operation Without NTT With NTT Speedup
Key generation O(n³) O(n² log n) ~100×
Trapdoor sampling O(n³) O(n² log n) ~50×
Signature verification O(n²) O(n log n) ~50×
Public key computation O(n²) O(n log n) ~50×

Why FN-DSA Is "Fast Fourier"

The name "FALCON" (Fast-Fourier Lattice-based Compact Signatures over NTRU) highlights the centrality of FFT/NTT:

  • Fast-Fourier: The NTT is a finite-field version of the FFT
  • Lattice-based: The mathematical foundation
  • Compact signatures: Achieved through tight Gaussian sampling (enabled by NTT)
  • NTRU: The specific lattice family

Concrete Speed Numbers

Operation FN-DSA-512 FN-DSA-1024 Platform
KeyGen ~0.3 ms ~0.7 ms Intel Core i7
Sign ~0.5 ms ~1.2 ms Intel Core i7
Verify ~0.08 ms ~0.15 ms Intel Core i7
KeyGen ~2 ms ~5 ms ARM Cortex-A72
Sign ~3 ms ~8 ms ARM Cortex-A72
Verify ~0.5 ms ~1 ms ARM Cortex-A72

Verification is very fast — comparable to ECDSA and faster than ML-DSA.

NTT vs. Classical FFT

Property Classical FFT (complex numbers) NTT (finite fields)
Domain Complex numbers ℂ Integers modulo prime q
Roots of unity e^(2πi/n) Primitive root mod q
Precision Floating-point (rounding errors) Exact integer arithmetic
Applicability Signal processing, ML Cryptography
Speed Very fast (hardware optimised) Fast (bitwise mod operations)

Critical difference: NTT uses exact arithmetic — no rounding errors. This is essential for cryptography where every bit matters.

Implementation Challenges

Finding Good Primes

For NTT to work, the modulus q must support primitive n-th roots of unity:

  • q ≡ 1 (mod n)
  • q is prime

FN-DSA uses q = 12,289:

  • 12,289 = 2^12 × 3 + 1 = 4,096 × 3 + 1
  • Supports NTT up to n = 4,096 (more than enough for n=512 or 1,024)
  • Fits in 16-bit integers (efficient on embedded processors)

Memory Layout

NTT requires in-place computation (no extra memory):

Input array (n elements)
  |
  v
Butterfly stages (log₂(n) stages)
  |
  v
Output array (same memory as input)

This is ideal for embedded systems with limited RAM.

Vectorisation

Modern CPUs support SIMD (Single Instruction, Multiple Data):

  • AVX2 (Intel): 256-bit vectors → process 16 coefficients simultaneously
  • NEON (ARM): 128-bit vectors → process 8 coefficients simultaneously
  • AVX-512 (Intel): 512-bit vectors → process 32 coefficients simultaneously

Speedup: 8–16× over scalar implementations.

Resources

  • Cooley & Tukey, "An algorithm for the machine calculation of complex Fourier series" (1965), Math. Comp. — Original FFT paper
  • Nussbaumer, "Fast Fourier Transform and Convolution Algorithms" (1981), Springer
  • Pöppelmann et al., "High-performance ideal lattice-based cryptography on 8-bit ATxmega microcontrollers" (2015), LATINCRYPT
  • NIST FIPS 206, Section 6: Fast Fourier Transform in FN-DSA
Share on LinkedIn