Tera-Scale 1D FFT with Low-Communication Algorithm on Intel Xeon Phi Coprocessors

SESSION: Performance Analysis of Applications at Large Scale


AUTHOR(S): Jongsoo Park, Ganesh Bikshandi, Karthikeyan Vaidyanathan, Ping Tak Peter Tang, Pradeep Dubey, Daehyun Kim


This paper demonstrates the first tera-scale performance of Intel Xeon Phi coprocessors on 1D FFT computations. Applying a disciplined performance programming methodology of sound algorithm choice, a valid performance model, and well-executed optimizations, we break the teraflop mark on a mere 64 nodes of Xeon Phi and reach 6.7 TFLOPS with 512 nodes, which is 1.5x what is achievable on the same number of Intel Xeon nodes. Fully utilizing the compute capability of many-core, wide-vector processors is a challenge for bandwidth-bound FFT computation. We leverage a new algorithm, Segment-of-Interest FFT, which has low inter-node communication cost, and aggressively optimize data movement in node-local computations by exploiting caches. Our pairing of a low-communication algorithm with a massively parallel architecture for scalable performance is not limited to running FFT on Xeon Phi; it can serve as a reference for other bandwidth-bound computations and for emerging HPC systems that are increasingly communication-limited.

Jongsoo Park - Intel Corporation

Ganesh Bikshandi - Intel Corporation

Karthikeyan Vaidyanathan - Intel Corporation

Ping Tak Peter Tang - Intel Corporation

Pradeep Dubey - Intel Corporation

Daehyun Kim - Intel Corporation

