The International Conference for High Performance Computing, Networking, Storage and Analysis
Enabling Efficient Intra-Warp Communication for Fourier Transforms in a Many-Core Architecture.
Student: Carlo del Mundo (Virginia Tech)
Supervisor: Wu-chun Feng (Virginia Tech)
Abstract: Shuffle, a new mechanism in NVIDIA GPUs that allows direct register-to-register data exchange within a warp, aims to reduce the shared-memory footprint of data communication. Despite vendor claims about its efficacy, the mechanism is poorly understood, and few works demonstrate a performance improvement from it. We therefore seek to characterize the behavior of shuffle and provide insight into optimizing applications that rely on intra-warp communication.
We evaluated the efficacy of the shuffle mechanism on matrix transpose, which forms the communication stage of a 1D FFT code. Our study indicates that refactoring algorithms to fit the shuffle paradigm requires careful co-design of software and hardware. In particular, algorithmic decisions should avoid allocating and using CUDA local memory at all costs. Overall, our optimized shuffle version accelerates matrix transpose by up to 44%, which yields an overall application speedup of 1.17-fold for a 256-point FFT.
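To make the idea of a shuffle-based transpose concrete, the following is a minimal sketch and not the authors' implementation. It assumes a 4x4 tile spread over four adjacent lanes, each lane holding one row in registers, and it uses the `__shfl_sync` intrinsic from CUDA 9 and later. The kernel name `transpose4x4_shuffle`, the helper `rotate_left`, and the host routine `run_transpose_check` are hypothetical names chosen for illustration. The in-thread rotations use only compile-time register indices, which is one way to try to keep the compiler from placing the per-thread array in local memory.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Rotate four values left by `by` (0..3): afterwards v[k] holds the old
// v[(k + by) % 4]. Every array index is a compile-time constant, so the
// compiler is free to keep v in registers instead of local memory.
__device__ __forceinline__ void rotate_left(float (&v)[4], int by)
{
    if (by & 1) { float t = v[0]; v[0] = v[1]; v[1] = v[2]; v[2] = v[3]; v[3] = t; }
    if (by & 2) { float t0 = v[0], t1 = v[1]; v[0] = v[2]; v[1] = v[3]; v[2] = t0; v[3] = t1; }
}

// Illustrative kernel (assumption: launched with exactly one full warp).
// Each group of 4 lanes transposes its own 4x4 tile; lane t holds row t.
__global__ void transpose4x4_shuffle(const float *in, float *out)
{
    const int lane = threadIdx.x;
    const int t = lane & 3;              // position inside the 4-lane group

    float v[4];
    #pragma unroll
    for (int k = 0; k < 4; ++k) v[k] = in[lane * 4 + k];

    // Stage 1: in-thread rotation, so that v[k] = A[t][(k + t) % 4].
    rotate_left(v, t);

    // Stage 2: register-to-register exchange. In round k, lane t reads
    // register k of lane (t - k) mod 4, which holds A[(t - k) mod 4][t].
    float u[4];
    #pragma unroll
    for (int k = 0; k < 4; ++k)
        u[k] = __shfl_sync(0xffffffffu, v[k], (t - k) & 3, 4);

    // Stage 3: static reversal plus in-thread rotation, so that w[j] = A[j][t].
    float w[4] = { u[0], u[3], u[2], u[1] };
    rotate_left(w, (4 - t) & 3);

    #pragma unroll
    for (int k = 0; k < 4; ++k) out[lane * 4 + k] = w[k];
}

// Runs the kernel on 8 tiles (one warp) and returns the number of
// mismatching elements, or -1 if a CUDA call fails.
int run_transpose_check()
{
    const int n = 32 * 4;
    float h_in[n], h_out[n];
    for (int i = 0; i < n; ++i) h_in[i] = (float)i;

    float *d_in = nullptr, *d_out = nullptr;
    if (cudaMalloc(&d_in, n * sizeof(float)) != cudaSuccess) return -1;
    if (cudaMalloc(&d_out, n * sizeof(float)) != cudaSuccess) { cudaFree(d_in); return -1; }

    cudaMemcpy(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice);
    transpose4x4_shuffle<<<1, 32>>>(d_in, d_out);
    cudaError_t err = cudaMemcpy(h_out, d_out, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    if (err != cudaSuccess) return -1;

    int bad = 0;
    for (int g = 0; g < 8; ++g)          // tile index
        for (int r = 0; r < 4; ++r)      // output row
            for (int c = 0; c < 4; ++c)  // output column
                if (h_out[(g * 4 + r) * 4 + c] != h_in[(g * 4 + c) * 4 + r]) ++bad;
    return bad;
}

int main()
{
    int bad = run_transpose_check();
    printf("mismatches: %d\n", bad);
    return bad != 0;
}
```

Whether the per-thread arrays actually stay in registers depends on the compiler and target architecture; inspecting the output of `nvcc -Xptxas -v` for local-memory usage is one way to check.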