NOV 16-22, 2013

Enabling Efficient Intra-Warp Communication for Fourier Transforms in a Many-Core Architecture

SESSION: ACM Student Research Competition Poster Reception

EVENT TYPE: ACM Student Research Competition Posters, ACM Student Research Competition

TIME: 5:15PM - 7:00PM

AUTHOR(S): Carlo del Mundo

ROOM: Mile High Pre-Function

Shuffle, a new mechanism in NVIDIA GPUs that allows direct register-to-register data exchange within a warp, aims to reduce the shared memory footprint of data communication. Despite vendor claims about its efficacy, the mechanism is poorly understood, and few works demonstrate a performance improvement from it. We therefore characterize the behavior of shuffle and provide insight into optimizing applications that rely on intra-warp communication. We evaluated the efficacy of the shuffle mechanism in the context of matrix transpose, used as part of the communication stage in a 1D FFT code. Our study indicates that refactoring algorithms to fit the shuffle paradigm requires careful co-design between software and hardware. In particular, algorithmic decisions should avoid allocating and using CUDA local memory at all costs. Overall, our optimized shuffle version accelerates matrix transpose by up to 44%, with an overall application speedup of 1.17-fold for a 256-point FFT.
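
As an illustration of the register-to-register exchange described in the abstract, the following is a minimal sketch of a warp-level transpose. It is not the poster's implementation. The kernel name `warpTranspose` and the 4 x 8 tile shape (one element per lane) are hypothetical choices made for this example. The sketch uses `__shfl_sync`, the form available since CUDA 9; toolkits from 2013 exposed the same operation as `__shfl`, without the mask argument. Each lane keeps its element in a single scalar, so there is no dynamically indexed per-thread array, which is the kind of construct the compiler typically places in local memory.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical tile shape: 4 x 8 = 32 elements, one per lane of a warp.
constexpr int ROWS = 4;
constexpr int COLS = 8;

// Transposes a ROWS x COLS row-major tile held one element per lane.
// Must be launched with exactly one warp (32 threads).
__global__ void warpTranspose(const float *in, float *out)
{
    const int lane = threadIdx.x;
    float v = in[lane];              // input element (lane / COLS, lane % COLS)

    // In the transposed COLS x ROWS tile this lane holds element (r, c),
    // which is input element (c, r), owned by lane c * COLS + r.
    const int r = lane / ROWS;
    const int c = lane % ROWS;
    const int src = c * COLS + r;

    // Read v directly from the register of lane src; no shared memory is used.
    v = __shfl_sync(0xffffffffu, v, src);
    out[lane] = v;
}

int main()
{
    float h_in[32], h_out[32];
    for (int i = 0; i < 32; ++i) h_in[i] = static_cast<float>(i);

    float *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, sizeof(h_in));
    cudaMalloc(&d_out, sizeof(h_out));
    cudaMemcpy(d_in, h_in, sizeof(h_in), cudaMemcpyHostToDevice);

    warpTranspose<<<1, 32>>>(d_in, d_out);
    cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);

    // Check on the host: out(r, c) must equal in(c, r).
    bool ok = true;
    for (int r = 0; r < COLS; ++r)
        for (int c = 0; c < ROWS; ++c)
            if (h_out[r * ROWS + c] != h_in[c * COLS + r]) ok = false;

    printf("%s\n", ok ? "transpose ok" : "transpose FAILED");
    cudaFree(d_in);
    cudaFree(d_out);
    return ok ? 0 : 1;
}
```

A transpose inside an FFT would usually hold several elements per thread. In that setting the register indices have to be resolved at compile time, for example by full unrolling, if the data is to stay out of local memory.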

Chair/Author Details:

Carlo del Mundo - Virginia Tech

