February 12, 2021
With the rise in AI/ML and scientific computing GPGPUs have become a key-ingredient in modern computing. GPGPUs use the Single-Instruction-Multiple-Thread (SIMT) execution model where a group of threads—wavefront or warp—execute instructions in lockstep. When threads in a group encounter a branching instruction, not all threads in the group take the same path, and this is known as control-flow divergence. Control-flow divergence causes performance degradation because each path of a branch needs to be executed serially. There have been multiple attempts to address this issue mainly by using various architectural enhancements. We observed that some kernels with control flow divergence have many similar instructions in the divergent paths. This can be exploited to reduce divergent control-flow by executing these similar instructions from different basic blocks together. In this work we use compiler transformations to merge divergent control-flow with similar instruction sequences to reduce control-flow divergence. We show that this transformation can reduce the performance hit caused by idle threads formed due to the divergence in control flow.