October 5, 2018
A fundamental challenge in automated reasoning about programming assignments at scale is clustering student submissions based on their underlying algorithms. State-of-the-art clustering techniques are sensitive to control structure variations, cannot cluster buggy solutions with similar correct solutions, and require expensive pair-wise program analyses. We propose a novel technique that can cluster assignments based on their algorithmic essence: (1) how the input space is partitioned into equivalence classes and (2) how the problem is uniquely addressed within individual equivalence classes. To capture (1), we leverage model counting to identify the number of inputs belonging to an input equivalence class. To compute (2), we extract the number of occurrences of a unique pair of consecutive values of a variable during its lifetime. A program embedding is constructed as a vector of these features. Programs are then clustered using the vector representations. The evaluation of our tool SemCluster on 9 real-world programming problems from CodeChef with 3,135 submissions shows that SemCluster generates 4-10 clusters, with each cluster precisely representing a unique algorithm to address a problem. In contrast, other prominent tools generate 27-125 clusters. We further demonstrate that SemCluster can be used to substantially improve the performance of an existing automated feedback generation system on incorrect solutions (with an average feedback accuracy of 84.17% when using a reference program from the same cluster and only 7.73% when using reference programs outside the same cluster).