December 4, 2020
Common real-world file formats, such as PDF, are written and read by programs with different origins and implementations. Even without bugs, this variety often leads to multiple interpretations of the underlying specification, creating an implicit, de facto format, which can create vulnerabilities in programs handling the format’s data. One mechanism for understanding the differences between the specification and the de facto format is grammar inference. However, most grammar inference efforts are targeted at natural language. In this talk, we’ll go over some state-of-the-art methods for inferring grammars and generating parsers from data, and proceed to how they might apply to data formats. Finally, we’ll discuss reinforcement learning, and how leveraging the ability to search through policy space might be a better fit for inferring data grammars than previous methods. Understanding the intricacies of how different programs interpret data, despite following the same specification, can help with writing more secure, less ambiguous software going forward, and aid in the detection of deliberately malicious payloads
About Walt Woods (Galois)
Dr. Woods joined Galois in 2019, and focuses on machine learning research spanning the development of both fundamental algorithms and accessible software tooling. Dr. Woods earned his Ph. D. from Portland State University in 2019 for the conception and development of adversarial explanations, which leverage a security loophole in state-of-the-art neural networks to more robustly demonstrate their internal logic. Before pursuing a graduate degree, he was a senior software engineer working with a variety of problem domains including parallel processing, productivity tool development, and user-friendly API and programming language design.