August 31, 2023
Vicious cycles are a major threat to the reliability of distributed software systems. A vicious cycle arises when an event in a system's execution degrades the system and that degradation, in turn, causes more such events. Because they are self-reinforcing, vicious cycles often result in large-scale cloud outages that are hard to recover from.
This paper formally defines the vicious cycle and conducts the first in-depth study of 33 real-world vicious cycles in 13 widely used open-source distributed software systems. The study sheds light on the root causes, triggering conditions, and fixing strategies of vicious cycles, and derives over a dozen concrete implications for combating them. Our findings show that the majority of the vicious cycles are caused by incorrect error handlers that do not obtain enough information to distinguish between 1) an error induced by incoming requests and 2) an error induced by unexpected interference from another error handler.
This paper further performs a feasibility study by 1) building a monitoring tool that prevents one type of vicious cycle by collecting information to make more informed error-handling decisions, and 2) investigating how effective one commonly suggested practice, injecting exponential backoff, is at preventing vicious cycles induced by unconstrained retries.
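For readers unfamiliar with the practice named above, the following is a minimal sketch of capped exponential backoff with jitter around a retried operation. It is a generic illustration, not the paper's tool or experimental setup. The names `backoff_delays` and `retry_with_backoff`, and all parameter defaults, are assumptions made for this example.

```python
import random
import time


def backoff_delays(base=0.1, factor=2.0, cap=5.0, max_retries=5):
    """Return the capped exponential delay (in seconds) before each retry."""
    return [min(cap, base * factor ** attempt) for attempt in range(max_retries)]


def retry_with_backoff(operation, base=0.1, factor=2.0, cap=5.0,
                       max_retries=5, sleep=time.sleep, jitter=random.random):
    """Call `operation` until it succeeds or `max_retries` retries are used.

    Each wait is the capped exponential delay scaled by `jitter()`, a value
    in [0, 1), so that many clients do not retry in lockstep.
    Assumption: every exception is treated as retryable, which is a
    simplification; a real handler would inspect the cause of the error.
    """
    delays = backoff_delays(base, factor, cap, max_retries)
    for attempt in range(max_retries + 1):
        try:
            return operation()
        except Exception:
            if attempt == max_retries:
                raise  # retry budget exhausted: surface the error
            sleep(delays[attempt] * jitter())
```

The cap and the retry budget bound how much extra load a failing caller can add, in contrast to unconstrained retry. Whether this is enough to break a given cycle is the question the feasibility study examines.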
About Shangshu Qian
Shangshu Qian is a Ph.D. student working with Prof. Lin Tan and Prof. Yongle Zhang in the Department of Computer Science of Purdue University.