System Accidents: Understanding Common Failures
Complex systems fail. Component failures are a normal occurrence, and how the system responds to them determines whether it remains resilient or suffers an incident. This is true not just of information systems but also of industrial and mechanical systems. We can learn from the experience of these systems which themes recur, and how best to build our software factories to minimize the effect of component failures and prevent incidents from becoming accidents.
This talk walks through the issues and experiences of complex systems and identifies the failure points they share. With a focus on complexity, tight coupling, operator error, and transitions, we can learn from the industrial and system accidents that preceded us as we work to build new, modern, efficient, and safe software and system designs. The talk draws on both industrial and software examples of how things can go wrong and the failure points common to them. By the end, attendees will understand which decisions increase the chance of failure and which decisions and designs reduce it.