Crossair Flight 498 from Zurich to Dresden crashed just two minutes after takeoff on 10 January 2000. Accident investigators struggled to find the cause of the crash until they re-examined the pilots’ comments in the cockpit. They determined that the highly experienced pilot, Captain Pavel Gruzin, had made a right turn when he thought he was turning left.
How could a qualified pilot lose spatial orientation in this way? Investigators went to Russia, where Captain Gruzin had trained, to ask Russian aviation experts whether they could offer any clues. The experts told the Swiss investigative team about a troubling group of accidents in which Russian pilots had become confused by display systems. Pilots in the former Soviet Union were trained to fly using an artificial horizon that looks different from those used in Western planes. In the West, the airplane symbol remains stable while the background moves. On Soviet planes, however, the airplane symbol moves while the horizon stays in place. Thus a left turn on a Soviet system looks very similar to a right turn on a Western display.
It appears that during the climb after takeoff (in which the autopilot was not switched on), Captain Gruzin took the aircraft into a steep dive to the right, thinking he was actually turning sharply left and leveling the plane. Re-examining the black box recordings showed that he did so even though the co-pilot, Rastislav Kolesár, told him repeatedly that he was turning in the wrong direction. Under stress, Captain Gruzin likely fell back on what he first learned on becoming a pilot and chose the wrong direction. This is the point most relevant to our work at Tesuto. When faced with a stressful situation, the pilot resorted to a reaction pattern he had learned earlier, that is, a heuristic, which led him to ignore his confusion and put the plane into an unrecoverable downward spiral.
Sometimes decisions are made carefully, born out of painstaking thought and consideration. Heuristics, by contrast, are rules of thumb that we develop based on our past experiences: cognitive tools that help us make quick decisions or judgments. Heuristics aren’t about making the perfect decision, just about making one quickly. They can easily lead us to mistaken conclusions, or as Nobel Prize winner Daniel Kahneman phrases it in his 2011 book, Thinking, Fast and Slow, “we can be blind to the obvious, and we are also blind to our blindness… This is the essence of intuitive heuristics: when faced with a difficult question, we often answer an easier one instead, usually without noticing the substitution.”
Network engineers, like pilots, work across multiple systems and are continually confronted by new technologies introduced across the networks they are responsible for managing. Like the Captain, who trained on one system and then had to learn a whole new technology, the network engineer has to learn new vendors and new systems to keep up with the continuous churn of updates, new deployments and other necessary changes, and this can be overwhelming.
Network infrastructure fails for multiple reasons. As detailed in the 2016 paper ‘Evolve or Die: High-Availability Design Principles Drawn from Google’s Network Infrastructure’, by Govindan et al., maintaining availability for content providers is challenging on a number of levels: scale, network evolution, and complexity. Even small changes or failures can have significant negative outcomes. The Google team notes that, “Our findings suggest that, as networks become more complicated, failures lurk everywhere…”
Network evolution is continuous due to constantly increasing traffic demand and the rollout of new services. According to a paper Cisco put out in June 2017 on ‘The Zettabyte Era’, global IP traffic has increased fivefold over the last five years and is expected to increase nearly threefold over the next five years. This necessitates a rapid evolution in network hardware and software at the major ISPs. The pace of traffic growth is compounded by the growing number of products that network operators offer their users. As a result, the network’s hardware or software is updated frequently, sometimes even daily, which can further accentuate network fragility.
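To put those multiples in yearly terms, here is a small arithmetic sketch. It assumes growth compounds at a constant annual rate, which is a simplification, and it treats "nearly threefold" as exactly 3x, so the forecast figure is an upper bound. The function name `annual_growth_rate` is ours, chosen only for illustration.

```python
def annual_growth_rate(multiple: float, years: int) -> float:
    """Constant yearly growth rate that compounds to `multiple` over `years`."""
    return multiple ** (1 / years) - 1


# Figures quoted above: 5x over the last five years, nearly 3x over the next five.
past = annual_growth_rate(5, 5)
forecast = annual_growth_rate(3, 5)  # assumes exactly 3x, so an upper bound

print(f"last five years: {past:.1%} per year")
print(f"next five years: {forecast:.1%} per year (upper bound)")
```

Under these assumptions, that works out to roughly 38% a year historically and just under 25% a year going forward.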
Managing network evolution is complex. Operations involve multiple steps and can take hours or days to complete, and new hardware or software can introduce unexpected bugs into a system and disrupt the existing software. These kinds of impact are hard to predict and quantify in such a large and complex environment. Furthermore, network engineers need to work with low-level abstractions over a sustained period of time, which can easily lead to small mistakes that inadvertently affect a large part of the network and bring down network availability.
As with pilots, network engineers need simulators or emulators to practice on, so they can test their automation before they implement it in the production network.
This is where Tesuto’s network emulator comes in.
If you want to test a new configuration in your network, you can pre-load the topology, devices and stable configuration in Tesuto. Our network emulation system can then apply the intended changes and validate them in our network first, so that everything runs smoothly when you roll them out to production.
Emulation is not solely about testing manual changes before implementation, however. It is also a test of the automation. Even if a network generates its new configs automatically, Tesuto can check that: (a) the new config is working as expected; (b) the automation that makes the production changes is also working as planned; and (c) the transition from the stable config to the new one is not disruptive.
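The three checks above can be sketched in a few lines of Python. This is an illustration only, not Tesuto's API: the `Snapshot` type, the function names and the toy data are all hypothetical, and reachability between device pairs stands in for whatever "working as expected" means in your network.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    """Hypothetical view of the emulated network at one instant."""
    reachable: frozenset  # (source, destination) pairs that can exchange traffic


def config_works(final: Snapshot, expected: set) -> bool:
    """(a) After the change, every expected pair is reachable."""
    return expected <= final.reachable


def automation_works(pushed: dict, intended: dict) -> bool:
    """(b) The configs the automation pushed match the intended ones."""
    return pushed == intended


def transition_is_safe(timeline: list, must_stay_up: set) -> bool:
    """(c) Critical pairs stay reachable in every intermediate snapshot."""
    return all(must_stay_up <= snap.reachable for snap in timeline)


# Toy data: two snapshots captured while a change rolls out in emulation.
critical = {("edge1", "core1")}
timeline = [
    Snapshot(frozenset({("edge1", "core1")})),
    Snapshot(frozenset({("edge1", "core1"), ("edge2", "core1")})),
]
intended = {"edge2": "interface eth0 up"}
pushed = {"edge2": "interface eth0 up"}

print(config_works(timeline[-1], {("edge1", "core1"), ("edge2", "core1")}))
print(automation_works(pushed, intended))
print(transition_is_safe(timeline, critical))
```

Checking every intermediate snapshot, rather than only the final one, is what separates check (c) from check (a): a change can end in a good state and still drop traffic on the way there.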
There is another school of thought in network validation: formal verification. This involves mathematically modeling a network and looking for configuration errors in the model. However, that approach is not wholly sufficient for catching the complex issues that exist in modern networks. Our friends at Microsoft Research have closely examined this issue as part of their work on their own internal network emulator, CrystalNet. In their recent paper, “CrystalNet: Faithfully Emulating Large Production Networks”, by H. Liu et al., they delve directly into the problems inherent in formal verification.
“These systems [formal verification models] assume an ideal model of device behavior to compute forwarding tables from configuration files. In reality, device behavior is far from ideal.”
You can compare formal verification to flight planner software, into which you can program the specifications of your flight (the plane’s weight, the amount of fuel carried, the route), and it will tell you that you “should” be able to land safely. It does not, however, account for bugs that might be in the software code, pilot error, unusual weather conditions or any other variable not programmed into the planning software.
Tesuto’s network emulation, however, is the equivalent of a flight simulator. It runs on code identical to your production devices so that we can catch software bugs early, and we can factor in human error. We can emulate changing external conditions such as random outages, in the same way that flight simulators can bring in unexpected turbulence to test the response of a plane and its pilot before they take to the air.
So as opposed to formal verification or a flight planner, which tells you that you “should” be able to land safely, Tesuto can assure you that you “will”.