{"id":2834,"date":"2021-04-10T20:27:05","date_gmt":"2021-04-11T03:27:05","guid":{"rendered":"https:\/\/44.203.207.232\/?p=2834"},"modified":"2021-04-11T20:29:20","modified_gmt":"2021-04-12T03:29:20","slug":"configuration-outages-that-we-should-be-learning-from","status":"publish","type":"post","link":"https:\/\/webdev.siff.io\/configuration-outages-that-we-should-be-learning-from\/","title":{"rendered":"Configuration Outages That We Should Be Learning From"},"content":{"rendered":"\n
In this week\u2019s blog, we look back at major infrastructure outages in the news and at the cost of human error and misconfiguration. We will look at the damage, the root causes, and the remedies that SIFF could have provided, in addition to company-instituted best practices, to prevent future outages.<\/p>\n\n\n\n
Outages occur all the time, and unfortunately we have become desensitized to the realities behind each incident. From time to time, without raising too much alarm or causing undue panic, it’s important to put a magnifying glass to what actually befalls the corporations and customers hit by outages. Today, we explore one outage at CenturyLink, a communications and networking company, and its effects.<\/p>\n\n\n\n
I have an elderly parent with cancer, who is barely mobile. The other day, he was lying on his back on the floor of his bedroom, resting and chatting with me while I was visiting him from out of town. When he was done \u201cstretching\u201d and relaxing, he couldn\u2019t lift himself back up, and I needed to assist him. Imagine for a second that this had happened when no one was around. He\u2019d have to call 911. Imagine if over 800 of these desperate calls did not go through. Sadly, this did occur, and much worse. Not only did 866 calls to 911 go undelivered, but 17 million customers across 29 states lacked reliable access to 911. Sources do not tell us the human toll of this outage, but one can only imagine!<\/p>\n\n\n\n
According to the FCC, the 37-hour outage at CenturyLink began on December 27, 2018, and was caused by an equipment failure that was exacerbated by a network configuration error. CenturyLink estimates that more than 12.1 million phone calls on its network were blocked or degraded due to the incident.<\/p>\n\n\n\n
The problems began when “a switching module in CenturyLink’s Denver, Colorado node spontaneously generated four malformed management packets,” the FCC report said. Malformed packets “are usually discarded immediately due to characteristics that indicate that the packets are invalid,” but that didn’t happen in this case. The switching module sent these malformed packets “as network management instructions to a line module,” and the packets “were delivered to all connected nodes,” the FCC said. Each node that received the packet then “retransmitted the packet to all its connected nodes.” The company “identified and removed the module that had generated the malformed packets.” But the outage continued because “the malformed packets continued to replicate and transit the network, generating more packets as they echoed from node to node.”<\/p>\n\n\n\n
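As a rough illustration of why removing the faulty module did not end the storm, the toy simulation below floods a single malformed packet through a small network. The topology, node names, and the flood function are hypothetical and are not taken from the FCC report; the sketch only assumes what the report describes, namely that each receiving node retransmits the packet to all of its connected nodes unless it discards it as invalid.

```python
# Hypothetical four-node topology; this is not CenturyLink's real network.
TOPOLOGY = {
    'denver': ['a', 'b'],
    'a': ['denver', 'b', 'c'],
    'b': ['denver', 'a', 'c'],
    'c': ['a', 'b'],
}


def flood(origin, rounds, discard_malformed):
    '''Return the number of packet copies in flight after each round.

    The origin transmits exactly once, which models the faulty module
    being removed right after it emitted the malformed packet.
    '''
    # Each entry is a node currently holding one copy of the packet.
    in_flight = list(TOPOLOGY[origin])
    history = [len(in_flight)]
    for _ in range(rounds):
        next_round = []
        for node in in_flight:
            if discard_malformed:
                continue  # the receiving node validates and drops the packet
            # Otherwise the node retransmits to every connected node.
            next_round.extend(TOPOLOGY[node])
        in_flight = next_round
        history.append(len(in_flight))
    return history


if __name__ == '__main__':
    print('no validation:', flood('denver', 3, discard_malformed=False))
    print('with validation:', flood('denver', 3, discard_malformed=True))
```

In this sketch the number of copies keeps growing even though the origin never sends again, while discarding invalid packets at the first hop ends the storm immediately.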
After the costly $16M outage, CenturyLink said that it “has taken a variety of steps to help prevent the issue from recurring, including disabling the communication channel these malformed packets traversed during the event, and enhancing network monitoring.” However, the FCC report said that several best practices could have prevented the outage or lessened its negative effects. For example, the FCC said that CenturyLink and other network operators should disable system features that are not in use.<\/p>\n\n\n\n
Source: https:\/\/www.toolbox.com\/tech\/networking\/news\/major-network-outages-in-2020-what-could-have-prevented-them\/<\/a><\/em><\/p>\n\n\n\n What becomes paramount is ensuring that best practices and recommendations are actually implemented and continuously monitored. Anything less is simply documentation sitting on a table or a passing verbal commitment. Proactive steps need to be taken, monitored, and regularly updated. One wouldn’t benefit from a prescribed medication if it’s simply sitting in the medicine cabinet. One needs to actually take the medicine!<\/p>\n\n\n\n If we look back at other recent outages (below), misconfiguration is a constant source of major incidents and outages. Now imagine the number of incidents within an organization that are never publicly visible, and how much time and how many resources are wasted on repeated problems that could be avoided by monitoring configurations for compliance with best practices and recommendations.<\/p>\n\n\n\n
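To make the idea of monitoring for compliance concrete, here is a minimal sketch of a check for the FCC recommendation quoted above: disable system features that are not in use. The device names, feature names, and the unused_features helper are all hypothetical; this is not SIFF's API or any vendor's configuration format, only an assumption about how such an inventory might be represented.

```python
# Hypothetical inventory: the features each device has enabled
# and the features it actually uses.
DEVICES = {
    'node-denver': {
        'enabled': {'mgmt_channel', 'routing'},
        'in_use': {'routing'},
    },
    'node-omaha': {
        'enabled': {'routing'},
        'in_use': {'routing'},
    },
}


def unused_features(devices):
    '''Map each non-compliant device to its enabled-but-unused features.'''
    findings = {}
    for name, config in devices.items():
        # Anything enabled but not in use violates the best practice.
        extra = config['enabled'] - config['in_use']
        if extra:
            findings[name] = sorted(extra)
    return findings


if __name__ == '__main__':
    for device, features in unused_features(DEVICES).items():
        print(device, 'has unused features enabled:', ', '.join(features))
```

Run on a schedule against real configuration snapshots, a check of this kind would turn a written recommendation into something that is continuously verified.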