Digital Monoculture
The Hidden Risks of Standardization

Many companies today rely on the same cloud services and cybersecurity solutions, which has produced a digital monoculture. While this standardization simplifies operations and promotes broad compatibility between computer systems, it also poses risks. When issues arise, they can spread quickly across industries and regions, as the CrowdStrike incident showed.

The interconnected nature of IT infrastructure means that a failure in one component can trigger a domino effect that reaches other parts of the system. As software and networks become more complex, unforeseen interactions and bugs are more likely to occur. Even a small software update can have unintended consequences that spread rapidly throughout the network. This is precisely what was observed: systems came to a standstill before measures could be taken to prevent it.

Microsoft's involvement in the IT outage is a case in point. When Windows computers worldwide crashed with the blue screen of death (BSoD), initial reports pointed towards Microsoft. Indeed, Microsoft acknowledged an outage in its cloud services in the Central United States region, starting at 6 pm Eastern Time on Thursday, July 18, 2024. The disruption affected a subset of customers who rely on Azure services, which are part of Microsoft's cloud platform.

The repercussions of the Azure service interruption were significant, disrupting industries such as airlines, retail, banking and media in the United States and internationally, including in Australia and New Zealand. Several Microsoft 365 services, such as Power BI, Microsoft Fabric and Teams, were also affected. The Azure disruption was also linked to the CrowdStrike update, which affected Microsoft's virtual machines running Windows with the Falcon agent installed.

This incident highlights several lessons for IT management:

  • Diversify IT Resources
    Companies should consider a multi-cloud approach that spreads their IT infrastructure across different cloud service providers. If one provider experiences issues, the others can continue to support critical functions.
  • Isolate Critical Infrastructure from the Public Internet
    Critical infrastructure should not be connected to the public internet. This isolation can prevent external threats from exploiting vulnerabilities. Implementing strong internal network security measures ensures that even if an external network faces issues, the critical infrastructure remains unaffected and operational.
  • Redundancy
    Business continuity planning should include redundancy within IT systems. Backup servers, alternative data centres and failover mechanisms allow a smooth transition to backup systems in case of an outage.
  • Automate Standard IT Procedures
    Automating routine IT tasks helps reduce the human errors that often lead to service disruptions. Automated systems can monitor for issues and take proactive measures to address them before they escalate into significant problems.
  • Ensure Staff Training for Handling Outages
    Training staff on how to respond during outages can be pivotal in managing such situations. This means knowing whom to contact, which steps to take, and how to use the established workflows.
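
As a rough illustration of how diversification, redundancy and automation fit together, here is a minimal Python sketch of an automated failover check. It is only a sketch under stated assumptions: the `Provider` class, the `pick_provider` function, the provider names and the health probes are all hypothetical and stand in for whatever monitoring a real environment uses.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass
class Provider:
    """One deployment of a service, e.g. on a particular cloud provider (hypothetical)."""
    name: str
    is_healthy: Callable[[], bool]  # hypothetical health probe for this deployment


def pick_provider(providers: Sequence[Provider]) -> Optional[Provider]:
    """Return the first provider whose health probe succeeds, in priority order."""
    for provider in providers:
        try:
            if provider.is_healthy():
                return provider
        except Exception:
            # A probe that raises is treated the same as an unhealthy provider.
            continue
    return None  # nothing is healthy: time to follow the outage runbook


def failing_probe() -> bool:
    """Simulates a primary provider whose health check cannot be reached."""
    raise ConnectionError("probe timed out")


# Illustrative priority list: primary first, then the fallbacks.
providers = [
    Provider("primary-cloud", failing_probe),
    Provider("secondary-cloud", lambda: True),
    Provider("on-prem-dc", lambda: True),
]

active = pick_provider(providers)
print(active.name if active else "no healthy provider")
```

In this example the primary provider's probe fails, so traffic would be directed to the secondary provider. The ordered list is a deliberate simplification: it keeps the failover decision easy to reason about, which matters when staff have to follow it under pressure.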

The Severity Potential of IT Outages

While a complete global internet outage is highly improbable thanks to the internet's distributed and decentralized infrastructure, the risk of disruptions larger than the CrowdStrike incident remains. The list of possible causes resembles scenarios from a disaster movie:

  • Damage to Undersea Fibre Optic Cables
    Damage to these cables, whether caused by natural disasters, seismic activity, accidents or intentional sabotage, could significantly disrupt international internet traffic.
  • Coordinated Cyber Attacks
    Advanced attacks aimed at core internet infrastructure, such as root DNS servers or major internet exchange points, have the potential to trigger large-scale outages.
  • Intense Solar Flares
    Solar flares comparable to the Carrington Event of 1859 could damage satellites, power grids and undersea cables, resulting in continent-spanning internet outages that last for extended periods.

Takeaway

While a complete global internet collapse is highly unlikely, the interconnected nature of our digital infrastructure means that significant outages can have widespread and severe impacts. By diversifying IT resources, isolating critical infrastructure, implementing redundancy, automating standard procedures, and training staff for outages, companies can better prepare and safeguard against these risks. Adapting and being prepared are essential to maintaining the resilience of our communication networks.
