Global Internet Outage Triggered by Edgoo Networks – August 23-24, 2024


On August 23, 2024, at 11:38 PM Local Datacentre Time (5:38 PM Eastern Time), a major internet outage began. It lasted until August 24, 2024, at 1:25 AM Local Datacentre Time (7:25 PM Eastern Time on August 23), roughly one hour and 47 minutes in total. The outage originated from a network issue at Edgoo Networks, a Tier 2 Internet Service Provider (ISP), but its consequences affected internet connectivity globally. It impacted multiple service providers, including the Tier 1 ISP Cogent Communications, one of the largest ISPs in the world, which operates in over 251 markets. The cascading effects left millions of users offline and caused significant disruptions across businesses, services, and critical infrastructure.

This post-mortem analysis will explore the key events leading up to the incident, the underlying causes, the impact on the network infrastructure, and the steps taken to resolve the issue. Furthermore, it will highlight lessons learned and steps to improve future resilience.

Incident Timeline and Key Events

The outage began at 11:38 PM Local Datacentre Time when Edgoo Networks experienced a network failure (a network loop) that caused a temporary service disruption. As Edgoo resolved the loop and brought its network back online, a large number of users reconnected simultaneously and were re-announced on the network, producing a flood of traffic.

At its core, a network loop occurs when data packets circulate continuously between devices in a closed path. Without proper filtering or network management, a loop can generate immense amounts of traffic, overwhelming routers and switches. This is exactly what happened in this incident. Edgoo's network loop created a massive influx of traffic that overwhelmed its own systems and, once the loop was terminated, also affected its upstream providers.
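The way a loop multiplies traffic can be illustrated with a toy model. The sketch below is not a reconstruction of Edgoo's topology, which has not been published; it assumes a hypothetical full mesh of four switches that flood every frame out of all ports except the one it arrived on, and the function name `simulate_flood` and the `hop_limit` parameter are invented for illustration.

```python
def simulate_flood(neighbors, start, steps, hop_limit=None):
    """Return the number of frames in flight after each forwarding step."""
    # Each frame in flight is (sender, receiver, hops_travelled).
    in_flight = [(start, peer, 1) for peer in neighbors[start]]
    history = [len(in_flight)]
    for _ in range(steps):
        next_round = []
        for sender, receiver, hops in in_flight:
            if hop_limit is not None and hops >= hop_limit:
                continue  # frame has used up its hop budget and is dropped
            for peer in neighbors[receiver]:
                if peer != sender:  # flood out of every port except ingress
                    next_round.append((receiver, peer, hops + 1))
        in_flight = next_round
        history.append(len(in_flight))
    return history


# Hypothetical topology: four switches, each connected to the other three.
FULL_MESH = {s: [p for p in range(4) if p != s] for s in range(4)}

print(simulate_flood(FULL_MESH, start=0, steps=4))               # [3, 6, 12, 24, 48]
print(simulate_flood(FULL_MESH, start=0, steps=4, hop_limit=3))  # [3, 6, 12, 0, 0]
```

In this model a single flooded frame doubles at every step when nothing breaks the loop, while a hop limit makes the copies die out. Real networks rely on mechanisms such as spanning tree or IP TTL for this purpose; the model only shows why an uncontrolled loop saturates equipment so quickly.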

The problem quickly escalated when this surge of traffic propagated to one of their upstream providers, Cogent Communications, a Tier 1 ISP. Cogent, being a major backbone provider that services a global network, immediately began to feel the strain. As the network struggled to process the overwhelming amount of traffic, Cogent's systems experienced significant delays, which extended the downtime for users attempting to reconnect. The ripple effect of this outage was felt across the internet, as Cogent's inability to handle the influx of data impacted connectivity for businesses and individuals across numerous markets.

An image depicting an influx of BGP announcements at Cogent Communications during the time of the outage and several BGP route leaks occurring at the same time
BGP Statistics, Courtesy of Cloudflare, Inc.

The Role of Edgoo Networks and Cogent Communications

As a Tier 2 ISP, Edgoo Networks relies on upstream providers like Cogent Communications to route traffic to broader parts of the internet. Tier 2 ISPs generally serve as intermediaries, connecting smaller regional ISPs to the larger backbone providers (Tier 1 ISPs). When Edgoo experienced its network disruption, the situation quickly got out of hand: the network loop disconnected so many of its users that their reconnection caused a flood of re-announcements to Cogent.
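To make the reconnection flood concrete, the sketch below compares the peak load when every client reconnects in the same second with the peak when each client waits a random delay first. The client count, the 60-second window, and the function name `peak_load` are illustrative assumptions, not figures from the incident.

```python
import random
from collections import Counter


def peak_load(n_clients, window_seconds, seed=0):
    """Return the largest number of reconnects landing in any one second."""
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    # Each client picks a whole-second delay inside the window.
    # A window of 0 means every client reconnects immediately.
    delays = [rng.randrange(max(1, window_seconds)) for _ in range(n_clients)]
    return max(Counter(delays).values())


everyone_at_once = peak_load(10_000, window_seconds=0)
spread_out = peak_load(10_000, window_seconds=60)
print(everyone_at_once, spread_out)
```

With no delay, all 10,000 hypothetical clients arrive in a single second. Spreading them over a minute brings the peak down to roughly the average of about 167 per second, which is the general idea behind adding jitter to reconnection logic.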

Once this traffic hit Cogent Communications, a Tier 1 ISP, the impact became even more severe. Cogent, being one of the top five ISPs globally, services major markets and is responsible for routing a large portion of the internet’s traffic. The unanticipated flood of reconnections from Edgoo placed Cogent’s systems under immense strain. This event triggered widespread connectivity issues and slowed down the restoration process.

A Tier 1 ISP like Cogent typically has robust systems designed to handle high traffic volumes, but even these systems have limits. The flood of traffic effectively saturated their infrastructure, leading to congestion across multiple points in their network. As a result, it wasn’t just Edgoo’s customers that were impacted. Cogent’s own customers, including major businesses and other ISPs, experienced outages and degradation in service. Given the global reach of Cogent, the outage cascaded across different regions, disrupting services in a wide variety of sectors.

Response and Mitigation Efforts

From the moment the outage began, teams from both Edgoo Networks and Cogent Communications worked diligently to identify the root cause of the issue. Network engineers at Edgoo quickly identified and resolved the network loop, but all previously disconnected users then needed to be re-announced.

Meanwhile, Cogent faced a more complex challenge. The flood of traffic from Edgoo had already overwhelmed multiple nodes in their global network. Restoring service required a coordinated effort across multiple regions to reroute traffic, clear the congestion, and ensure that their systems were prepared to handle the remaining traffic load once Edgoo was fully back online. This process took longer than expected due to the sheer scale of the traffic and the need for thorough testing before the network could be declared fully operational.

Despite the swift response from both ISPs, the outage persisted for nearly two hours. Although the situation was eventually resolved, it highlighted several points of failure in network design, traffic management, and coordination between ISPs.

Impact and Global Effect

The global impact of this outage cannot be overstated. The initial disruption at Edgoo, compounded by Cogent’s service degradation, created a domino effect that extended far beyond their own networks. Customers of TCF Ventures, a company reliant on Edgoo’s services through its upstream provider, were particularly hard-hit. Businesses experienced significant downtime, affecting their ability to process transactions, provide services, and maintain communication with clients.

The outage also had an impact on cloud services, internet hosting providers, and other digital platforms that rely on Cogent’s backbone network. In particular, any company or service that relied heavily on real-time communication or data transmission felt the effects most acutely. Industries ranging from finance to e-commerce, healthcare, and entertainment saw disruptions, with some services offline for hours, even after the initial issue was resolved.

The incident underscores the interconnectedness of the global internet infrastructure and how a failure at one point in the chain can have devastating consequences for a wide range of users.

Conclusion and Next Steps

While the incident was beyond the direct control of TCF Ventures and its direct service providers, it serves as a powerful reminder of the fragility of even the most robust internet infrastructures. ISPs, including Edgoo and Cogent, responded quickly to the issue, but the event itself highlighted areas where improvements are necessary. The primary lesson learned here is the importance of robust traffic management systems and safeguards to prevent network loops and overloads during recovery efforts.

Moving forward, both Edgoo and Cogent will likely need to review their network configurations, response strategies, and inter-provider communication protocols. ISPs must ensure that their systems are better equipped to handle massive influxes of traffic during outage recovery. Additionally, closer cooperation between Tier 2 and Tier 1 ISPs during network disruptions could help mitigate the global effects of such incidents.
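One general way to keep a recovery surge from reaching an upstream provider all at once is to pace it. The sketch below is a minimal token bucket, offered as an illustration of the pacing idea rather than a description of how Edgoo or Cogent operate; the class name, the rate of 100 per second, and the burst of 200 are all assumed values.

```python
class TokenBucket:
    """Admit at most `burst` events at once and `rate_per_second` on average."""

    def __init__(self, rate_per_second, burst):
        self.rate = rate_per_second
        self.capacity = burst
        self.tokens = float(burst)  # start with a full bucket
        self.last = 0.0             # timestamp of the previous call

    def allow(self, now):
        # Refill in proportion to elapsed time, capped at the bucket size.
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # caller should queue or delay this event


bucket = TokenBucket(rate_per_second=100, burst=200)
admitted_at_t0 = sum(bucket.allow(0.0) for _ in range(1000))
admitted_at_t1 = sum(bucket.allow(1.0) for _ in range(1000))
print(admitted_at_t0, admitted_at_t1)  # 200 100
```

Of 1,000 simulated events arriving at once, only the burst allowance of 200 is admitted immediately and the rest must wait for the bucket to refill. Production routers have their own controls for this, such as BGP maximum-prefix limits and route flap damping, so this code only shows the underlying principle.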

TCF Ventures and its partners will continue to monitor the situation closely, ensuring that safeguards are in place to minimize the risk of similar outages in the future. Ultimately, while such outages may be rare, the speed and efficiency of the response can significantly impact their overall severity.
