What Happened: Unintended BGP Route Withdrawals Knocked Services Offline
A major outage on 20 February disrupted global internet traffic after an internal configuration failure at Cloudflare caused the unintended withdrawal of customer BGP routes. The incident lasted just over six hours and left numerous services unreachable; early fears pointed to a possible cyberattack, but the cause turned out to be internal.
At the core of the disruption was the withdrawal of customer routes that normally keep internet traffic flowing to the right networks. When those routes disappeared, many connections were pushed into BGP path hunting rather than stable routing, creating widespread reachability problems across dependent services.
Root Cause: An Internal Update Deleted Over a Thousand BYOIP Prefixes
The outage was traced back to an internal update that led to the systematic deletion of more than a thousand Bring Your Own IP (BYOIP) prefixes. Those prefixes are essential for maintaining customer routing presence, and their removal meant critical bindings for hundreds of customers were no longer in place.
Cloudflare engineers connected the issue to an error in the company’s Addressing API. The error was introduced during an automated cleanup task under the Code Orange resilience programme—an automation-driven effort that, in this case, became a single point of operational risk.
The Addressing API Failure: A Flawed Query With a Dangerous Interpretation
The specific failure mode was blunt and consequential: a flawed query interpreted an empty value as an instruction to delete all returned prefixes. Instead of doing a limited cleanup, the automation removed a large set of prefixes, including those required for customer connectivity.
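To make the failure mode concrete, here is a minimal sketch of how an empty filter value can silently match everything. This is illustrative pseudologic, not Cloudflare's actual Addressing API code; the names and prefix data are invented:

```python
# Hypothetical sketch (not Cloudflare's code): an automated cleanup job
# that treats an empty filter as "match everything" and deletes all prefixes.

PREFIXES = {
    "203.0.113.0/24": "customer-a",
    "198.51.100.0/24": "customer-b",
    "192.0.2.0/24": "stale-entry",
}

def select_for_cleanup(prefixes, owner_filter):
    """Return prefixes whose owner matches the filter.

    Bug: the empty string is a substring of every owner, so an empty
    filter hands the cleanup job the whole table instead of a subset.
    """
    return [p for p, owner in prefixes.items() if owner_filter in owner]

def cleanup(prefixes, owner_filter):
    doomed = select_for_cleanup(prefixes, owner_filter)
    for p in doomed:
        del prefixes[p]
    return doomed

table = dict(PREFIXES)
deleted = cleanup(table, "")  # empty value -> deletes ALL prefixes
```

The bug is not exotic: an empty value that should have been rejected instead satisfied the match condition for every record, turning a scoped cleanup into a mass deletion.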
This kind of failure is especially impactful in global routing contexts because deletions don’t just “degrade” service—they can actively remove the paths the internet uses to reach a destination. Once routes are withdrawn, recovery isn’t simply flipping a switch; it’s rebuilding the routing state reliably and safely.
Why Recovery Took Hours: Different Prefix Severities Required Different Fixes
Restoration took several hours because the withdrawn prefixes varied in severity, requiring different recovery methods instead of a uniform reinstatement process. That detail matters: it suggests the outage didn’t present as one clean, identical failure across all customers, but rather a range of routing impacts that had to be handled with more than a one-size-fits-all rollback.
Dashboard Self-Recovery vs Manual Reconstruction Across the Edge Network
Some users were able to restore connectivity through the dashboard, implying certain bindings or configurations could be re-established through normal customer-facing controls.
Others weren’t so lucky. For those customers, engineers had to carry out manual reconstruction across the edge network. That kind of manual work is slow by nature, and when it’s happening during an active global disruption, it’s also high-risk—because every corrective action must avoid making routing instability worse.
Which Cloudflare Services Were Affected
The outage affected a series of core offerings, including:
- Content delivery
- Security layers
- Dedicated egress
- Network protection services
These are foundational services that many websites and applications rely on for availability and reachability. When they become unstable or unreachable, the ripple effects show up quickly as site and app failures across the internet.
User Impact: Timeouts, Unreachable Apps, and 403 Responses on 1.1.1.1
The incident triggered widespread timeouts on dependent websites and applications, a common symptom when routing becomes inconsistent and traffic can’t reliably find its destination.
It also caused 403 responses on the 1.1.1.1 DNS resolver. That’s a notable user-facing signal because it’s a widely used resolver, and errors there can immediately feel like “the internet is down,” even when the deeper issue is routing reachability and route withdrawals.
Cloudflare’s Stated Next Steps: API Guardrails, Circuit Breakers, and Separation
Cloudflare plans to introduce multiple changes aimed at preventing a repeat of this automation-driven failure:
Stricter API Validation to Prevent Unsafe Deletions
Cloudflare plans stricter API validation—an attempt to ensure that inputs like empty values can’t be interpreted in a way that triggers destructive actions at scale.
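A minimal sketch of what such a guard could look like, assuming invented function and exception names (Cloudflare has not published its implementation): reject empty filters outright, and refuse deletions that would touch an implausibly large share of the table.

```python
# Illustrative validation guard: refuse deletion requests that are
# unbounded (empty filter) or suspiciously broad. Names are hypothetical.

class UnsafeDeletionError(ValueError):
    pass

def validated_delete(prefixes, owner_filter, max_fraction=0.5):
    """Delete matching prefixes only if the request is clearly bounded."""
    if not owner_filter:
        # The empty value that caused the incident is rejected up front.
        raise UnsafeDeletionError("empty filter would match all prefixes")
    doomed = [p for p, owner in prefixes.items() if owner == owner_filter]
    if len(doomed) > max_fraction * len(prefixes):
        raise UnsafeDeletionError(
            f"refusing to delete {len(doomed)} of {len(prefixes)} prefixes")
    for p in doomed:
        del prefixes[p]
    return doomed
```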
Circuit Breakers for Abnormal Deletion Patterns
Circuit breakers are planned for abnormal deletion patterns, a practical safeguard that can stop automation when it behaves in an unexpected, high-impact way (like mass deletions).
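The idea can be sketched as a sliding-window counter that trips once deletions in a short period exceed a threshold. The class, thresholds, and method names below are illustrative assumptions, not Cloudflare's design:

```python
# Sketch of a deletion circuit breaker: halt automation once deletions
# within a time window exceed a limit. Thresholds are hypothetical.

import time
from collections import deque

class DeletionCircuitBreaker:
    def __init__(self, max_deletions=100, window_seconds=60.0):
        self.max_deletions = max_deletions
        self.window = window_seconds
        self.events = deque()   # timestamps of recent deletions
        self.tripped = False

    def record_deletion(self, now=None):
        if self.tripped:
            raise RuntimeError("circuit open: deletions halted")
        now = time.monotonic() if now is None else now
        self.events.append(now)
        # Drop timestamps that have aged out of the window.
        while self.events and now - self.events[0] > self.window:
            self.events.popleft()
        if len(self.events) > self.max_deletions:
            self.tripped = True
            raise RuntimeError("abnormal deletion pattern: circuit tripped")
```

Once tripped, the breaker stays open until an operator resets it—so a runaway job deletes at most one window's worth of records before a human has to intervene.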
Improved Configuration Separation
Cloudflare also plans improved configuration separation, which can help limit the blast radius when internal tooling or automated jobs behave incorrectly.
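One simple way to picture configuration separation, under invented names and without any claim about Cloudflare's internals: give each service its own namespaced store, so even a "delete everything" call issued by a faulty job is bounded to one namespace.

```python
# Illustrative namespaced store: a cleanup job scoped to one namespace
# cannot touch another, bounding the blast radius of a faulty job.

class NamespacedStore:
    def __init__(self):
        self._data = {}  # namespace -> {key: value}

    def put(self, namespace, key, value):
        self._data.setdefault(namespace, {})[key] = value

    def delete_all(self, namespace):
        """Even an unbounded delete is confined to a single namespace."""
        removed = self._data.pop(namespace, {})
        return list(removed)
```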
Why This Incident Matters: Automation Faults Can Become Critical Infrastructure Failures
Cloudflare issued a public apology for a failure that undermined its assurances of network resilience. Beyond the apology, the event reaffirmed the risks posed by internal automation faults when they interact with critical internet infrastructure.
This is the uncomfortable truth: when automation has the authority to change routing-relevant configuration at scale, a small logic mistake can escalate into a global outage—fast. And once BGP routes are withdrawn, the internet doesn’t “heal” instantly; it has to reconverge, and humans often have to rebuild what automation removed.

