
On November 18, 2025, at 11:20 UTC, the internet experienced one of its most significant disruptions in recent history.
Cloudflare, the infrastructure backbone serving millions of websites and applications, suffered a catastrophic outage that affected an estimated 2.4 billion monthly active users across major platforms.
For nearly six hours, users worldwide encountered HTTP 500 errors instead of their favorite services, marking Cloudflare’s worst incident since 2019.
The outage wasn’t caused by a cyberattack or malicious activity. Instead, a seemingly innocuous database permissions change triggered a cascade of failures that brought down some of the internet’s most critical services.
This incident serves as a stark reminder of how centralized our modern internet infrastructure has become, and how a single point of failure can ripple across the digital ecosystem.
Summary
Scale of Impact: 2.4 billion aggregated monthly active users affected across major platforms including ChatGPT (700M MAU), Spotify (713M MAU), X/Twitter (557M MAU), Canva (240M MAU), Discord (200M MAU), Claude (19M MAU), and Figma (13M MAU).
Duration: Core outage lasted from 11:20 UTC to 14:30 UTC (3 hours 10 minutes), with full service restoration at 17:06 UTC (5 hours 46 minutes total).
Root Cause: A ClickHouse database permissions change caused the Bot Management feature file to double in size, exceeding a hardcoded memory limit in Cloudflare’s proxy system.
Financial Impact: Estimated $180-360 million in lost revenue across affected services, based on conservative calculations of hourly revenue loss during peak disruption.
Services Affected: Core CDN, Turnstile, Workers KV, Access, Email Security, plus downstream impacts to ChatGPT, X, Spotify, Canva, Discord, DigitalOcean, Medium, Figma, 1Password, Trello, Namecheap, and dozens more.
The Scale of Disruption: 2.4 Billion Users Affected
When Cloudflare stumbles, the internet feels it. The company’s network sits in front of millions of websites, handling everything from DNS resolution to DDoS protection, content delivery, and security filtering.
The aggregated monthly active user count across affected major services reached approximately 2.4 billion:
- ChatGPT: 700 million MAU - authentication failures and service unavailability
- Spotify: 713 million MAU - streaming interruptions
- X (Twitter): 557 million MAU - over 11,500 problem reports on Downdetector
- Canva: 240 million MAU - design projects inaccessible
- Discord: 200 million MAU - connectivity issues
- Claude: 19 million MAU - service unavailable
- Figma: 13 million MAU - design tools offline
The ripple effects extended to DigitalOcean, Medium, 1Password, Trello, Namecheap, Postman, Vercel, and countless other platforms.
What made this outage particularly frustrating was its intermittent nature during the first few hours. Services would recover briefly, only to fail again minutes later, leading many developers to initially suspect their own infrastructure.

Timeline of the Incident: How It Unfolded
11:05 UTC - Cloudflare deployed a database access control change to their ClickHouse cluster to improve distributed query security.
11:28 UTC - First HTTP errors appeared on customer traffic.
11:31 UTC - Automated monitoring detected the issue; manual investigation began.
11:35 UTC - Incident call created, engineering teams assembled.
11:32-13:05 UTC - Engineers attempted various mitigations for the Workers KV service. Initial symptoms were misleading, and Cloudflare’s status page also went down, leading some to suspect a DDoS attack.
13:05 UTC - Bypasses implemented for Workers KV and Cloudflare Access, falling back to prior proxy version.
13:37 UTC - Root cause identified: Bot Management configuration file doubled in size, exceeding the hardcoded 200-feature limit.
14:24 UTC - Automatic deployment of new Bot Management configuration files stopped.
14:30 UTC - Previous configuration file deployed globally; most services began operating normally.
17:06 UTC - Full service restoration confirmed across all Cloudflare services.
Technical Deep Dive: What Actually Broke
Request Flow:
- HTTP and TLS layer
- Core proxy system (FL/Frontline)
- Bot Management assigns bot scores
- Pingora for cache lookups or origin fetches
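To make that flow concrete, here is a minimal Rust sketch of a request passing through those stages. Every type, function name, and the 0-99 scoring scale are illustrative assumptions for this article, not Cloudflare’s actual FL/FL2 or Pingora internals.

```rust
// Illustrative model of the request path described above; none of these
// types correspond to Cloudflare's real proxy code.

struct Request {
    path: String,
}

struct BotVerdict {
    score: u8, // hypothetical 0-99 scale: 0 = almost certainly a bot
}

// HTTP and TLS layer: decrypt and parse the incoming request.
fn terminate_tls(req: Request) -> Request {
    req
}

// Bot Management: evaluate the ML feature file and assign a bot score.
fn bot_management_score(_req: &Request) -> BotVerdict {
    BotVerdict { score: 85 }
}

// Core proxy (FL/FL2): apply customer configuration and consult Bot Management.
fn core_proxy(req: Request) -> (Request, BotVerdict) {
    let verdict = bot_management_score(&req);
    (req, verdict)
}

// Pingora: serve from cache or fetch from the origin, unless bot rules block it.
fn cache_or_origin(req: Request, verdict: BotVerdict) -> String {
    if verdict.score == 0 {
        return format!("403 blocked by bot rule: {}", req.path);
    }
    format!("200 OK: {}", req.path)
}

fn main() {
    let req = terminate_tls(Request { path: "/index.html".into() });
    let (req, verdict) = core_proxy(req);
    println!("{}", cache_or_origin(req, verdict));
}
```

The failure described below lives in the Bot Management step: when that stage cannot load its feature file, everything behind it returns errors instead of content.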
The Feature File Problem
The Bot Management ML model uses a “feature” configuration file refreshed every few minutes and distributed globally. Features are individual traits used to predict whether a request is automated or human-generated.
What Changed:
The database permissions change at 11:05 UTC altered how ClickHouse returned metadata. The query began returning columns from both the “default” database and the underlying “r0” database, effectively more than doubling the number of features in the configuration file.
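To illustrate the mechanism, here is a hedged Rust sketch (the database and column names are stand-ins, not Cloudflare’s actual schema): when a metadata query starts reporting each column once for “default” and once more for the underlying “r0” database, a feature list built naively from the rows doubles, while deduplicating on the column name keeps it stable.

```rust
use std::collections::BTreeSet;

// One row of column metadata as a (database, column_name) pair. Before the
// permissions change the query saw only "default"; afterwards it also saw
// the same columns again via the underlying "r0" database.
fn feature_names(rows: &[(&str, &str)]) -> Vec<String> {
    rows.iter().map(|(_, col)| col.to_string()).collect()
}

fn deduplicated_feature_names(rows: &[(&str, &str)]) -> Vec<String> {
    let unique: BTreeSet<&str> = rows.iter().map(|(_, col)| *col).collect();
    unique.into_iter().map(String::from).collect()
}

fn main() {
    let rows = [
        ("default", "feature_a"),
        ("default", "feature_b"),
        ("r0", "feature_a"), // same columns, surfaced a second time
        ("r0", "feature_b"),
    ];
    // Naive build: 4 "features" from 2 real ones.
    assert_eq!(feature_names(&rows).len(), 4);
    // Deduplicated build: back to the real 2.
    assert_eq!(deduplicated_feature_names(&rows).len(), 2);
    println!(
        "naive={}, deduplicated={}",
        feature_names(&rows).len(),
        deduplicated_feature_names(&rows).len()
    );
}
```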
The Critical Failure Point
The Bot Management system had a hardcoded limit of 200 features (normal usage: ~60 features). When the oversized file with more than 200 features was propagated, the FL2 Rust code called Result::unwrap() on an error value, causing an unhandled panic.
Impact:
- FL2 proxy: HTTP 500 errors for users
- FL proxy (older): Bot scores set to zero, causing false positives for bot-blocking rules
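Cloudflare has not published the exact code path, but the pattern is easy to reproduce in miniature. In the sketch below (all names and the 240-feature count are invented for illustration), loading a file that exceeds the hardcoded limit returns an Err, and calling Result::unwrap() on that Err aborts the process instead of degrading gracefully.

```rust
const MAX_FEATURES: usize = 200; // hardcoded cap, as described in the incident report
                                 // (normal usage was around 60 features)

#[derive(Debug)]
struct FeatureFile {
    features: Vec<String>,
}

#[derive(Debug)]
enum ConfigError {
    TooManyFeatures { got: usize, limit: usize },
}

fn load_feature_file(features: Vec<String>) -> Result<FeatureFile, ConfigError> {
    if features.len() > MAX_FEATURES {
        return Err(ConfigError::TooManyFeatures { got: features.len(), limit: MAX_FEATURES });
    }
    Ok(FeatureFile { features })
}

fn main() {
    // An oversized file: roughly double the expected feature count (invented figure).
    let oversized: Vec<String> = (0..240).map(|i| format!("feature_{i}")).collect();

    // Calling unwrap() on the Err value panics and takes the worker down.
    let file = load_feature_file(oversized).unwrap(); // panics here
    println!("loaded {} features", file.features.len());
}
```

Run as written, the program panics with the TooManyFeatures error; in a proxy worker, that panic is what surfaces to users as HTTP 500s.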
Why It Was Intermittent
The feature file was regenerated every five minutes by querying the ClickHouse cluster, which was being gradually updated. Bad data was only generated if the query ran on an updated node. Every five minutes, there was a chance of either a good or bad file being generated. Eventually, every ClickHouse node was updated, and the outage stabilized in the failing state.
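A rough way to see the flicker, using an invented cluster size since the real node count is not public: the chance that any given five-minute regeneration produces a bad file is simply the fraction of ClickHouse nodes already updated, as this toy Rust model shows.

```rust
// Toy model of the gradual rollout: the feature file is regenerated every
// five minutes by a query that lands on one ClickHouse node at random.
// Only updated nodes produce the oversized (bad) file.
fn chance_of_bad_file(updated_nodes: u32, total_nodes: u32) -> f64 {
    updated_nodes as f64 / total_nodes as f64
}

fn main() {
    let total_nodes = 20; // invented cluster size, for illustration only
    for updated in [0, 5, 10, 15, 20] {
        println!(
            "{updated}/{total_nodes} nodes updated -> {:.0}% chance the next file is bad",
            100.0 * chance_of_bad_file(updated, total_nodes)
        );
    }
    // Once every node is updated, every regeneration produces a bad file,
    // which is why the outage eventually stabilized in the failing state.
}
```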
Financial Impact: Counting the Cost
The financial impact of a nearly six-hour outage affecting 2.4 billion users is staggering. Conservative estimates suggest direct revenue loss of $180-360 million across major affected services.
Indirect Costs:
- Customer support expenses
- Engineering incident response time
- SLA credits and compensation
- Customer churn
- Reputational damage
For Cloudflare itself, SLA credits and enterprise compensation likely cost tens of millions of dollars.
The broader impact extends to millions of smaller businesses: e-commerce sites lost sales, SaaS applications couldn’t serve customers, and content creators couldn’t reach audiences. The aggregate impact across hundreds of thousands of affected sites likely added hundreds of millions more to the total cost.
Affected Services: The Ripple Effect
Core Cloudflare Services
CDN and Security: HTTP 500 errors instead of content for millions of websites
Turnstile: CAPTCHA alternative failed, breaking authentication flows including Cloudflare’s own dashboard login
Workers KV: Elevated HTTP 500 error rates affected edge computing applications and internal Cloudflare services
Cloudflare Access: Zero trust authentication failures from incident start until 13:05 UTC
Email Security: Processing continued, but temporary loss of IP reputation source reduced spam detection accuracy
Downstream Service Impacts
Consumer Services:
- ChatGPT (700M MAU): Largely inaccessible
- X/Twitter (557M MAU): Struggled to load
- Spotify (713M MAU): Streaming unavailable
- Canva (240M MAU): Design projects inaccessible
- Discord (200M MAU): Communities went silent
Developer Tools:
- Figma (13M MAU): design tools offline
- DigitalOcean, Vercel, Postman: intermittent errors and outages
- Medium, 1Password, Trello, Namecheap: service disruptions
Global Impact
Cloudflare operates data centers in over 300 cities worldwide. The outage affected all regions simultaneously—North America, Europe, Asia, Africa, and South America—highlighting the systemic nature of the problem.
Cloudflare’s Response and Recovery
13:05 UTC - Engineers implemented bypasses for Workers KV and Cloudflare Access, falling back to an earlier proxy version to reduce customer impact.
13:37 UTC - Bot Management configuration file identified as the trigger.
14:24 UTC - Automatic deployment of new Bot Management files stopped.
14:30 UTC - Previous configuration file deployed globally, forcing proxy restart across entire network.
17:06 UTC - Full service restoration confirmed after six hours.
Cloudflare CEO Matthew Prince issued a public apology: “An outage like today is unacceptable. We’ve architected our systems to be highly resilient to failure to ensure traffic will always continue to flow.”
Lessons Learned and Prevention Measures
Cloudflare outlined specific remediation steps in their post-mortem:
1. Configuration File Validation: Validate all configuration files, even internally generated ones, with size limits and format validation enforced at multiple layers.
2. Improved Error Handling: Replace Result::unwrap() with proper error handling that logs errors, falls back to known-good configurations, and continues serving traffic (see the sketch after this list).
3. Global Kill Switches: Allow engineers to instantly disable features across the entire network without a code deployment, cutting recovery time from hours to minutes.
4. Observability System Redesign: Prevent debugging systems from consuming excessive CPU during incidents.
5. Memory Allocation Strategy: Replace hard 200-feature limit with dynamic allocation and soft limits that trigger warnings rather than failures.
6. Database Change Testing: Require comprehensive testing for schema and permission changes, including validation of all affected queries.
7. Configuration Monitoring: Track file sizes, generation times, and distribution patterns to detect anomalies before production deployment.
8. Gradual Rollout Strategy: Develop better strategies for infrastructure changes, including canary deployments with automated validation.
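As a rough illustration of how items 1, 2, and 5 could fit together, the Rust sketch below validates an incoming feature file, treats the feature cap as a soft limit that only logs a warning, and falls back to the last known-good configuration instead of calling unwrap(). This is an assumption about the general shape of such a guard, not Cloudflare’s actual remediation code.

```rust
const SOFT_FEATURE_LIMIT: usize = 200; // warn past this point rather than failing hard

#[derive(Clone, Debug)]
struct FeatureFile {
    features: Vec<String>,
}

// Validation layer: reject files that are clearly broken, warn on anomalies.
fn validate(candidate: FeatureFile) -> Result<FeatureFile, String> {
    if candidate.features.is_empty() {
        return Err("feature file is empty".into());
    }
    if candidate.features.len() > SOFT_FEATURE_LIMIT {
        // Soft limit: surface the anomaly without turning it into a panic.
        eprintln!(
            "warning: feature file has {} features (soft limit {})",
            candidate.features.len(),
            SOFT_FEATURE_LIMIT
        );
    }
    Ok(candidate)
}

// Apply a new file if it validates; otherwise keep the last known-good one.
fn apply_or_fallback(current: &FeatureFile, candidate: FeatureFile) -> FeatureFile {
    match validate(candidate) {
        Ok(file) => file,
        Err(reason) => {
            eprintln!("rejecting new feature file ({reason}); keeping previous configuration");
            current.clone()
        }
    }
}

fn main() {
    let known_good = FeatureFile { features: (0..60).map(|i| format!("feature_{i}")).collect() };
    let broken = FeatureFile { features: Vec::new() };
    let active = apply_or_fallback(&known_good, broken);
    println!("serving traffic with {} features", active.features.len());
}
```

The key design choice is that a bad candidate file never reaches the data path: the proxy keeps serving with whatever configuration last validated.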
The Broader Implications for Internet Infrastructure
This outage raises important questions about centralization and systemic risk.
Cloudflare’s Scale:
- Serves over 20% of all websites
- Processes more than 50 million HTTP requests per second
- Handles DNS queries for millions of domains
Key Takeaways
Single Point of Failure: ChatGPT, Spotify, X, and Canva, operated by completely different companies, all failed simultaneously due to a shared Cloudflare dependency.
Multi-Cloud Strategies: Organizations need failover capabilities and parallel infrastructure with alternative CDN providers, though this adds cost and complexity.
Graceful Degradation: Services should detect failures and fall back to direct connections or alternative paths rather than hard failing.
Regulatory Implications: Incidents like this may accelerate discussions about critical infrastructure designation and oversight for major internet service providers.
Technical Lessons: The unhandled panic when a configuration file exceeded expected size represents a common pattern. Building resilient systems requires anticipating the unexpected and designing for failure modes.
Comparing to Previous Major Outages
Cloudflare July 2019: Bad WAF regex caused CPU exhaustion. Duration: 27 minutes vs. six hours in 2025.
Cloudflare June 2025: Google Cloud infrastructure failure. The November incident was self-inflicted, making it more concerning.
AWS December 2021: US-EAST-1 outage lasted 7+ hours, affecting Netflix, Disney+, and Robinhood. Cause: internal automation error.
Facebook October 2021: 6-hour BGP routing error disconnected data centers, affecting billions across Facebook, Instagram, WhatsApp, and Oculus.
Fastly June 2021: Less than 1 hour, affecting Amazon, Reddit, and Twitch. The rapid recovery highlights the importance of quick diagnosis.
What Makes This Unique
The November 2025 Cloudflare outage combined duration, scope, and user impact like few others. The 2.4 billion affected users represent a significant portion of the global internet-using population, making this one of the most impactful outages in internet history.
What This Means for Cloudflare Customers
For organizations relying on Cloudflare, this six-hour outage demands reassessment of risk management strategies.
Key Considerations
Continue Using Cloudflare?
- Non-critical apps: Cloudflare’s performance, features, and pricing remain compelling
- Mission-critical apps: Implement redundancy to mitigate downtime risk
Multi-CDN Strategies: Configure DNS to distribute traffic across multiple CDN providers for automatic failover. Tools like Cedexis and NS1 specialize in intelligent traffic management.
Graceful Degradation: Design applications to function when Cloudflare is unavailable by falling back to direct origin connections, serving cached content, or displaying informative error messages.
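As a small example of that idea, the sketch below assumes the reqwest crate (with its "blocking" feature enabled) and placeholder URLs: it tries the CDN-fronted hostname first and falls back to the origin when the CDN request fails or comes back with an error status.

```rust
// A minimal fallback sketch, assuming `reqwest` with the "blocking" feature.
// The URLs are placeholders, not real endpoints: the point is simply to try
// the CDN first and fall back to the origin (or a cached/static response).
use std::time::Duration;

fn fetch_with_fallback(cdn_url: &str, origin_url: &str) -> Result<String, reqwest::Error> {
    let client = reqwest::blocking::Client::builder()
        .timeout(Duration::from_secs(5))
        .build()?;

    // First attempt: the CDN-fronted hostname.
    if let Ok(resp) = client.get(cdn_url).send() {
        if resp.status().is_success() {
            return resp.text();
        }
        eprintln!("CDN returned {}, falling back to origin", resp.status());
    }

    // Fallback: hit the origin directly (or serve a cached/static page instead).
    client.get(origin_url).send()?.text()
}

fn main() {
    match fetch_with_fallback("https://cdn.example.com/page", "https://origin.example.com/page") {
        Ok(body) => println!("served {} bytes", body.len()),
        Err(err) => eprintln!("both paths failed, show a friendly error page: {err}"),
    }
}
```

In practice the fallback path also needs caching and rate limiting so the origin is not overwhelmed the moment the CDN disappears.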
SLA Review: Verify if this outage triggered SLA credits (typically 99.9% to 100% uptime commitments) and evaluate whether SLAs adequately reflect business impact.
Independent Monitoring: Use third-party services like Pingdom, UptimeRobot, or Datadog to provide early warning independent of Cloudflare’s dashboard.
Conclusion
The November 18, 2025 Cloudflare outage, triggered by a faulty Bot Management configuration file that exceeded memory limits, disrupted services for 2.4 billion users and caused massive economic losses. A combination of unhandled Rust errors, missing validation, and the lack of fast kill switches turned a routine change into Cloudflare’s longest outage since 2019, lasting nearly six hours. The incident highlights critical lessons: Cloudflare needs stronger diagnostics, failover systems, and emergency controls; the tech industry must reduce systemic risk through redundancy and architectural diversity; and users are reminded how much the modern internet depends on a few key providers. While Cloudflare is now improving validation, error handling, global kill switches, and testing, the event will remain a major reminder of how challenging it is to build a resilient internet for billions.