Error code 500: The Cloudflare Global Network Outage


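November 18, 2025 (Beijing time) was a rough day for the global Internet. At around 7:20 PM, Cloudflare, a key piece of Internet infrastructure, suffered its worst global outage since 2019, leaving a large number of websites and online services unreachable. Core services were largely restored about 3 hours after the incident began, and all systems were fully back to normal at 1:06 AM the next day. In total, the incident lasted about 6 hours.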
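As a quick sanity check on the "about 6 hours" figure, here is a minimal sketch that subtracts the two Beijing-time timestamps given above. The variable names are mine, not from any Cloudflare tooling.

```python
from datetime import datetime, timedelta, timezone

# Beijing time is a fixed UTC+8 offset with no daylight saving time.
BEIJING = timezone(timedelta(hours=8))

# Start and end times as stated in the paragraph above (Beijing time).
outage_start = datetime(2025, 11, 18, 19, 20, tzinfo=BEIJING)
all_clear = datetime(2025, 11, 19, 1, 6, tzinfo=BEIJING)

total = all_clear - outage_start
print(total)  # 5:46:00
```

So the precise span is 5 hours 46 minutes, which rounds to the 6 hours quoted above.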

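However, as the saying goes, "when the nest is overturned, no egg is left unbroken": my little site went down with it. Sad 😢 (the time in the screenshot converts to 2025-11-18 19:54:37 Beijing time)

At the time, Cloudflare's status page read "Cloudflare Global Network experiencing issues" 👇

Even Downdetector, the site people use to check whether other sites are down, briefly went down itself because it relies on Cloudflare.

Oh, and so did Cloudflare's own dashboard.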

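🔍How the incident unfolded#

🔧A quiet change before the storm#

At 7:05 PM Beijing time that day, Cloudflare engineers made a permissions change to their ClickHouse database cluster. The change was meant to improve the security and reliability of distributed queries by making them run under the initial user's account, allowing finer-grained access control. At the time, nobody realized what trouble this well-intentioned adjustment was setting up.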

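💥The crash begins#

Just 15 minutes later, at 7:20 PM Beijing time, disaster arrived right on schedule. Cloudflare's global network began failing badly, and users trying to reach sites that depend on Cloudflare saw nothing but a cold HTTP 5xx error page.

The problem traced back to a key file used by Cloudflare's Bot Management system: the "feature file". It works like an "identification handbook": it is regenerated automatically every five minutes and pushed quickly to every server worldwide to help the system recognize malicious bot traffic.

However, the database permissions change made at 7:05 PM subtly altered the behavior of the query that generates this file. When fetching its data, the query unexpectedly returned a large number of duplicate "feature" columns, so a configuration file that should have stayed a stable size doubled in size.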
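To make the duplication concrete, here is a minimal sketch of one way a permissions change can double a query's output. It assumes the generating query filtered on table name only, so that newly visible metadata from a second database produced a second copy of every column. The database, table, and column names are all hypothetical.

```python
# Hypothetical metadata rows: (database, table, column). All names are made up.
BASE_ROWS = [
    ("db_main", "features", "feat_a"),
    ("db_main", "features", "feat_b"),
    ("db_main", "features", "feat_c"),
]
# Assumption: after the permissions change, the same columns also become
# visible through a second, underlying database.
EXTRA_ROWS = [("db_shard", table, column) for _, table, column in BASE_ROWS]


def feature_columns(visible_rows, table):
    """Mimic a metadata query that filters on table name only."""
    return [column for _, tbl, column in visible_rows if tbl == table]


def feature_columns_scoped(visible_rows, database, table):
    """Same query pinned to one database, so extra visibility is harmless."""
    return [c for db, tbl, c in visible_rows if db == database and tbl == table]


before = feature_columns(BASE_ROWS, "features")
after = feature_columns(BASE_ROWS + EXTRA_ROWS, "features")
scoped = feature_columns_scoped(BASE_ROWS + EXTRA_ROWS, "db_main", "features")

print(len(before), len(after), len(scoped))  # 3 6 3
```

The unscoped query returns every feature twice once the extra rows become visible, while the scoped variant is unaffected by the change.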

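When this bloated file was distributed to servers worldwide and they tried to load it, it hit a strict safety limit in the core proxy software: the software has a preset upper bound on the size of the feature file, and that bound was below the doubled size. So the software crashed. Because this core proxy handles traffic and touches nearly every request on the network, the failure quickly spread across the entire global network.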
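The failure mode described above can be sketched as a loader with a hard limit. This is an illustration only: the limit value, the function names, and the use of a feature count as a stand-in for file size are my assumptions, and the fallback variant is just one possible alternative design, not something Cloudflare is known to run.

```python
MAX_FEATURES = 4  # hypothetical limit; the real value is not given in this post


class FeatureFileTooLarge(Exception):
    """Raised when a feature file exceeds the preset limit."""


def load_strict(features):
    """Reject any oversized file by raising, as a stand-in for the crash."""
    if len(features) > MAX_FEATURES:
        raise FeatureFileTooLarge(f"{len(features)} features > {MAX_FEATURES}")
    return list(features)


def load_with_fallback(features, last_known_good):
    """Alternative design: keep serving the previous good file on failure."""
    try:
        return load_strict(features)
    except FeatureFileTooLarge:
        return last_known_good


good = ["feat_a", "feat_b", "feat_c"]
doubled = good * 2  # the duplicated file: 6 features, over the limit of 4

print(load_with_fallback(doubled, good))  # ['feat_a', 'feat_b', 'feat_c']
```

With `load_strict` alone, the doubled file raises and nothing gets served, which mirrors the crash. The fallback variant trades strictness for availability by continuing to use the last file that loaded successfully.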

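🌀A confusing diagnosis#

At first, Cloudflare's internal teams saw the system flapping between working and failing, and suspected a hyper-scale DDoS attack.

More dramatically, a coincidence deepened that suspicion: the official status page, which is hosted entirely outside Cloudflare's infrastructure, also went down. That led some people internally to believe for a while that an attacker might be targeting both Cloudflare's systems and its status page.

In fact, this on-again, off-again behavior was a direct clue to the real root cause. The faulty feature file was only generated when the query ran on the part of the database cluster that had already been updated. Every five minutes, the global network might receive a "good" configuration file and briefly recover, or a "bad" one and crash again. That fluctuation made it hard for the team to pin down the nature of the problem early on.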
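The flapping can be reproduced with a toy simulation. This sketch assumes a small cluster in which only some nodes have received the permissions change, and it uses round-robin node selection as a stand-in for whichever node happens to run the query in a given five-minute cycle. Both the cluster layout and the selection rule are hypothetical.

```python
from itertools import cycle

# Hypothetical cluster: True means the node already has the permissions change.
NODES_UPDATED = [True, False, False, True]


def simulate(cycles, nodes_updated=NODES_UPDATED):
    """Return the network state after each five-minute regeneration."""
    # Round-robin stands in for whichever node runs the query each cycle.
    picker = cycle(nodes_updated)
    states = []
    for _ in range(cycles):
        ran_on_updated_node = next(picker)
        # An updated node produces the oversized file, which takes the network down.
        states.append("down" if ran_on_updated_node else "up")
    return states


print(simulate(6))  # ['down', 'up', 'up', 'down', 'down', 'up']
print(simulate(3, [True, True]))  # ['down', 'down', 'down']
```

The second call shows what happens once every node has the change: the intermittent recoveries disappear and the network stays down.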

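🛠️A hard recovery#

After some investigation, the engineers finally caught the culprit: the feature file that had grown abnormally large because of the database permissions change.

At around 10:30 PM Beijing time, the rescue effort went into full swing. The team stopped the bad file from propagating further and manually inserted an earlier, known-good version into the feature file distribution queue. They then forced a restart of the affected core proxy systems across the global network. At that point, core traffic was largely back to normal.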
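The recovery steps above (stop propagation, insert a known-good file, restart consumers) can be sketched with a toy distribution queue. The class, method names, and version labels are hypothetical and only illustrate the sequence, not Cloudflare's actual tooling.

```python
from collections import deque


class Distributor:
    """Toy model of a feature-file distribution queue (all names hypothetical)."""

    def __init__(self):
        self.queue = deque()
        self.generation_enabled = True

    def publish_generated(self, version):
        # Automatic publishes are ignored once generation is halted.
        if self.generation_enabled:
            self.queue.append(version)

    def halt_generation(self):
        self.generation_enabled = False
        self.queue.clear()  # drop bad files still waiting to go out

    def pin_known_good(self, version):
        self.queue.append(version)  # manual insert bypasses the halt

    def next_file(self):
        return self.queue.popleft() if self.queue else None


dist = Distributor()
dist.publish_generated("v42-bad")  # bad file already queued
dist.halt_generation()             # step 1: stop propagation
dist.publish_generated("v43-bad")  # later automatic output is now ignored
dist.pin_known_good("v40-good")    # step 2: manually insert the good version

print(dist.next_file())  # v40-good
```

A restarted proxy that asks the queue for its next file now receives only the pinned good version.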

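Over the next few hours, the team kept working to ease the extra load placed on various parts of the network as traffic came back online. It was not until 1:06 AM Beijing time the next day that Cloudflare announced all systems were operating normally again.

This outage clearly showed how interdependent the modern Internet ecosystem is, and how far-reaching the impact of a small technical problem can be once it is amplified across global infrastructure.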

PS:

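😂A bit of humor amid the chaos#

While Cloudflare's engineers were scrambling to track down the fault, in another corner of the Internet a user named MrShibolet was having some self-deprecating fun.

During the outage, MrShibolet posted a tweet reading "First day at CloudFlare , pushed a little update and taking the afternoon off", followed by a sneaking-away "✌️" emoji.

It is worth noting that during the AWS DNS outage on October 20, 2025, MrShibolet posted an identical tweet; this time the name was simply swapped to Cloudflare (me: 🤣).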

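🖼️Memes#

I found a few memes about the incident online. They are just for laughs, so please don't read too much into them:

Let us observe a moment of silence for this incident: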

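Below is the full text of the final Cloudflare status page for this incident, as it stood after the incident was resolved.

Updates are listed in chronological order, with the converted Beijing time (UTC+8) added to each.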
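For anyone who wants to reproduce the converted timestamps, here is a minimal sketch of the UTC to Beijing time conversion. It uses a fixed UTC+8 offset, which is safe because Beijing time has no daylight saving; the function name is my own.

```python
from datetime import datetime, timedelta, timezone

# Fixed UTC+8 offset for Beijing time.
BEIJING = timezone(timedelta(hours=8), "UTC+8")


def to_beijing(utc_text):
    """Convert 'YYYY-MM-DD HH:MM' in UTC to the same format in Beijing time."""
    utc_time = datetime.strptime(utc_text, "%Y-%m-%d %H:%M").replace(tzinfo=timezone.utc)
    return utc_time.astimezone(BEIJING).strftime("%Y-%m-%d %H:%M")


print(to_beijing("2025-11-18 11:48"))  # 2025-11-18 19:48
print(to_beijing("2025-11-18 17:14"))  # 2025-11-19 01:14
```

The second call shows the date rolling over to the next day, which is why the later updates are dated 2025-11-19 in Beijing time.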

Cloudflare Global Network experiencing issues

Incident Report for Cloudflare

Incident timeline#

Investigating
Cloudflare is experiencing an internal service degradation. Some services may be intermittently impacted. We are focused on restoring service. We will update as we are able to remediate. More updates to follow shortly.

2025-11-18 19:48 UTC+8 / 2025-11-18 11:48 UTC

Update
We are continuing to investigate this issue.

2025-11-18 20:03 UTC+8 / 2025-11-18 12:03 UTC

Update
We are seeing services recover, but customers may continue to observe higher-than-normal error rates as we continue remediation efforts.

2025-11-18 20:21 UTC+8 / 2025-11-18 12:21 UTC

Update
We are continuing to investigate this issue.

2025-11-18 20:37 UTC+8 / 2025-11-18 12:37 UTC

Update
We are continuing to investigate this issue.

2025-11-18 20:53 UTC+8 / 2025-11-18 12:53 UTC

Update
During our attempts to remediate, we have disabled WARP access in London. Users in London trying to access the Internet via WARP will see a failure to connect.

2025-11-18 21:04 UTC+8 / 2025-11-18 13:04 UTC

Identified
The issue has been identified and a fix is being implemented.

2025-11-18 21:09 UTC+8 / 2025-11-18 13:09 UTC

Update
We have made changes that have allowed Cloudflare Access and WARP to recover. Error levels for Access and WARP users have returned to pre-incident rates. We have re-enabled WARP access in London. We are continuing to work towards restoring other services.

2025-11-18 21:13 UTC+8 / 2025-11-18 13:13 UTC

Update
We are continuing working on restoring service for application services customers.

2025-11-18 21:35 UTC+8 / 2025-11-18 13:35 UTC

Update
We are continuing working on restoring service for application services customers.

2025-11-18 21:58 UTC+8 / 2025-11-18 13:58 UTC

Update
We are continuing to work on a fix for this issue.

2025-11-18 22:22 UTC+8 / 2025-11-18 14:22 UTC

Update
We’ve deployed a change which has restored dashboard services. We are still working to remediate broad application services impact.

2025-11-18 22:34 UTC+8 / 2025-11-18 14:34 UTC

Monitoring
A fix has been implemented and we believe the incident is now resolved. We are continuing to monitor for errors to ensure all services are back to normal.

2025-11-18 22:42 UTC+8 / 2025-11-18 14:42 UTC

Update
Some customers may be still experiencing issues logging into or using the Cloudflare dashboard. We are working on a fix to resolve this, and continuing to monitor for any further issues.

2025-11-18 22:57 UTC+8 / 2025-11-18 14:57 UTC

Update
We are continuing to monitor for any further issues.

2025-11-18 23:23 UTC+8 / 2025-11-18 15:23 UTC

Update
The team is continuing to focus on restoring service post-fix. We are mitigating several issues that remain post-deployment.

2025-11-18 23:40 UTC+8 / 2025-11-18 15:40 UTC

Update
Bot scores will be impacted intermittently while we undergo global recovery. We will update once we believe bot scores are fully recovered.

2025-11-19 00:04 UTC+8 / 2025-11-18 16:04 UTC

Update
We continue to see errors and latency improve but still have reports of intermittent errors. The team continues to monitor the situation as it improves, and looking for ways to accelerate full recovery.

2025-11-19 00:27 UTC+8 / 2025-11-18 16:27 UTC

Update
We continue to see errors drop as we work through services globally and clearing remaining errors and latency.

2025-11-19 00:46 UTC+8 / 2025-11-18 16:46 UTC

Update
We continue to monitor the system through recovery and we are seeing errors and latency return to normal levels. A full post-incident investigation and details about the incident will be made available asap.

2025-11-19 01:14 UTC+8 / 2025-11-18 17:14 UTC

Update
Cloudflare services are currently operating normally. We are no longer observing elevated errors or latency across the network. Our engineering teams continue to closely monitor the platform and perform a deeper investigation into the earlier disruption, but no configuration changes are being made at this time. At this point, it is considered safe to re-enable any Cloudflare services that were temporarily disabled during the incident. We will provide a final update once our investigation is complete.

2025-11-19 01:44 UTC+8 / 2025-11-18 17:44 UTC

Resolved
This incident has been resolved.

2025-11-19 03:28 UTC+8 / 2025-11-18 19:28 UTC

Affected services#

Access
Bot Management
CDN/Cache
Dashboard
Firewall
Network
WARP
Workers