Summary of the Amazon DynamoDB Service Disruption in the Northern Virginia (US-EAST-1) Region

We wanted to provide you with some additional information about the service disruption that occurred in the N. Virginia (us-east-1) Region on October 19 and 20, 2025. While the event started at 11:48 PM PDT on October 19 and ended at 2:20 PM PDT on October 20, there were three distinct periods of impact to customer applications. First, between 11:48 PM on October 19 and 2:40 AM on October 20, Amazon DynamoDB experienced increased API error rates in the N. Virginia (us-east-1) Region. Second, between 5:30 AM and 2:09 PM on October 20, Network Load Balancer (NLB) experienced increased connection errors for some load balancers in the N. Virginia (us-east-1) Region. This was caused by health check failures in the NLB fleet, which resulted in increased connection errors on some NLBs. Third, between 2:25 AM and 10:36 AM on October 20, new EC2 instance launches failed and, while instance launches began to succeed from 10:37 AM, some newly launched instances experienced connectivity issues which were resolved by 1:50 PM.

DynamoDB

Between 11:48 PM PDT on October 19 and 2:40 AM PDT on October 20, customers experienced increased Amazon DynamoDB API error rates in the N. Virginia (us-east-1) Region. During this period, customers and other AWS services with dependencies on DynamoDB were unable to establish new connections to the service. The incident was triggered by a latent defect within the service’s automated DNS management system that caused endpoint resolution failures for DynamoDB.

Many of the largest AWS services rely extensively on DNS to provide seamless scale, fault isolation and recovery, low latency, and locality. Services like DynamoDB maintain hundreds of thousands of DNS records to operate a very large heterogeneous fleet of load balancers in each Region. Automation is crucial to ensuring that these DNS records are updated frequently to add additional capacity as it becomes available, to correctly handle hardware failures, and to efficiently distribute traffic to optimize customers’ experience. This automation has been designed for resilience, allowing the service to recover from a wide variety of operational issues. In addition to providing a public regional endpoint, this automation maintains additional DNS endpoints for several dynamic DynamoDB variants including a FIPS compliant endpoint, an IPv6 endpoint, and account-specific endpoints. The root cause of this issue was a latent race condition in the DynamoDB DNS management system that resulted in an incorrect empty DNS record for the service’s regional endpoint (dynamodb.us-east-1.amazonaws.com) that the automation failed to repair.

To explain this event, we need to share some details about the DynamoDB DNS management architecture. The system is split across two independent components for availability reasons. The first component, the DNS Planner, monitors the health and capacity of the load balancers and periodically creates a new DNS plan for each of the service’s endpoints consisting of a set of load balancers and weights. We produce a single regional DNS plan, as this greatly simplifies capacity management and failure mitigation when capacity is shared across multiple endpoints, as is the case with the recently launched IPv6 endpoint and the public regional endpoint. A second component, the DNS Enactor, which is designed to have minimal dependencies to allow for system recovery in any scenario, enacts DNS plans by applying the required changes in the Amazon Route53 service. For resiliency, the DNS Enactor operates redundantly and fully independently in three different Availability Zones (AZs). Each of these independent instances of the DNS Enactor looks for new plans and attempts to update Route53 by replacing the current plan with a new plan using a Route53 transaction, ensuring that each endpoint is updated with a consistent plan even when multiple DNS Enactors attempt to update it concurrently.

The race condition involves an unlikely interaction between two of the DNS Enactors. The way things normally work, a DNS Enactor picks up the latest plan and begins working through the service endpoints to apply this plan. This process typically completes rapidly and does an effective job of keeping DNS state freshly updated. Before it begins to apply a new plan, the DNS Enactor makes a one-time check that its plan is newer than the previously applied plan. As the DNS Enactor makes its way through the list of endpoints, it is possible to encounter delays as it attempts a transaction and is blocked by another DNS Enactor updating the same endpoint. In these cases, the DNS Enactor will retry each endpoint until the plan is successfully applied to all endpoints.

Right before this event started, one DNS Enactor experienced unusually high delays, needing to retry its update on several of the DNS endpoints. As it was slowly working through the endpoints, several other things were also happening. First, the DNS Planner continued to run and produced many newer generations of plans. Second, one of the other DNS Enactors then began applying one of the newer plans and rapidly progressed through all of the endpoints. The timing of these events triggered the latent race condition. When the second Enactor (applying the newest plan) completed its endpoint updates, it then invoked the plan clean-up process, which identifies plans that are significantly older than the one it just applied and deletes them. At the same time that this clean-up process was invoked, the first Enactor (which had been unusually delayed) applied its much older plan to the regional DDB endpoint, overwriting the newer plan. The check that was made at the start of the plan application process, which ensures that the plan is newer than the previously applied plan, was stale by this time due to the unusually high delays in Enactor processing. Therefore, this did not prevent the older plan from overwriting the newer plan. The second Enactor’s clean-up process then deleted this older plan because it was many generations older than the plan it had just applied. As this plan was deleted, all IP addresses for the regional endpoint were immediately removed. Additionally, because the active plan was deleted, the system was left in an inconsistent state that prevented subsequent plan updates from being applied by any DNS Enactors. This situation ultimately required manual operator intervention to correct.
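
To make the interleaving concrete, here is a minimal, deterministic Python sketch of the sequence described above. Everything in it is illustrative: the generation numbers, IP addresses, and helper names (plan_store, freshness_check, apply_to_endpoint, cleanup) are assumptions standing in for DynamoDB's real automation, not its actual implementation.

```python
# Minimal, deterministic replay of the interleaving described above. All of the
# names, generation numbers, and IPs are illustrative, not DynamoDB internals.

route53 = {}     # endpoint -> {"generation": ..., "ips": [...]} (the live DNS record)
plan_store = {}  # plan generation -> list of load balancer IPs (deleted by clean-up)
STALENESS_WINDOW = 10   # clean-up deletes plans more than 10 generations old (assumed)

def freshness_check(plan_gen):
    # One-time check made only at the START of plan application.
    applied = [rec["generation"] for rec in route53.values()]
    return plan_gen > max(applied, default=-1)

def apply_to_endpoint(endpoint, plan_gen):
    # Stand-in for a Route53 transaction: replaces whatever the endpoint holds.
    route53[endpoint] = {"generation": plan_gen, "ips": plan_store.get(plan_gen, [])}

def cleanup(just_applied_gen):
    # Delete plans that are many generations older than the one just applied.
    for gen in [g for g in plan_store if g < just_applied_gen - STALENESS_WINDOW]:
        del plan_store[gen]
        for rec in route53.values():
            if rec["generation"] == gen:   # a deleted plan still live on an endpoint
                rec["ips"] = []            # leaves that endpoint with no IP addresses

# The interleaving that reproduces the failure:
plan_store[0] = ["10.0.0.1"]                  # old plan held by the delayed Enactor
assert freshness_check(0)                     # its one-time check passes... for now
plan_store[20] = ["10.0.0.2"]                 # the DNS Planner keeps producing newer plans
assert freshness_check(20)
apply_to_endpoint("dynamodb.us-east-1", 20)   # second Enactor applies the newest plan quickly
apply_to_endpoint("dynamodb.us-east-1", 0)    # delayed Enactor finally applies its stale plan;
                                              # its freshness check is long out of date
cleanup(20)                                   # second Enactor's clean-up deletes plan 0, which
                                              # is now the live plan, emptying the record
print(route53["dynamodb.us-east-1"])          # {'generation': 0, 'ips': []}
```

Because the freshness check happens only once, before a potentially slow pass over the endpoints, it cannot protect an endpoint once a newer plan has been applied and then cleaned up mid-pass.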

When this issue occurred at 11:48 PM PDT, all systems needing to connect to the DynamoDB service in the N. Virginia (us-east-1) Region via the public endpoint immediately began experiencing DNS failures and failed to connect to DynamoDB. This included customer traffic as well as traffic from internal AWS services that rely on DynamoDB. Customers with DynamoDB global tables were able to successfully connect to and issue requests against their replica tables in other Regions, but experienced prolonged replication lag to and from the replica tables in the N. Virginia (us-east-1) Region. Engineering teams for impacted AWS services were immediately engaged and began to investigate. By 12:38 AM on October 20, our engineers had identified DynamoDB’s DNS state as the source of the outage. By 1:15 AM, the temporary mitigations that were applied enabled some internal services to connect to DynamoDB and repaired key internal tooling that unblocked further recovery. By 2:25 AM, all DNS information was restored, and all global tables replicas were fully caught up by 2:32 AM. Customers were able to resolve the DynamoDB endpoint and establish successful connections as cached DNS records expired between 2:25 AM and 2:40 AM. This completed recovery from the primary service disruption event.

Amazon EC2

Between 11:48 PM PDT on October 19 and 1:50 PM PDT on October 20, customers experienced increased EC2 API error rates, latencies, and instance launch failures in the N. Virginia (us-east-1) Region. Existing EC2 instances that had been launched prior to the start of the event remained healthy and did not experience any impact for the duration of the event. After resolving the DynamoDB DNS issue at 2:25 AM PDT, customers continued to see increased errors for launches of new instances. Recovery started at 12:01 PM PDT with full EC2 recovery occurring at 1:50 PM PDT. During this period new instance launches failed with either a “request limit exceeded” or “insufficient capacity” error.

To understand what happened, we need to share some information about a few subsystems that are used for the management of EC2 instance launches, as well as for configuring network connectivity for newly launched EC2 instances. The first subsystem is DropletWorkflow Manager (DWFM), which is responsible for the management of all the underlying physical servers that are used by EC2 for the hosting of EC2 instances – we call these servers “droplets”. The second subsystem is Network Manager, which is responsible for the management and propagation of network state to all EC2 instances and network appliances. Each DWFM manages a set of droplets within each Availability Zone and maintains a lease for each droplet currently under management. This lease allows DWFM to track the droplet state, ensuring that all actions from the EC2 API or within the EC2 instance itself, such as shutdown or reboot operations originating from the EC2 instance operating system, result in the correct state changes within the broader EC2 systems. As part of maintaining this lease, each DWFM host has to check in and complete a state check with each droplet that it manages every few minutes.
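
As a rough illustration of the lease mechanism described above, a droplet's eligibility to host new launches can be modeled as nothing more than whether its lease was renewed recently. The names and the timeout value below are assumptions, not EC2's actual DWFM implementation.

```python
# Illustrative sketch of the droplet-lease idea; names and timings are assumed.

import time

LEASE_TTL_SECONDS = 300   # "every few minutes" from the description, picked arbitrarily

class DropletLeaseTable:
    def __init__(self):
        self.lease_expiry = {}   # droplet_id -> time at which its lease lapses

    def check_in(self, droplet_id, state_check_ok):
        # A successful periodic state check renews the lease; a failed one
        # (e.g. because a dependency such as DynamoDB is unreachable) does not.
        if state_check_ok:
            self.lease_expiry[droplet_id] = time.time() + LEASE_TTL_SECONDS

    def has_active_lease(self, droplet_id):
        return self.lease_expiry.get(droplet_id, 0) > time.time()

    def launch_candidates(self, droplet_ids):
        # Only droplets with an active lease can host new instances; if none
        # qualify, launch requests surface "insufficient capacity" errors.
        return [d for d in droplet_ids if self.has_active_lease(d)]

table = DropletLeaseTable()
table.check_in("droplet-1", state_check_ok=True)
table.check_in("droplet-2", state_check_ok=False)   # failed state check: lease not renewed
print(table.launch_candidates(["droplet-1", "droplet-2"]))   # ['droplet-1']
```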

Starting at 11:48 PM PDT on October 19, these DWFM state checks began to fail as the process depends on DynamoDB and was unable to complete. While this did not affect any running EC2 instance, it did result in droplets needing to establish a new lease with a DWFM before further instance state changes could happen for the EC2 instances they were hosting. Between 11:48 PM on October 19 and 2:24 AM on October 20, leases between DWFM and droplets within the EC2 fleet slowly started to time out.

At 2:25 AM PDT, with the recovery of the DynamoDB APIs, DWFM began to re-establish leases with droplets across the EC2 fleet. Since any droplet without an active lease is not considered a candidate for new EC2 launches, the EC2 APIs were returning “insufficient capacity errors” for new incoming EC2 launch requests. DWFM began the process of reestablishing leases with droplets across the EC2 fleet; however, due to the large number of droplets, efforts to establish new droplet leases took long enough that the work could not be completed before they timed out. Additional work was queued to reattempt establishing the droplet lease. At this point, DWFM had entered a state of congestive collapse and was unable to make forward progress in recovering droplet leases. Since this situation had no established operational recovery procedure, engineers took care in attempting to resolve the issue with DWFM without causing further issues. After attempting multiple mitigation steps, at 4:14 AM engineers throttled incoming work and began selective restarts of DWFM hosts to recover from this situation. Restarting the DWFM hosts cleared out the DWFM queues, reduced processing times, and allowed droplet leases to be established. By 5:28 AM, DWFM had established leases with all droplets within the N. Virginia (us-east-1) Region and new launches were once again starting to succeed, although many requests were still seeing “request limit exceeded” errors due to the request throttling that had been introduced to reduce overall request load.
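
The dynamic described here, where re-establishing leases takes long enough that already-recovered leases keep lapsing back onto the queue, can be illustrated with a toy simulation. The droplet counts, per-tick capacity, and timeout below are made-up numbers chosen only to show the shape of the problem, not measurements of DWFM; it also shows why shrinking the backlog (the restarts) or throttling incoming work lets the system drain.

```python
# Toy simulation of the lease-recovery backlog; all numbers are invented.

def simulate(droplets, capacity, lease_timeout, ticks):
    queue = list(range(droplets))   # droplets waiting for a lease to be (re-)established
    age = {}                        # droplet -> ticks since its lease was last renewed
    for _ in range(ticks):
        budget = capacity
        # Renew the leases that are about to lapse, oldest first, while budget lasts.
        due = sorted((d for d, a in age.items() if a >= lease_timeout - 1),
                     key=age.get, reverse=True)
        for d in due[:budget]:
            age[d] = 0
        budget = max(0, budget - len(due))
        # Leases that reached the timeout without renewal lapse and re-queue.
        for d in [d for d, a in age.items() if a >= lease_timeout]:
            del age[d]
            queue.append(d)
        # Spend whatever budget remains working through the backlog.
        granted, queue = queue[:budget], queue[budget:]
        for d in granted:
            age[d] = 0
        for d in list(age):
            age[d] += 1
    return len(queue)   # droplets still without a lease (not launch candidates)

# If maintaining the already-recovered fleet consumes all capacity, the backlog
# never drains and no forward progress is made; with a smaller backlog relative
# to capacity and timeout, it reaches zero.
print(simulate(droplets=10_000, capacity=100, lease_timeout=50, ticks=500))  # stalls around 5,100
print(simulate(droplets=4_000,  capacity=100, lease_timeout=50, ticks=500))  # drains to 0
```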

When a new EC2 instance is launched, a system called Network Manager propagates the network configuration that allows the instance to communicate with other instances within the same Virtual Private Cloud (VPC), other VPC network appliances, and the Internet. At 5:28 AM PDT, shortly after the recovery of DWFM, Network Manager began propagating updated network configurations to newly launched instances and instances that had been terminated during the event. Since these network propagation events had been delayed by the issue with DWFM, a significant backlog of network state propagations needed to be processed by Network Manager within the N. Virginia (us-east-1) Region. As a result, at 6:21 AM, Network Manager started to experience increased latencies in network propagation times as it worked to process the backlog of network state changes. While new EC2 instances could be launched successfully, they would not have the necessary network connectivity due to the delays in network state propagation. Engineers worked to reduce the load on Network Manager to address network configuration propagation times and took action to accelerate recovery. By 10:36 AM, network configuration propagation times had returned to normal levels, and new EC2 instance launches were once again operating normally.
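
A simple way to see why a propagation backlog translates into hours of delay is the drain-time arithmetic below. Every number in it is hypothetical and is not a measurement of Network Manager.

```python
# Rough drain-time arithmetic for a propagation backlog (hypothetical numbers).

backlog = 2_000_000    # queued network-state propagation events (assumed)
service_rate = 300     # propagations applied per second (assumed)
arrival_rate = 150     # new propagations arriving per second from fresh launches (assumed)

# Only the surplus of service over arrivals works off the backlog.
drain_hours = backlog / (service_rate - arrival_rate) / 3600
print(f"~{drain_hours:.1f} hours to catch up")   # ~3.7 hours with these numbers
```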

The final step towards EC2 recovery was to fully remove the request throttles that had been put in place to reduce the load on the various EC2 subsystems. As API calls and new EC2 instance launch requests stabilized, at 11:23 AM PDT our engineers began relaxing request throttles as they worked towards full recovery. At 1:50 PM, all EC2 APIs and new EC2 instance launches were operating normally.

Network Load Balancer (NLB)

The delays in network state propagations for newly launched EC2 instances also caused impact to the Network Load Balancer (NLB) service and AWS services that use NLB. Between 5:30 AM and 2:09 PM PDT on October 20 some customers experienced increased connection errors on their NLBs in the N. Virginia (us-east-1) Region. NLB is built on top of a highly scalable, multi-tenant architecture that provides load balancing endpoints and routes traffic to backend targets, which are typically EC2 instances. The architecture also makes use of a separate health check subsystem that regularly executes health checks against all nodes within the NLB architecture and will remove any nodes from service that are considered unhealthy.

During the event the NLB health checking subsystem began to experience increased health check failures. This was caused by the health checking subsystem bringing new EC2 instances into service while the network state for those instances had not yet fully propagated. This meant that in some cases health checks would fail even though the underlying NLB node and backend targets were healthy. This resulted in health checks alternating between failing and healthy. This caused NLB nodes and backend targets to be removed from DNS, only to be returned to service when the next health check succeeded.
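
A toy sketch of this flapping, with invented names rather than NLB's actual health-check implementation: the backend target is genuinely healthy the whole time, but probes issued from a just-launched checker node whose network state has not yet propagated cannot reach it, so consecutive checks disagree and the target's DNS membership churns.

```python
# Toy illustration of health-check flapping; names and pattern are illustrative.

def probe(target_healthy, checker_has_network_state):
    # A probe only succeeds if its packets can actually reach the target.
    return target_healthy and checker_has_network_state

results = []
for check in range(6):
    from_new_checker = (check % 2 == 0)   # checks alternate between new and established nodes
    ok = probe(target_healthy=True, checker_has_network_state=not from_new_checker)
    results.append("in service" if ok else "removed from DNS")

print(results)   # alternates between 'removed from DNS' and 'in service' every check
```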

Our monitoring systems detected this at 6:52 AM, and engineers began working to remediate the issue. The alternating health check results increased the load on the health check subsystem, causing it to degrade, resulting in delays in health checks and triggering automatic AZ DNS failover to occur. For multi-AZ load balancers, this resulted in capacity being taken out of service. In this case, an application experienced increased connection errors if the remaining healthy capacity was insufficient to carry the application load. At 9:36 AM, engineers disabled automatic health check failovers for NLB, allowing all available healthy NLB nodes and backend targets to be brought back into service. This resolved the increased connection errors to affected load balancers. Shortly after EC2 recovered, we re-enabled automatic DNS health check failover at 2:09 PM.

Other AWS Services

Between October 19 at 11:51 PM PDT and October 20 at 2:15 PM PDT, customers experienced API errors and latencies for Lambda functions in the N. Virginia (us-east-1) Region. Initially, DynamoDB endpoint issues prevented function creation and updates, and caused processing delays for SQS/Kinesis event sources as well as invocation errors. By 2:24 AM, service operations recovered except for SQS queue processing, which remained impacted because an internal subsystem responsible for polling SQS queues failed and did not recover automatically. We restored this subsystem at 4:40 AM and processed all message backlogs by 6:00 AM. Starting at 7:04 AM, NLB health check failures triggered instance terminations, leaving a subset of Lambda internal systems under-scaled. With EC2 launches still impaired, we throttled Lambda Event Source Mappings and asynchronous invocations to prioritize latency-sensitive synchronous invocations. By 11:27 AM, sufficient capacity was restored, and errors subsided. We then gradually reduced throttling and processed all backlogs by 2:15 PM, and normal service operations resumed.

Between October 19 at 11:45 PM PDT and October 20 at 2:20 PM PDT, customers experienced container launch failures and cluster scaling delays across Amazon Elastic Container Service (ECS), Elastic Kubernetes Service (EKS), and Fargate in the N. Virginia (us-east-1) Region. These services were recovered by 2:20 PM.

Between October 19 at 11:56 PM PDT and October 20 at 1:20 PM PDT, Amazon Connect customers experienced elevated errors handling calls, chats, and cases in the N. Virginia (us-east-1) Region. Following the restoration of DynamoDB endpoints, most Connect features recovered except customers continued to experience elevated errors for chats until 5:00 AM. Starting at 7:04 AM, customers again experienced increased errors handling new calls, chats, tasks, emails, and cases, which was caused by impact to the NLBs used by Connect as well as increased error rates and latencies for Lambda function invocations. Inbound callers experienced busy tones, error messages, or failed connections. Both agent-initiated and API-initiated outbound calls failed. Answered calls experienced prompt playback failures, routing failures to agents, or dead-air audio. Additionally, agents experienced elevated errors handling contacts, and some agents were unable to sign in. Customers also faced elevated errors accessing APIs and Contact Search. Real-time, Historical dashboards, and Data Lake data updates were delayed, and all data will be backfilled by October 28. Service availability was restored at 1:20 PM as Lambda function invocation errors recovered.

Between October 19 at 11:51 PM PDT and October 20 at 9:59 AM PDT, customers experienced AWS Security Token Service (STS) API errors and latency in the N. Virginia (us-east-1) Region. STS recovered at 1:19 AM after the restoration of internal DynamoDB endpoints. Between 8:31 AM and 9:59 AM, STS API error rates and latency increased again as a result of NLB health check failures. By 9:59 AM, we recovered from the NLB health check failures, and the service began normal operations.

Between October 19 at 11:51 PM PDT and October 20 at 1:25 AM PDT, AWS customers attempting to sign into the AWS Management Console using an IAM user experienced increased authentication failures due to underlying DynamoDB issues in the N. Virginia (us-east-1) Region. Customers with IAM Identity Center configured in N. Virginia (us-east-1) Region were also unable to sign in using Identity Center. Customers using their root credential, and customers using identity federation configured to use signin.aws.amazon.com experienced errors when trying to log into the AWS Management Console in regions outside of the N. Virginia (us-east-1) Region. As DynamoDB endpoints became accessible at 1:25 AM, the service began normal operations.

Between October 19 at 11:47 PM PDT and October 20 at 2:21 AM PDT, customers experienced API errors when creating and modifying Redshift clusters or issuing queries against existing clusters in the N. Virginia (us-east-1) Region. Redshift query processing relies on DynamoDB endpoints to read and write data from clusters. As DynamoDB endpoints recovered, Redshift query operations resumed and by 2:21 AM, Redshift customers were successfully querying clusters as well as creating and modifying cluster configurations. However, some Redshift compute clusters remained impaired and unavailable for querying after the DynamoDB endpoints were restored to normal operations. As credentials expire for cluster nodes without being refreshed, Redshift automation triggers workflows to replace the underlying EC2 hosts with new instances. With EC2 launches impaired, these workflows were blocked, putting clusters in a “modifying” state that prevented query processing and making the cluster unavailable for workloads. At 6:45 AM, our engineers took action to stop the workflow backlog from growing and when Redshift clusters started to launch replacement instances at 2:46 PM, the backlog of workflows began draining. By 4:05 AM PDT October 21, AWS operators completed restoring availability for clusters impaired by replacement workflows. In addition to cluster availability impairment, between October 19 at 11:47 PM and October 20 at 1:20 AM, Amazon Redshift customers in all AWS Regions were unable to use IAM user credentials for executing queries due to a Redshift defect that used an IAM API in the N. Virginia (us-east-1) Region to resolve user groups. As a result, IAM’s impairment during this period caused Redshift to be unable to execute these queries. Redshift customers in AWS Regions who use “local” users to connect to their Redshift clusters were unaffected.

Between October 19 at 11:48 PM PDT and October 20 at 2:40 AM PDT, customers were unable to create, view, and update support cases through the AWS Support Console and API. While the Support Center successfully failed over to another region as designed, a subsystem responsible for account metadata began providing responses that prevented legitimate users from accessing the AWS Support Center. While we have designed the Support Center to bypass this system if responses were unsuccessful, in this event, this subsystem was returning invalid responses. These invalid responses resulted in the system unexpectedly blocking legitimate users from accessing support case functions. The issue was mitigated at 2:40 AM, and we took additional actions to prevent recurrence at 2:58 AM.

Other AWS services that rely on DynamoDB, new EC2 instance launches, Lambda invocations, and Fargate task launches, such as Managed Workflows for Apache Airflow and Outposts lifecycle operations, were also impacted in the N. Virginia (us-east-1) Region. Refer to the event history for the full list of impacted services.

We are making several changes as a result of this operational event. We have already disabled the DynamoDB DNS Planner and the DNS Enactor automation worldwide. In advance of re-enabling this automation, we will fix the race condition scenario and add additional protections to prevent the application of incorrect DNS plans. For NLB, we are adding a velocity control mechanism to limit the capacity a single NLB can remove when health check failures cause AZ failover. For EC2, we are building an additional test suite to augment our existing scale testing, which will exercise the DWFM recovery workflow to identify any future regressions. We will improve the throttling mechanism in our EC2 data propagation systems to rate limit incoming work based on the size of the waiting queue to protect the service during periods of high load. Finally, as we continue to work through the details of this event across all AWS services, we will look for additional ways to avoid impact from a similar event in the future, and how to further reduce time to recovery.
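
The two control mechanisms mentioned above can be sketched generically. These are hedged illustrations of the general patterns (queue-depth-based admission control and a cap on how much capacity health-driven failover may remove at once), not AWS's actual designs, and the thresholds are assumed values.

```python
# Generic sketches of the two controls, with assumed thresholds.

MAX_QUEUE_DEPTH = 10_000       # assumed backlog threshold for admission control
MAX_REMOVAL_FRACTION = 0.33    # assumed cap on capacity removable by failover at once

def admit_work(queue_depth):
    # Reject or defer new work when the backlog is already deep, so the system
    # keeps draining instead of sliding into congestive collapse.
    return queue_depth < MAX_QUEUE_DEPTH

def nodes_to_remove(unhealthy_count, total_nodes):
    # Velocity control: never let health-check-driven failover pull more than a
    # fixed fraction of a load balancer's capacity out of service at one time.
    return min(unhealthy_count, int(total_nodes * MAX_REMOVAL_FRACTION))

print(admit_work(queue_depth=2_500))                        # True  - backlog is manageable
print(admit_work(queue_depth=50_000))                       # False - shed load, keep draining
print(nodes_to_remove(unhealthy_count=9, total_nodes=12))   # 3, not all 9 at once
```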

In closing

We apologize for the impact this event caused our customers. While we have a strong track record of operating our services with the highest levels of availability, we know how critical our services are to our customers, their applications and end users, and their businesses. We know this event impacted many customers in significant ways. We will do everything we can to learn from this event and use it to improve our availability even further.
