
2025 Cloud Outages: What the Year’s Biggest Failures Taught Leaders

If you ran on AWS, Azure, or Google Cloud in 2025, you likely noticed it. Outages didn’t stay contained to one service or one layer. A regional disruption, an authorization failure, or an edge provider outage could break logins, stall APIs, and freeze transactions even when internal dashboards looked normal.
2025 also made one thing harder to ignore: today’s cloud services extend beyond a single cloud account. Your applications still depend on DNS, identity and authorization, CDNs, carrier routes, SaaS APIs, and cloud backbones. When one of those external dependencies degrades, customers often feel it first.
This post reviews three major 2025 incidents and the pattern they share. You’ll learn what failed, how the failure reached customer workflows, and what leaders should change in monitoring and incident response. You’ll also see how WanAware adds external dependency visibility so teams can pinpoint the failing provider or path segment and communicate impact fast.
Outage #1: The AWS US-EAST-1 Disruption That Cascaded Across Core Services
On October 19–20, 2025, AWS reported a service disruption in Northern Virginia (US-EAST-1) that began with Amazon DynamoDB endpoint resolution failures. A defect in DynamoDB’s automated DNS management produced an empty DNS record for the regional endpoint, which prevented systems from resolving the service and establishing new connections.
From a customer perspective, this did not look like a single-service problem. Teams saw symptoms spread across critical workflows, including sign-in issues and downstream service errors, even when many internal signals looked normal.
AWS described multiple impact windows tied to DynamoDB connectivity, EC2 launch workflows, and Network Load Balancer behavior. Customers experienced increased DynamoDB API error rates, failures to establish new connections, and follow-on effects that included EC2 API errors and instance launch failures. AWS also reported increased connection errors for some Network Load Balancers during the event.
Leadership takeaway: when your monitoring ends at your cloud boundary, it can be hard to explain why customers fail first. You need visibility into the dependencies and resolution paths that connect users and services to managed endpoints.
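To make that concrete, here is a minimal sketch of an external DNS resolution check using only the Python standard library. The endpoint names and alert logic are illustrative, not a description of how any provider or product implements this; the point is simply to watch, from a vantage point outside your own stack, whether a managed regional endpoint still resolves to at least one address.

```python
import socket
import time

# Illustrative endpoints to watch; substitute the managed endpoints your
# workloads actually depend on.
ENDPOINTS = [
    "dynamodb.us-east-1.amazonaws.com",
    "sts.us-east-1.amazonaws.com",
]

def resolve(hostname: str) -> list[str]:
    """Return the set of addresses a hostname currently resolves to."""
    try:
        infos = socket.getaddrinfo(hostname, 443, proto=socket.IPPROTO_TCP)
        return sorted({info[4][0] for info in infos})
    except socket.gaierror:
        # Empty answers, NXDOMAIN, or resolver failure all surface here.
        return []

if __name__ == "__main__":
    for host in ENDPOINTS:
        started = time.monotonic()
        addresses = resolve(host)
        elapsed_ms = (time.monotonic() - started) * 1000
        if not addresses:
            print(f"ALERT {host}: no addresses returned ({elapsed_ms:.0f} ms)")
        else:
            print(f"OK    {host}: {len(addresses)} addresses ({elapsed_ms:.0f} ms)")
```

Run from several regions or networks, a check like this tells you whether the resolution path itself is failing before you start digging through application dashboards.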
Where WanAware helps is on the customer path. WanAware can map which upstream dependency is failing, how that failure impacts specific workflows, and which users or regions see the most impact. That supports faster escalation and clearer stakeholder communication because you can point to evidence, not guesses.
Outage #2: The Google Cloud IAM Disruption That Broke Authorization Upstream
On June 12, 2025, Google Cloud reported a major incident triggered by an invalid automated quota update to its API management system. The faulty change caused many external API requests to be rejected, which disrupted workloads that rely on Google Cloud services for authorization checks.
For many teams, the failure pattern looked like an application outage even when their own compute and storage were not the root cause. Users could not reliably complete sign-in or access flows when upstream authorization failed, because services could not determine what authenticated users or services were allowed to do.
Operational lesson: identity and authorization sit on the critical path. When they degrade, customer workflows can fail quickly, and internal health metrics may not tell you where the failure started.
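One way to act on that lesson is a synthetic probe that exercises the upstream authorization dependency on its own, separate from your application health checks. The sketch below is illustrative: the URL is hypothetical, and a real probe would authenticate and hit the specific token or policy-check endpoint your workflows depend on.

```python
import urllib.error
import urllib.request

# Hypothetical upstream authorization endpoint; replace with the token or
# policy-check URL your workflows actually call.
AUTH_CHECK_URL = "https://auth.example.com/v1/token/introspect"
TIMEOUT_SECONDS = 5

def check_authorization_path(url: str) -> tuple[bool, str]:
    """Probe the upstream authorization dependency independently of app health."""
    request = urllib.request.Request(url, method="GET")
    try:
        with urllib.request.urlopen(request, timeout=TIMEOUT_SECONDS) as response:
            return True, f"HTTP {response.status}"
    except urllib.error.HTTPError as err:
        # Unexpected rejections on a known-good probe point at the upstream
        # authorization layer rather than your own services.
        return False, f"HTTP {err.code}"
    except (urllib.error.URLError, TimeoutError) as err:
        return False, f"unreachable: {err}"

if __name__ == "__main__":
    healthy, detail = check_authorization_path(AUTH_CHECK_URL)
    print(("OK " if healthy else "ALERT ") + f"authorization path: {detail}")
```

If this probe fails while your compute and storage metrics stay green, you have early evidence that the incident started upstream of your code.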
Here again, WanAware’s value is on the customer path. It can show when a failure begins upstream, which dependency is rejecting requests, which paths are affected, and which workflows are actually breaking. That turns incident response into evidence-driven triage instead of internal guesswork.
Outage #3: The Cloudflare Edge Degradation That Slowed and Failed Requests
On November 18, 2025, Cloudflare published an incident postmortem describing a major outage that caused HTTP errors and significant increases in CDN response latency. For many organizations, requests slowed or failed before they reliably reached the origin. Customer workflows broke even when the application itself was not the first system to fail.
Cloudflare tied the trigger to a latent bug in Bot Management configuration generation that hit limits in its proxy engine. During the impact window, debugging and observability systems also added processing that increased CPU contention and contributed to higher latency at the edge.
Leader lesson: your CDN and edge security layer is part of your uptime. When it degrades, users experience timeouts, slow loads, and failed sessions, and internal dashboards can stay calm while the edge path fails upstream.
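A simple way to localize that kind of failure is to probe the same health endpoint twice: once through the CDN-fronted hostname and once against the origin directly. The sketch below assumes hypothetical hostnames and a direct origin route that bypasses the edge; if the origin answers quickly while the edge path errors or slows sharply, the degradation is upstream of your application.

```python
import time
import urllib.error
import urllib.request

# Hypothetical hostnames: the public, CDN-fronted entry point and a direct
# origin health endpoint that bypasses the edge layer.
EDGE_URL = "https://www.example.com/healthz"
ORIGIN_URL = "https://origin.example.com/healthz"
TIMEOUT_SECONDS = 10

def probe(url):
    """Return (status_or_None, latency_ms) for a single request."""
    started = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=TIMEOUT_SECONDS) as response:
            status = response.status
    except urllib.error.HTTPError as err:
        status = err.code
    except (urllib.error.URLError, TimeoutError):
        status = None
    return status, (time.monotonic() - started) * 1000

if __name__ == "__main__":
    edge_status, edge_ms = probe(EDGE_URL)
    origin_status, origin_ms = probe(ORIGIN_URL)
    print(f"edge:   status={edge_status} latency={edge_ms:.0f} ms")
    print(f"origin: status={origin_status} latency={origin_ms:.0f} ms")
    # If the origin looks healthy while the edge path errors or slows sharply,
    # the degradation is upstream of your application.
    if origin_status == 200 and (edge_status != 200 or edge_ms > 3 * origin_ms):
        print("ALERT: degradation appears to start at the edge layer")
```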
WanAware’s role here is the same: it can show when failures begin upstream of your services, which external dependency is degrading, and which paths and workflows see the worst impact. That makes triage and communication faster because you can point to evidence, not speculation.
What These Outages Had in Common
In each of the incidents above, the initiating failure occurred outside the customer’s application code and internal infrastructure.
Customers still felt the consequences, but many teams lacked a fast way to answer the first question leaders ask during an incident: where did the customer path start failing?
Many teams can map internal dependencies like “service A calls service B.” Fewer teams can map the external dependencies that shape real customer experience, such as:
- this region relies on that DNS chain
- this workflow depends on that authorization service
- this customer cohort routes through that ISP or carrier
- this API journey touches upstream SaaS services
- this login path relies on that edge network
That external dependency map is what many teams do not model in a way they can use during an incident. WanAware surfaces and operationalizes it.
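At its simplest, such a model is just a structured map from customer workflows to the external layers they traverse, queryable during an incident. The sketch below is purely illustrative; every workflow, provider, and layer name is an example, not a description of WanAware’s data model.

```python
# Illustrative only: workflow, provider, and region names are examples.
EXTERNAL_DEPENDENCY_MAP = {
    "checkout": {
        "dns": ["api.example.com resolver chain"],
        "authorization": ["identity-provider token endpoint"],
        "edge": ["CDN / bot-management layer"],
        "upstream_apis": ["payments SaaS", "tax calculation SaaS"],
        "network": ["customer ISP routes", "cloud backbone us-east-1"],
    },
    "login": {
        "dns": ["auth.example.com"],
        "authorization": ["identity-provider OIDC endpoints"],
        "edge": ["CDN / bot-management layer"],
        "upstream_apis": [],
        "network": ["customer ISP routes"],
    },
}

def dependents_of(layer: str, name_fragment: str) -> list[str]:
    """Which customer workflows break if this external dependency degrades?"""
    return [
        workflow
        for workflow, layers in EXTERNAL_DEPENDENCY_MAP.items()
        if any(name_fragment in dep for dep in layers.get(layer, []))
    ]

# Example: during an edge incident, list every workflow in the blast radius.
print(dependents_of("edge", "CDN"))
```

Even a rough version of this map turns the first ten minutes of an incident from guesswork into a lookup.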
What Leaders Should Change After 2025
2025 made the reliability conversation more concrete. Here are three changes leaders can make now:
- Expand your “system boundary” to include external dependencies. Your real architecture includes identity and authorization, DNS, CDNs, carriers, cloud backbones, SaaS partners, and upstream APIs. Many teams still don’t monitor these dependencies end to end.
- Start incident response on the customer path, not inside the cluster. Many incidents begin in shared services, third-party layers, or internet paths. Internal dashboards can stay calm while users hit failures.
- Update SLOs and incident comms to reflect the real transaction path. If critical workflows depend on third-party systems and network paths, reliability can’t be defined only inside your VPC or cluster. Measure and report reliability across the full path users must traverse, as the sketch after this list illustrates.
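The arithmetic behind that last point is worth spelling out. A customer transaction succeeds only if every segment of the path succeeds, so full-path availability is roughly the product of the per-segment availabilities. The numbers below are illustrative, not measurements.

```python
# Illustrative per-segment availability over a reporting window (fractions,
# not real measurements). The customer-facing SLO is the product of every
# segment the transaction must traverse, not just your own services.
segment_availability = {
    "dns_resolution": 0.9995,
    "edge_delivery": 0.9990,
    "authorization": 0.9992,
    "application": 0.9998,
    "upstream_saas_api": 0.9985,
}

full_path_availability = 1.0
for segment, availability in segment_availability.items():
    full_path_availability *= availability

print(f"full customer-path availability: {full_path_availability:.4%}")
# ~99.60% here, even though every individual segment is at 99.85% or better.
```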
How WanAware Helps During Cloud Outages
WanAware does not replace your observability stack. It completes it by adding external dependency visibility outside your boundary.
During cloud outages, WanAware can provide:
- dependency maps showing which external systems are failing
- triage support that helps isolate where issues start and reduce noise
- carrier and routing diagnostics indicating whether user paths are impaired
- upstream service visibility tied directly to user symptoms
- impact analysis showing which services, regions, and workflows are affected
- evidence for escalation to cloud, SaaS, or network providers
- clearer communication to executives and customers
This changes the incident conversation. Instead of starting with “What is wrong with our system?”, teams can start with “Which external dependency is failing, and which workflows does it break?” That reduces wasted effort and helps teams act faster.
The Future of Reliability After 2025
The takeaway from 2025 is pretty clear: your dashboards don’t show the whole story. Often, the problem is somewhere between the user and your app, not inside your cluster.
Going into 2026, the teams that get faster at incidents will be the ones who track those outside dependencies, plan for internet-path failures, and have a way to prove where the break starts.
That’s the lane WanAware is built for. It helps you see the external path so you can stop guessing and respond with evidence.
See your external dependencies in real time.
Start a free trial and map the customer path your dashboards can’t show.
