
2025 Cloud Outages: What the Year’s Biggest Failures Taught Leaders

If you ran on AWS, Azure, or Google Cloud in 2025, you likely noticed it. Outages didn’t stay contained to one service or one layer. A regional disruption, an authorization failure, or an edge provider outage could break logins, stall APIs, and freeze transactions even when internal dashboards looked normal.
2025 also made one thing harder to ignore: today’s cloud services extend beyond a single cloud account. Your applications still depend on DNS, identity and authorization, CDNs, carrier routes, SaaS APIs, and cloud backbones. When one of those external dependencies degrades, customers often feel it first.
This post reviews three major 2025 incidents and the pattern they share. You’ll learn what failed, how the failure reached customer workflows, and what leaders should change in monitoring and incident response. You’ll also see how WanAware adds external dependency visibility so teams can pinpoint the failing provider or path segment and communicate impact fast.
Outage #1: The AWS US-EAST-1 Disruption That Cascaded Across Core Services
On October 19–20, 2025, AWS reported a service disruption in Northern Virginia (US-EAST-1) that began with Amazon DynamoDB endpoint resolution failures. A defect in DynamoDB’s automated DNS management produced an empty DNS record for the regional endpoint, which prevented systems from resolving the service and establishing new connections.
From a customer perspective, this did not look like a single-service problem. Teams saw symptoms spread across critical workflows, including sign-in issues and downstream service errors, even when many internal signals looked normal.
AWS described multiple impact windows tied to DynamoDB connectivity, EC2 launch workflows, and Network Load Balancer behavior. Customers experienced increased DynamoDB API error rates, failures to establish new connections, and follow-on effects that included EC2 API errors and instance launch failures. AWS also reported increased connection errors for some Network Load Balancers during the event.
Leadership takeaway: when your monitoring ends at your cloud boundary, it can be hard to explain why customers fail first. You need visibility into the dependencies and resolution paths that connect users and services to managed endpoints.
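To make that concrete, here is a minimal sketch of an external DNS resolution check using only the Python standard library. The endpoint names and alert logic are illustrative, not a description of how any provider or product implements this; the point is simply to watch, from a vantage point outside your own stack, whether a managed regional endpoint still resolves to at least one address.

```python
import socket
import time

# Illustrative endpoints to watch; substitute the managed endpoints your
# workloads actually depend on.
ENDPOINTS = [
    "dynamodb.us-east-1.amazonaws.com",
    "sts.us-east-1.amazonaws.com",
]

def resolve(hostname: str) -> list[str]:
    """Return the set of addresses a hostname currently resolves to."""
    try:
        infos = socket.getaddrinfo(hostname, 443, proto=socket.IPPROTO_TCP)
        return sorted({info[4][0] for info in infos})
    except socket.gaierror:
        # Empty answers, NXDOMAIN, or resolver failure all surface here.
        return []

if __name__ == "__main__":
    for host in ENDPOINTS:
        started = time.monotonic()
        addresses = resolve(host)
        elapsed_ms = (time.monotonic() - started) * 1000
        if not addresses:
            print(f"ALERT {host}: no addresses returned ({elapsed_ms:.0f} ms)")
        else:
            print(f"OK    {host}: {len(addresses)} addresses ({elapsed_ms:.0f} ms)")
```

Run from several regions or networks, a check like this tells you whether the resolution path itself is failing before you start digging through application dashboards.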
Where WanAware helps is on the customer path. WanAware can map which upstream dependency is failing, how that failure impacts specific workflows, and which users or regions see the most impact. That supports faster escalation and clearer stakeholder communication because you can point to evidence, not guesses.
Outage #2: The Google Cloud IAM Disruption That Broke Authorization Upstream
On June 12, 2025, Google Cloud reported a major incident triggered by an invalid automated quota update to its API management system. The faulty change caused many external API requests to be rejected, which disrupted workloads that rely on Google Cloud services for authorization checks.
For many teams, the failure pattern looked like an application outage even when their own compute and storage were not the root cause. Users could not reliably complete sign-in or access flows when upstream authorization failed, because services could not determine what authenticated users or services were allowed to do.
Operational lesson: identity and authorization sit on the critical path. When they degrade, customer workflows can fail quickly, and internal health metrics may not tell you where the failure started.
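One way to act on that lesson is a synthetic probe that exercises the upstream authorization dependency on its own, separate from your application health checks. The sketch below is illustrative: the URL is hypothetical, and a real probe would authenticate and hit the specific token or policy-check endpoint your workflows depend on.

```python
import urllib.error
import urllib.request

# Hypothetical upstream authorization endpoint; replace with the token or
# policy-check URL your workflows actually call.
AUTH_CHECK_URL = "https://auth.example.com/v1/token/introspect"
TIMEOUT_SECONDS = 5

def check_authorization_path(url: str) -> tuple[bool, str]:
    """Probe the upstream authorization dependency independently of app health."""
    request = urllib.request.Request(url, method="GET")
    try:
        with urllib.request.urlopen(request, timeout=TIMEOUT_SECONDS) as response:
            return True, f"HTTP {response.status}"
    except urllib.error.HTTPError as err:
        # Unexpected rejections on a known-good probe point at the upstream
        # authorization layer rather than your own services.
        return False, f"HTTP {err.code}"
    except (urllib.error.URLError, TimeoutError) as err:
        return False, f"unreachable: {err}"

if __name__ == "__main__":
    healthy, detail = check_authorization_path(AUTH_CHECK_URL)
    print(("OK " if healthy else "ALERT ") + f"authorization path: {detail}")
```

If this probe fails while your compute and storage metrics stay green, you have early evidence that the incident started upstream of your code.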
Here again, WanAware’s value is on the customer path. It can show when a failure begins upstream, which dependency is rejecting requests, which paths are affected, and which workflows are actually breaking. That turns incident response into evidence-driven triage instead of internal guesswork.
Outage #3: The Cloudflare Edge Degradation That Slowed and Failed Requests
On November 18, 2025, Cloudflare published an incident postmortem describing a major outage that caused HTTP errors and significant increases in CDN response latency. For many organizations, requests slowed or failed before they reliably reached the origin. Customer workflows broke even when the application itself was not the first system to fail.
Cloudflare tied the trigger to a latent bug in Bot Management configuration generation that hit limits in its proxy engine. During the impact window, debugging and observability systems also added processing that increased CPU contention and contributed to higher latency at the edge.
Leader lesson: your CDN and edge security layer is part of your uptime. When it degrades, users experience timeouts, slow loads, and failed sessions, and internal dashboards can stay calm while the edge path fails upstream.
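A simple way to localize that kind of failure is to probe the same health endpoint twice: once through the CDN-fronted hostname and once against the origin directly. The sketch below assumes hypothetical hostnames and a direct origin route that bypasses the edge; if the origin answers quickly while the edge path errors or slows sharply, the degradation is upstream of your application.

```python
import time
import urllib.error
import urllib.request

# Hypothetical hostnames: the public, CDN-fronted entry point and a direct
# origin health endpoint that bypasses the edge layer.
EDGE_URL = "https://www.example.com/healthz"
ORIGIN_URL = "https://origin.example.com/healthz"
TIMEOUT_SECONDS = 10

def probe(url):
    """Return (status_or_None, latency_ms) for a single request."""
    started = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=TIMEOUT_SECONDS) as response:
            status = response.status
    except urllib.error.HTTPError as err:
        status = err.code
    except (urllib.error.URLError, TimeoutError):
        status = None
    return status, (time.monotonic() - started) * 1000

if __name__ == "__main__":
    edge_status, edge_ms = probe(EDGE_URL)
    origin_status, origin_ms = probe(ORIGIN_URL)
    print(f"edge:   status={edge_status} latency={edge_ms:.0f} ms")
    print(f"origin: status={origin_status} latency={origin_ms:.0f} ms")
    # If the origin looks healthy while the edge path errors or slows sharply,
    # the degradation is upstream of your application.
    if origin_status == 200 and (edge_status != 200 or edge_ms > 3 * origin_ms):
        print("ALERT: degradation appears to start at the edge layer")
```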
WanAware’s role here is the same: it can show when failures begin upstream of your services, which external dependency is degrading, and which paths and workflows see the worst impact. That makes triage and communication faster because you can point to evidence, not speculation.
What These Outages Had in Common
In each of the incidents above, the initiating failure occurred outside the customer’s application code and internal infrastructure.
Customers still felt the consequences, but many teams lacked a fast way to answer the first question leaders ask during an incident: where did the customer path start failing?
Many teams can map internal dependencies like “service A calls service B.” Fewer teams can map the external dependencies that shape real customer experience, such as:
- this region relies on that DNS chain
- this workflow depends on that authorization service
- this customer cohort routes through that ISP or carrier
- this API journey touches upstream SaaS services
- this login path relies on that edge network
That external dependency map is what many teams do not model in a way they can use during an incident. WanAware surfaces and operationalizes it.
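At its simplest, such a model is just a structured map from customer workflows to the external layers they traverse, queryable during an incident. The sketch below is purely illustrative; every workflow, provider, and layer name is an example, not a description of WanAware’s data model.

```python
# Illustrative only: workflow, provider, and region names are examples.
EXTERNAL_DEPENDENCY_MAP = {
    "checkout": {
        "dns": ["api.example.com resolver chain"],
        "authorization": ["identity-provider token endpoint"],
        "edge": ["CDN / bot-management layer"],
        "upstream_apis": ["payments SaaS", "tax calculation SaaS"],
        "network": ["customer ISP routes", "cloud backbone us-east-1"],
    },
    "login": {
        "dns": ["auth.example.com"],
        "authorization": ["identity-provider OIDC endpoints"],
        "edge": ["CDN / bot-management layer"],
        "upstream_apis": [],
        "network": ["customer ISP routes"],
    },
}

def dependents_of(layer: str, name_fragment: str) -> list[str]:
    """Which customer workflows break if this external dependency degrades?"""
    return [
        workflow
        for workflow, layers in EXTERNAL_DEPENDENCY_MAP.items()
        if any(name_fragment in dep for dep in layers.get(layer, []))
    ]

# Example: during an edge incident, list every workflow in the blast radius.
print(dependents_of("edge", "CDN"))
```

Even a rough version of this map turns the first ten minutes of an incident from guesswork into a lookup.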
What Leaders Should Change After 2025
2025 made the reliability conversation more concrete. Here are three changes leaders can make now:
- Expand your “system boundary” to include external dependencies. Your real architecture includes identity and authorization, DNS, CDNs, carriers, cloud backbones, SaaS partners, and upstream APIs. Many teams still don’t monitor these dependencies end to end.
- Start incident response on the customer path, not inside the cluster. Many incidents begin in shared services, third-party layers, or internet paths. Internal dashboards can stay calm while users hit failures.
- Update SLOs and incident comms to reflect the real transaction path. If critical workflows depend on third-party systems and network paths, reliability can’t be defined only inside your VPC or cluster. Measure and report reliability across the full path users must traverse, as the sketch after this list illustrates.
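The arithmetic behind that last point is worth spelling out. A customer transaction succeeds only if every segment of the path succeeds, so full-path availability is roughly the product of the per-segment availabilities. The numbers below are illustrative, not measurements.

```python
# Illustrative per-segment availability over a reporting window (fractions,
# not real measurements). The customer-facing SLO is the product of every
# segment the transaction must traverse, not just your own services.
segment_availability = {
    "dns_resolution": 0.9995,
    "edge_delivery": 0.9990,
    "authorization": 0.9992,
    "application": 0.9998,
    "upstream_saas_api": 0.9985,
}

full_path_availability = 1.0
for segment, availability in segment_availability.items():
    full_path_availability *= availability

print(f"full customer-path availability: {full_path_availability:.4%}")
# ~99.60% here, even though every individual segment is at 99.85% or better.
```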
How WanAware Helps During Cloud Outages
WanAware does not replace your observability stack. It completes it by adding external dependency visibility outside your boundary.
During cloud outages, WanAware can provide:
- dependency maps showing which external systems are failing
- triage support that helps isolate where issues start and reduce noise
- carrier and routing diagnostics indicating whether user paths are impaired
- upstream service visibility tied directly to user symptoms
- impact analysis showing which services, regions, and workflows are affected
- evidence for escalation to cloud, SaaS, or network providers
- clearer communication to executives and customers
This changes the incident conversation. Instead of starting with “What is wrong with our system?”, teams can start with “Which external dependency is failing, and which workflows does it break?” That reduces wasted effort and helps teams act faster.
The Future of Reliability After 2025
The takeaway from 2025 is pretty clear: your dashboards don’t show the whole story. Often, the problem is somewhere between the user and your app, not inside your cluster.
Going into 2026, the teams that get faster at incidents will be the ones who track those outside dependencies, plan for internet-path failures, and have a way to prove where the break starts.
That’s the lane WanAware is built for. It helps you see the external path so you can stop guessing and respond with evidence.
See your external dependencies in real time.
Start a free trial and map the customer path your dashboards can’t show.
