How to Get a Refund for AWS’s Oct 20, 2025 Outage – A Cloud Architect’s Advice

On October 20, 2025, Amazon Web Services suffered a massive outage centered on its US-EAST-1 region, causing global disruption to thousands of businesses and applications. In total, the incident lasted roughly 15 hours from the initial fault to full normalization, with the most severe downtime occurring in the first 3–4 hours. This “cloudquake” knocked out major websites (including Amazon’s own retail site), fintech apps, government services, and even smart home devices on an unprecedented scale.

Below, we outline what happened, why it was so impactful – and, most importantly, how your business can claim refunds or credits from AWS for the downtime.

We’ll also share RocketEdge’s recommendations on mitigating future outages, including why we believe Microsoft Azure now offers a more resilient and innovative cloud alternative.

The October 20 AWS Outage: What Happened and Why

Scope & Duration: AWS’s status updates indicate the trouble began just before midnight Pacific Time on the night of Oct 19/20 with increased error rates in US-EAST-1. By ~2:00 AM PDT, engineers had pinpointed a DNS resolution failure for the DynamoDB service endpoint as the likely root cause. In other words, AWS’s internal Domain Name System – the service that routes requests to the correct servers – broke in Northern Virginia, and cascading failures rippled outward. Much of the internet “went dark” during roughly two and a half hours of severe disruption (about 06:50–09:24 UTC) until AWS mitigated the DNS issue. Partial recovery followed, but AWS continued throttling some services (like EC2 instance launches) for hours, with full restoration announced around 6:53 PM ET (22:53 UTC) that day. All told, AWS’s core infrastructure was impaired for ~15 hours, and user-visible issues persisted into the next day as backlogs were processed.

Root Cause: AWS confirmed the culprit was a “simple” DNS error in an internal network subsystem. Specifically, the DynamoDB API endpoint in US-EAST-1 couldn’t resolve, essentially “poisoning” the name lookup for a critical database service. Because DynamoDB is a foundational component that many other AWS services (and customer applications) depend on, this one glitch cascaded into widespread failures. In essence, a single point of failure in AWS’s control plane took down multiple subsystems. Analysts noted that AWS’s reliance on its Virginia region for global services (IAM authentication, global DynamoDB tables, CloudFront, etc.) is an Achilles heel – when US-EAST-1 sneezes, the whole cloud catches a cold. AWS later revealed the DNS issue was triggered by a bug in a network load balancer health check subsystem. Once the name resolution broke, servers couldn’t find the right endpoints and “cascading failures took down services across the internet”. This highlights how a hidden dependency in cloud architecture (in this case, a regional DNS service) can unravel resilience measures.

Global Chaos: The outage’s impact was staggeringly global. Downdetector recorded over 6.5 million problem reports from users worldwide, and 1,000+ companies were affected. Major platforms like Reddit, Snapchat, WhatsApp, Fortnite, Zoom, Coinbase, Ring, Alexa, and even banking systems (e.g. Lloyds Bank in the UK) went offline or malfunctioned. In effect, a huge chunk of the internet relied on AWS and went down with it. Even services in other regions felt the pain because so many apps default to using US-EAST-1 for core functions. For many end-users, it was the largest internet disruption since the CrowdStrike outage of 2024. Workers from London to Tokyo were knocked offline, unable to do routine tasks like processing payments or rebooking travel. It became clear that AWS’s promise of regional isolation didn’t help here – the blast radius was global due to centralized services.

Botched Communications: To make matters worse, AWS’s Service Health Dashboard initially gave no warning. For about 75 minutes into the incident, the status page still showed “All services operating normally,” leaving customers in the dark. (Ironically, AWS had previously pledged to improve its slow outage status updates.) By 1:26 AM PDT, they finally acknowledged “significant error rates” in DynamoDB on the status page, and by ~3:53 PM PDT posted the all-clear. The delayed communication frustrated many CTOs – a reminder that independent monitoring is essential (discussed later).

Outrageous Side Effects: From Smart Beds to Roasting Mattresses

The outage didn’t just hit websites – it wreaked havoc on IoT and smart devices in almost comical ways. For example, owners of the pricey Eight Sleep “Pod” smart mattresses found their beds had no offline mode and went haywire. The cloud-controlled beds overheated to 110°F and got stuck in bizarre inclined positions, literally “roasting” some sleepers and leaving others tilted upright. The CEO of Eight Sleep publicly apologized as angry customers reported waking up in pools of sweat at 3 A.M., unable to cool their $2,600 beds or even flatten them. Amazon’s own smart home gear was not spared either – Ring security cameras and Alexa voice assistants went dead during the outage, leaving some users unable to turn off lights except by hand.

It’s one thing for a doorbell camera to go offline for a while, but quite another for your bed to start cooking you in your sleep. This anecdote underscores a serious point: many modern products over-rely on cloud services with no local fallback. When AWS went down, it exposed this Achilles heel of IoT. (Eight Sleep has since rolled out an “Outage Access” patch so the app can directly control the bed locally during cloud failures.) For business leaders, the takeaway is that critical functions – whether in fintech software or in a smart appliance – shouldn’t have a single cloud dependency with no Plan B.

Getting a Refund: AWS’s Service Credit Policy Explained

If your organization was impacted by this outage, can you get a refund? The short answer: not a cash refund, but you can request service credits under AWS’s SLA (Service Level Agreement). AWS’s standard uptime guarantee for multi-AZ deployments is 99.99% monthly availability. When they fail to meet that, they offer service credits on a sliding scale:

  • Monthly Uptime ≥ 99.0% but < 99.99% – 10% service credit
  • Monthly Uptime ≥ 95.0% but < 99.0% – 30% service credit
  • Monthly Uptime < 95.0% – 100% credit (essentially a full month’s bill).

The October 20 outage lasted ~15 hours of serious disruption. In a 31-day month (744 hours), that works out to roughly 98% uptime for the month, well below the 99.99% threshold. Thus, most AWS customers in US-EAST-1 qualify for a 30% credit for October (since uptime fell below 99.0% but stayed above 95%). In some cases, if portions of your infrastructure were completely unavailable for even longer (e.g. individual EC2 instances that also breached the separate instance-level SLA), you might claim instance-level credits as well. But generally, a 30% credit is the expected compensation for this event, rather than the maximum 100%. (To put it bluntly, AWS could suffer another similar outage in the same month and still not owe 100% credit unless total downtime exceeded roughly 36–37 hours, i.e. 5% of the month.)
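To make the math concrete, here is a quick back-of-the-envelope calculator in Python. It simply encodes the credit tiers quoted above and the hours in a 31-day month; treat it as an illustrative sketch, not an official AWS tool, and always check the figures against the current SLA text before filing.

```python
# Rough SLA credit estimator based on the tiers quoted above.
# Illustrative sketch only; confirm against the current AWS SLA text.

def monthly_uptime_pct(downtime_hours: float, hours_in_month: float = 744.0) -> float:
    """Uptime percentage for the month (744 h = 31-day month)."""
    return 100.0 * (1.0 - downtime_hours / hours_in_month)

def sla_credit_pct(uptime_pct: float) -> int:
    """Map monthly uptime to the service-credit tier from AWS's published schedule."""
    if uptime_pct >= 99.99:
        return 0          # SLA met, no credit
    if uptime_pct >= 99.0:
        return 10
    if uptime_pct >= 95.0:
        return 30
    return 100

if __name__ == "__main__":
    uptime = monthly_uptime_pct(downtime_hours=15)       # ~15 h of impairment on Oct 20
    print(f"Monthly uptime: {uptime:.2f}%")               # ~97.98%
    print(f"Credit tier:    {sla_credit_pct(uptime)}%")   # 30% for this event
    # The 100% tier requires < 95% uptime, i.e. more than ~37 h down in a 31-day month.
```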

Claim Process: Importantly, AWS does not automatically apply outage credits – you must submit a claim through AWS Support. As a CTO/CIO, ensure your team takes these steps:

  1. Document the Impact: Identify which services and resources were down (e.g. EC2 instances, RDS databases in us-east-1) and the exact times. Gather logs or monitoring reports showing errors or downtime during the outage window. AWS may require proof of the disruption affecting your specific resources.
  2. Open a Support Case: In the AWS Console, go to Support and open a case titled “AWS SLA Credit Request – [Region]” (e.g. a region-level claim for us-east-1). Specify the dates/times of the outage and the AWS services that were impacted, and include the resource IDs and your log evidence in the case notes. (If you prefer to automate this step, see the API sketch after this list.)
  3. Submit Before Deadline: SLA claims must be filed within two billing cycles of the incident. For an October outage, that means submit by end of December at the latest. So, don’t procrastinate – file the claim as soon as you have your information together.
  4. Await Confirmation: AWS will review the claim, and if approved (it should be, given this widely publicized outage), they will issue the service credit by the next billing cycle. The credit typically appears on your AWS account as a discount on future bills (or in some cases, a refund to your credit card, at AWS’s discretion).
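If you would rather script the claim than click through the console, the AWS Support API’s create_case operation (available on Business and Enterprise support plans, and served from us-east-1) can open the case for you. The sketch below is illustrative only: the subject, case body, email address, and the serviceCode/categoryCode values are placeholders you would need to adapt (use describe_services() to list the codes valid for your account).

```python
# Minimal sketch: file an SLA credit request via the AWS Support API.
# Requires a Business or Enterprise support plan; the Support API lives in us-east-1.
# serviceCode/categoryCode below are placeholders -- enumerate valid values
# with client.describe_services() and adjust before using.
import boto3

client = boto3.client("support", region_name="us-east-1")

case = client.create_case(
    subject="AWS SLA Credit Request - us-east-1 (Oct 20, 2025 outage)",
    serviceCode="billing",            # placeholder; confirm via describe_services()
    categoryCode="other",             # placeholder
    severityCode="normal",
    issueType="customer-service",
    communicationBody=(
        "Requesting SLA service credits for the us-east-1 disruption on "
        "2025-10-20 (~06:50-22:53 UTC). Affected resources: <instance/table IDs>. "
        "CloudWatch evidence of elevated error rates available on request."
    ),
    ccEmailAddresses=["cloud-team@example.com"],  # hypothetical address
    language="en",
)
print("Opened case:", case["caseId"])
```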

How Much Will You Get? Let’s walk through a realistic example. Suppose a small hedge fund runs its trading platform on AWS and spends about $50,000 per month on AWS resources in the us-east-1 region. After the October outage, their monthly uptime was only ~98%, breaching the 99.99% SLA. They file a claim, and AWS grants a 30% service credit for that month’s bill. That equates to a $15,000 credit applied to their account (30% of $50k).

While $15k is better than nothing, note that this is likely a drop in the bucket compared to the fund’s actual losses. If their algos missed market moves or they had to halt trading for hours, the business impact could far exceed the credit. This is a common theme: AWS’s service credits are “frequently insignificant” and don’t compensate for downstream losses in revenue or reputation. They’re essentially a small token. As tech lawyer Ryan Gracey put it, after such incidents “customers will be left with limited recourse” beyond these credits. In other words, you can get some cloud hosting fees knocked off, but don’t expect AWS to pay for your angry customers or missed transactions.

Nonetheless, you absolutely should claim your credit – it’s yours by right under the SLA. Just have tempered expectations. (And if you’re a huge AWS spender, consider negotiating custom SLAs in your contract; but even then, direct compensation beyond credits is rare.)

RocketEdge’s Recommendations: Building Resilience and Considering Azure

After guiding numerous clients through cloud outages, RocketEdge has a few pointed recommendations for CTOs and tech leaders to limit the damage from events like AWS’s outage:

1. Don’t Rely Solely on AWS – Multi-Cloud or At Least “Plan B” It

AWS’s October meltdown underscores a hard truth: despite its size, AWS is not infallible – and it’s arguably falling behind in the cloud innovation race. In 2025, AWS’s growth and cutting-edge offerings are lagging its rivals. For example, AWS’s revenue grew only 18% year over year in Q2 2025, significantly trailing Microsoft Azure’s 39% surge (and Google Cloud’s 32%), a gap driven largely by Azure’s aggressive investments in AI. Enterprise customers increasingly see AWS as the “safe but boring” choice, while turning to Azure for superior AI and productivity tools. We have observed that many AWS services (from AI/ML platforms to serverless dev tools) are now less feature-rich or user-friendly than competitor alternatives. Amazon’s cloud is struggling with an innovator’s dilemma – trying to retrofit its massive legacy infrastructure for the AI era – and it shows.

Our advice: consider diversifying to more modern platforms. Azure, in particular, has leapt ahead with offerings like Azure OpenAI Service (GPT integration), comprehensive ML Ops, and seamless Microsoft 365/Teams tie-ins that boost developer productivity. In practice, we find building on Azure can be far more productive for teams looking to leverage AI and advanced PaaS services, thanks to better integration and tooling. Meanwhile, AWS’s penchant for “do-it-yourself” modular services can slow teams down. Azure’s momentum (and massive 2025 growth) shows it’s currently the cloud to beat for cutting-edge workloads. Even Google Cloud, with Vertex AI and BigQuery, has made strides that outshine AWS in certain areas.

At minimum, adopt a multi-cloud strategy for critical systems. Many Wall Street firms, for instance, are now architecting key workloads to fail over to Azure or GCP when AWS has issues. Multi-cloud doesn’t have to mean double cost 24/7 – even having a cold standby environment or secondary provider for high-priority services can save your business when the primary cloud falters. If full multi-cloud is too complex, consider multi-region deployment (e.g. run active in both us-east-1 and another AWS region) to reduce reliance on one location – though remember in this outage, even multi-region AWS setups were affected due to control-plane coupling. Ultimately, cloud concentration risk is a board-level issue now; diversifying cloud providers (or using hybrid cloud with on-prem backups) is prudent for resiliency.
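One concrete pattern worth calling out is pre-provisioned DNS failover, so traffic shifts to a standby environment (another region, or another cloud) when health checks on the primary fail. Below is a minimal boto3 sketch using Route 53 failover records; the hosted zone ID, hostnames, targets, and health check ID are all hypothetical placeholders, and this is one illustrative building block rather than a complete multi-cloud design. Setting these records up ahead of time matters because making control-plane changes in the middle of an outage can itself be difficult.

```python
# Sketch: Route 53 failover records pointing a hostname at a primary endpoint
# (e.g. us-east-1) with a secondary in another region or cloud.
# Zone ID, domain, targets, and health check ID are placeholders.
import boto3

route53 = boto3.client("route53")

def upsert_failover_pair(zone_id: str, name: str,
                         primary_target: str, secondary_target: str,
                         primary_health_check_id: str) -> None:
    changes = []
    for role, target in (("PRIMARY", primary_target), ("SECONDARY", secondary_target)):
        rrset = {
            "Name": name,
            "Type": "CNAME",
            "SetIdentifier": role.lower(),
            "Failover": role,
            "TTL": 60,                      # short TTL so failover takes effect quickly
            "ResourceRecords": [{"Value": target}],
        }
        if role == "PRIMARY":
            # Route 53 only fails over when the primary's health check goes unhealthy.
            rrset["HealthCheckId"] = primary_health_check_id
        changes.append({"Action": "UPSERT", "ResourceRecordSet": rrset})

    route53.change_resource_record_sets(
        HostedZoneId=zone_id,
        ChangeBatch={"Comment": "Failover pair for cloud-outage resilience",
                     "Changes": changes},
    )

# Hypothetical usage:
# upsert_failover_pair("Z123EXAMPLE", "api.example.com.",
#                      "api-primary.us-east-1.example.com",
#                      "api-standby.azure.example.net",
#                      "hc-1234example")
```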

2. Implement Robust Monitoring and Incident Response

Don’t wait for AWS to tell you it’s down – by then you’re already in the thick of it. Set up independent monitoring that tracks your applications’ vital signs (latency, error rates, external service pings) from multiple vantage points. For example, use services like Pingdom, New Relic, Datadog, or custom scripts to continually test critical endpoints. If your system depends on AWS DynamoDB or S3, have synthetic transactions that alert you if those start failing abnormally. In the Oct 20 event, companies with good monitoring saw red flags within minutes of the DNS issues – while those relying on AWS’s status page were blind for an hour. Early detection allows you to launch your incident response playbook immediately: failover to backups, activate status communications to customers (“we’re experiencing cloud provider issues”), throttle non-critical processes, etc. Speed matters – the faster you realize “AWS is having an outage, not just us,” the more proactive (and calm) you can be.
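As a starting point, here is a minimal synthetic probe in Python that targets the exact failure mode of Oct 20: it checks that the regional DynamoDB endpoint still resolves in DNS and that a cheap API call completes within a tight timeout. It is a sketch, not a substitute for a full monitoring stack, and the alert() hook is a placeholder for whatever paging tool you use.

```python
# Minimal synthetic check for the failure mode seen on Oct 20:
# 1) does the regional DynamoDB endpoint still resolve in DNS?
# 2) does a trivial, cheap API call succeed within a tight timeout?
# Wire alert() into your paging tool of choice (placeholder here).
import socket
import boto3
from botocore.config import Config

ENDPOINT = "dynamodb.us-east-1.amazonaws.com"

def alert(message: str) -> None:
    print(f"ALERT: {message}")  # placeholder: send to PagerDuty/Slack/etc.

def check_dns(host: str) -> bool:
    try:
        socket.getaddrinfo(host, 443)
        return True
    except socket.gaierror as exc:
        alert(f"DNS resolution failed for {host}: {exc}")
        return False

def check_api() -> bool:
    client = boto3.client(
        "dynamodb",
        region_name="us-east-1",
        config=Config(connect_timeout=2, read_timeout=3, retries={"max_attempts": 1}),
    )
    try:
        client.list_tables(Limit=1)   # cheap call that exercises the endpoint
        return True
    except Exception as exc:          # broad catch is fine for a probe
        alert(f"DynamoDB API probe failed: {exc}")
        return False

if __name__ == "__main__":
    # Run this every minute from somewhere *outside* the region being monitored.
    if check_dns(ENDPOINT):
        check_api()
```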

Equally important is practicing your incident response. Conduct game days or disaster drills where you simulate a cloud provider outage. For instance: “What if AWS us-east-1 goes down – can we serve even a read-only version of our app from another region or our on-prem datacenter?” Identify the gaps before a real outage. Teams that had run such drills might have discovered, say, that a DNS failure could be bypassed by manually configuring host files, or that they could quickly launch critical databases in another region. Have runbooks ready for various failure scenarios, including contact info for AWS support managers, instructions to deploy from infrastructure-as-code to a new region, and so on. Your engineers should know the manual fallback procedures for cloud services (if any exist). And ensure your on-call staff know how to reach real humans at AWS – during the outage, AWS support was swamped, but enterprise customers with TAM (Technical Account Manager) contacts had somewhat better lines of communication.

3. Architect for Resilience (Best Practices)

Finally, it’s essential to continuously fortify your cloud architecture following best practices. Cloud outages will happen; design such that you limit the blast radius and recover rapidly. Here are key principles:

  • Use Multi-AZ and Multi-Region Deployments: If you’re on AWS, distribute your workloads across multiple Availability Zones so a single data center failure won’t take you down. Even better, run active-active in two regions (e.g. us-east-1 and us-west-2 or AWS + Azure) so that if one region has issues, the other can handle traffic. In this outage, some global services failed anyway, but many customer architectures that were multi-region saw shorter disruptions or could reroute around the DNS failure after some tweaking.
  • Avoid Single Points of Dependency: Identify where you rely on a single cloud service with no alternative. For example, many apps relied solely on AWS’s Route 53 or internal DNS – a failure there was catastrophic. Consider backup DNS providers (some companies dynamically switched DNS to alternate providers during the outage). If you use a managed database (like DynamoDB) that could become unreachable, have at least read replicas or periodic backups in a different region or a plan to switch to a secondary data store in emergencies. This also applies to third-party APIs – if your product can’t function without some external API, cache responses or have degraded-mode logic.
  • Keep Software Up-to-Date: Often outages reveal bugs or limitations that are addressed in newer versions of software. Ensure your systems (OS, libraries, cloud SDKs) are regularly updated to benefit from resiliency improvements. For instance, after previous outages, AWS enhanced some SDKs to handle DNS timeouts better – but you’d only benefit if you upgraded your SDK version. Likewise, apply AWS’s new features – e.g. if AWS offers an option to use Multi-Region Access Points for S3 or Global Databases for RDS to improve resilience, evaluate adopting them. In the Eight Sleep case, they’re pushing a firmware update to enable local control – a lesson that updates can literally be lifesavers (or at least sleep-savers). Don’t postpone deploying upgrades that add offline functionality or better failover.
  • Follow Well-Architected Best Practices: AWS’s Well-Architected Framework, Azure’s Architecture Center, and Google’s Cloud Architecture guidelines all emphasize reliability design. Use those checklists. Implement circuit breakers in your application – if a dependency is failing (e.g. calls to DynamoDB hang), your app should time out quickly and perhaps serve a cached response instead of hanging entirely. Use exponential backoff in retries to avoid overload during partial outages (a minimal sketch of both patterns follows this list). And log abundantly – in an outage, logs and metrics are your lifeline for pinpointing what’s broken.
  • Invest in Backup and DR (Disaster Recovery): Ensure you have recent backups of critical data in a separate location. Outage or not, data integrity is paramount. Some AWS outages in the past have caused data loss (fortunately not this DNS issue), so a robust backup practice is non-negotiable. Simulate a scenario where you restore your whole environment from scratch – how long would it take? The quicker you can rebuild or fail over, the less an outage can hurt you.
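As promised above, here is a minimal, framework-free sketch of the circuit-breaker-with-backoff pattern. The thresholds and sleep values are arbitrary placeholders; in production you would typically reach for a hardened library, but the sketch shows the core idea of failing fast and serving a cached or degraded response instead of hanging.

```python
# Minimal circuit breaker with exponential backoff + jitter (illustrative sketch;
# thresholds are arbitrary -- prefer a hardened library in production).
import random
import time

class CircuitBreaker:
    def __init__(self, failure_threshold: int = 5, reset_after_s: float = 30.0):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self.failures = 0
        self.opened_at = 0.0

    def _is_open(self) -> bool:
        if self.failures < self.failure_threshold:
            return False
        # After a cool-down, allow one trial call ("half-open").
        return (time.monotonic() - self.opened_at) < self.reset_after_s

    def call(self, fn, *args, fallback=None, retries: int = 3, **kwargs):
        if self._is_open():
            return fallback() if fallback else None   # fail fast, serve cached/degraded data

        for attempt in range(retries):
            try:
                result = fn(*args, **kwargs)
                self.failures = 0                      # success closes the breaker
                return result
            except Exception:
                self.failures += 1
                if self.failures >= self.failure_threshold:
                    self.opened_at = time.monotonic()  # trip the breaker
                    break
                # Exponential backoff with jitter: ~0.2s, 0.4s, 0.8s, ... (capped at 5s)
                time.sleep(min(5.0, 0.2 * (2 ** attempt)) * random.uniform(0.5, 1.5))

        return fallback() if fallback else None

# Hypothetical usage: wrap a flaky dependency and serve a cached value when it trips.
# breaker = CircuitBreaker()
# item = breaker.call(fetch_from_dynamodb, key, fallback=lambda: cache.get(key))
```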

In summary, architect with the mindset that any single component (or cloud!) will fail at some point. AWS’s October event exposed how fragile a supposedly robust system can be when a hidden corner (DNS) breaks. By preparing for failure, you turn a cloud outage from a business-crippling disaster into a manageable inconvenience.

4. Reevaluate Your Cloud Strategy – Is AWS Still the Right Partner?

This is a tough question, but leaders must ask it. AWS has long been the market leader, yet recent trends show it losing its edge. It’s not just the outages (though a few major ones have occurred in the past five years); it’s also about innovation and support. Microsoft and Google have been outpacing AWS in AI-driven cloud services and perhaps in customer support responsiveness. If your business is heavily invested in AI, analytics, or high-productivity developer environments, Azure (or even a hybrid multi-cloud approach) may yield better results. Azure’s integration with enterprise tools and its focus on AI make it a formidable choice. We’ve found that clients leveraging Azure’s ecosystem (e.g. using Azure OpenAI for advanced analytics, Power Platform for low-code needs, etc.) can iterate faster than on AWS’s more siloed services.

At RocketEdge, we remain cloud-agnostic but candid: in 2025, AWS is no longer the no-brainer pick for every use case. Evaluate your priorities – if uptime and support are paramount, diversify providers. If AI capabilities are key, look at Azure or Google. AWS still has strengths (breadth of services, global footprint), but if they’re “the safe but boring choice” and can’t keep up with your innovation needs, don’t be afraid to migrate critical workloads. Cloud migrations are non-trivial, but the cost of not moving (in terms of potential downtime or slower progress) might be higher.

Conclusion

The AWS outage of Oct 20, 2025 was a wake-up call. It reminded us that even the mightiest cloud can stumble due to a mundane error, and the fallout can be extraordinary – from billions in business disrupted to smart beds on the fritz. The silver lining is that AWS’s SLA does entitle customers to some credits; for a typical small hedge fund or startup, that might mean thousands of dollars off your bill (make sure to claim it!). But far more valuable is the lesson in resilience. Use this event to justify improvements: build that multi-cloud failover, implement better monitoring, update that contingency playbook. And critically assess whether your cloud provider is meeting your needs or if it’s time to explore alternatives that offer a stronger future roadmap.

Ultimately, cloud outages are inevitable – but how prepared your organization is, and which partners you trust, will determine whether an outage is a mere inconvenience or a catastrophic hit. By taking the steps above, you can greatly limit the blast radius of the next cloud outage and keep your business running come hell or high water (or DNS failures). After all, in the 24/7 digital economy, sleeping soundly at night might just require both a backup cloud… and maybe an old-fashioned mattress that doesn’t connect to the internet.
