Azure outage timeline and scope (VMs, identity, developer workflows)
Microsoft’s Azure cloud platform experienced a disruption to core enterprise operations that lasted more than 10 hours, starting at 19:46 UTC and resolving at 06:05 UTC the next day.
The incident first surfaced as a painful, distinctly control-plane failure: customers couldn’t deploy or scale virtual machines (VMs) across multiple regions. It then spread into identity-related operations, with a related platform issue impacting Managed Identities for Azure Resources in East US and West US between 00:10 UTC and 06:05 UTC. The ripple effects also briefly affected GitHub Actions, which matters because it ties the outage directly to build, release, and operational tempo, not just runtime availability.
Root cause: a storage account policy change that blocked public read access
At the center of the disruption was a policy change unintentionally applied to a subset of Microsoft-managed storage accounts, including accounts used to host virtual machine extension packages.
That detail is key: VM extensions are a common dependency in provisioning, configuration, and lifecycle operations. When the policy change blocked public read access, Azure scenarios that rely on downloading VM extension packages broke. Microsoft described the issue in its status history as a disruption to extension package downloads from Microsoft-managed storage accounts—turning what looks like “just storage policy” into widespread compute and automation failures.
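To make that failure mode concrete, here is a minimal Python sketch, using a hypothetical package URL, of the kind of anonymous download the extension pipeline depends on; once public read access is blocked on the hosting storage account, the same request returns an authorization error instead of the package.

```python
import requests

# Hypothetical URL for a VM extension package hosted in a Microsoft-managed
# storage account (illustrative only; real package URLs are resolved by the
# Azure guest agent, not by customers).
PACKAGE_URL = "https://example-managed-account.blob.core.windows.net/packages/extension.zip"

resp = requests.get(PACKAGE_URL, timeout=30)

if resp.status_code == 200:
    print(f"Downloaded {len(resp.content)} bytes")
else:
    # When public (anonymous) read access is blocked on the storage account,
    # anonymous GETs fail with an authorization-style error instead of 200,
    # which is roughly the condition the VM extension pipeline ran into.
    print(f"Download failed: HTTP {resp.status_code}")
```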
What customers saw: VM provisioning errors and lifecycle operation failures
The outage was logged under tracking ID FNJ8-VQZ. Impact wasn’t limited to “new VMs can’t be created”: failures also appeared during VM provisioning and lifecycle operations wherever extension packages needed to be fetched. When the platform can’t fetch extension packages, a lot of automation that’s normally invisible suddenly becomes the bottleneck.
This is why these incidents feel so chaotic: the underlying workload might be fine, but the systems that deploy, scale, and update workloads start failing in ways that look inconsistent across teams and regions.
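If you want to see where this surfaces on your own VMs, one option (a sketch assuming the Azure SDK for Python, with hypothetical resource names) is to read the VM instance view, which reports per-extension provisioning status:

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.compute import ComputeManagementClient

# Hypothetical names; substitute your own subscription, resource group, and VM.
SUBSCRIPTION_ID = "00000000-0000-0000-0000-000000000000"
RESOURCE_GROUP = "my-rg"
VM_NAME = "my-vm"

compute = ComputeManagementClient(DefaultAzureCredential(), SUBSCRIPTION_ID)

# The instance view reports per-extension provisioning status, which is where
# failed extension package downloads tend to show up first.
view = compute.virtual_machines.instance_view(RESOURCE_GROUP, VM_NAME)

for ext in view.extensions or []:
    for status in ext.statuses or []:
        print(f"{ext.name}: {status.code} - {status.message}")
```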
Downstream service impact: AKS, Azure DevOps, and GitHub Actions pipeline failures
Once extension downloads degraded, the impact spilled into services that depend on those packages:
Azure Kubernetes Service (AKS) provisioning and extensions
AKS users experienced failures in:
- Node provisioning
- Extension installation
In practical terms, this blocks scaling events and cluster operations that organizations often rely on during traffic shifts or incident response. If you can’t add nodes or install required extensions, you’re stuck in place.
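As a rough illustration rather than anything Microsoft-specific, a check like the following, using the official kubernetes Python client against a hypothetical cluster, is one way teams notice a stuck scale-out: the expected nodes never appear, or they sit in a NotReady state while extension installation fails.

```python
from kubernetes import client, config

# Assumes a local kubeconfig pointing at the AKS cluster (hypothetical setup).
config.load_kube_config()
v1 = client.CoreV1Api()

ready, not_ready = [], []
for node in v1.list_node().items:
    # A node is usable only when its "Ready" condition reports "True".
    is_ready = any(
        c.type == "Ready" and c.status == "True"
        for c in (node.status.conditions or [])
    )
    (ready if is_ready else not_ready).append(node.metadata.name)

print(f"Ready nodes: {len(ready)}")
print(f"Not-ready nodes: {not_ready}")
# During this outage, a scale-out that depends on node provisioning and
# extension installation could leave expected nodes missing or stuck NotReady.
```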
CI/CD disruptions: Azure DevOps and GitHub Actions
The outage also hit developer productivity and release pipelines:
- Azure DevOps pipelines failed when tasks required VM extensions or related packages
- GitHub Actions users saw pipeline failures in similar extension/package-dependent steps
So this wasn’t just a “production runtime” story. It also disrupted the machinery teams use to fix production, ship mitigations, and move code safely.
Second failure: mitigation triggered Managed Identities authentication failures
Microsoft deployed an initial mitigation within about two hours, but that action led to a second platform issue involving Managed Identities for Azure Resources.
Customers began experiencing authentication failures when attempting to:
- Create, update, or delete Azure resources
- Acquire managed identity tokens
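For context, here is what “acquire a managed identity token” looks like in code, as a minimal sketch using the azure-identity Python library and assuming it runs on an Azure resource with a managed identity assigned; this is the call path that started returning authentication failures:

```python
from azure.identity import ManagedIdentityCredential
from azure.core.exceptions import ClientAuthenticationError

# Runs inside an Azure resource (VM, App Service, AKS pod with a managed
# identity, etc.); the scope below targets Azure Resource Manager.
credential = ManagedIdentityCredential()

try:
    token = credential.get_token("https://management.azure.com/.default")
    print(f"Token acquired, expires at {token.expires_on}")
except ClientAuthenticationError as exc:
    # During the incident, calls like this in East US and West US failed,
    # which then broke any resource operation that depended on the identity.
    print(f"Managed identity token acquisition failed: {exc.message}")
```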
Microsoft’s status history (tracking ID M5B-9RZ) noted that after the earlier mitigation, a large spike in traffic overwhelmed the managed identities platform service in East US and West US. This is the classic “fix causes surge, surge breaks adjacent control plane” pattern—especially when retry storms and backlog replays stack up.
Azure services impacted by Managed Identities disruption
The Managed Identities platform issue impacted creation and use of Azure resources with assigned managed identities, including:
- Azure Synapse Analytics
- Azure Databricks
- Azure Stream Analytics
- Azure Kubernetes Service
- Microsoft Copilot Studio
- Azure Chaos Studio
- Azure Database for PostgreSQL Flexible Servers
- Azure Container Apps
- Azure Firewall
- Azure AI Video Indexer
That list is a reminder that identity isn’t a side feature—it’s a backbone dependency. If managed identities wobble, everything from analytics to containers to security controls can degrade in ways that look unrelated on the surface.
How Azure recovered: traffic removal to repair infrastructure without load
Microsoft attempted multiple infrastructure scale-ups, but those efforts couldn’t absorb the backlog and retry volumes. Ultimately, Microsoft removed traffic from the affected service so it could repair the underlying infrastructure without load.
This is a hard-but-real operational move: when retries pile up, “adding capacity” sometimes just feeds the fire. Cutting traffic can be the only way to restore a stable base and then reintroduce load gradually.
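The same lesson applies to client-side retry behavior. A generic Python sketch of capped exponential backoff with full jitter (not tied to any particular Azure SDK) shows the kind of hygiene that keeps your own retries from piling onto an already overloaded dependency:

```python
import random
import time

def call_with_backoff(operation, max_attempts=6, base_delay=1.0, max_delay=60.0):
    """Retry a flaky operation with capped exponential backoff and full jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception as exc:  # narrow this to the real transient error type
            if attempt == max_attempts:
                raise
            # Full jitter: sleep a random amount up to the capped exponential
            # delay, so thousands of clients don't retry in lockstep and
            # re-create the traffic spike they are trying to ride out.
            delay = random.uniform(0, min(max_delay, base_delay * 2 ** (attempt - 1)))
            print(f"Attempt {attempt} failed ({exc}); retrying in {delay:.1f}s")
            time.sleep(delay)
```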
Why this kind of outage hits harder than “a website went down”
Pareekh Jain, CEO at EIIRTrend & Pareekh Consulting, summarized the practical damage: the outage didn’t just knock sites offline—it halted development workflows and disrupted real-world operations.
That’s the emotional core for a lot of teams. You can tolerate some customer-facing turbulence if you can still deploy, scale, authenticate, and run your incident playbooks. But when the control plane and identity layers are impaired, you’re suddenly fighting with one hand tied behind your back.
Cloud outages on the rise: what this incident signals about modern dependencies
The context here is broader: cloud outages have become more frequent, with major providers (including AWS, Google Cloud, and IBM) experiencing high-profile disruptions. The article points to examples like:
- AWS services impacted for more than 15 hours due to a DNS problem affecting the DynamoDB API
- Cloudflare disruptions tied to a bad configuration file in Bot Management
- Google identity and access management disruption from an invalid automated update, affecting authentication on third-party apps
Neil Shah (Counterpoint Research) connects this to a bigger trend: data center architecture is evolving under AI-driven workload demands, with more velocity, more variability, and more complexity. That complexity lengthens dependency chains, so a control-layer misconfiguration can cascade quickly.
What CIOs and IT leaders can do next: resilience actions during and after hyperscale incidents
This event reinforces that “wait it out” isn’t much of a strategy when hyperscale dependencies fail. The article outlines practical resilience guidance, especially for CIOs, framed around stabilize, prioritize, and communicate.
Stabilize: treat it like a formal cloud incident
Jain recommends:
- Declare a formal cloud incident with a single incident commander
- Determine whether the issue affects control-plane operations or running workloads
- Freeze non-essential changes (deployments, infrastructure updates)
This matters because control-plane incidents behave differently than workload incidents. Freezing changes reduces self-inflicted damage when the platform is already unstable.
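One way to make that determination quickly is to probe the two layers separately: hit your own workload endpoint for the data plane and make a simple Azure Resource Manager read for the control plane. The snippet below is a generic Python sketch with hypothetical names, not a complete health check:

```python
import requests
from azure.identity import DefaultAzureCredential
from azure.mgmt.resource import ResourceManagementClient

# Hypothetical values; replace with your own endpoint and subscription.
WORKLOAD_URL = "https://my-app.example.com/healthz"
SUBSCRIPTION_ID = "00000000-0000-0000-0000-000000000000"

# Data plane: is the running workload still serving?
try:
    data_plane_ok = requests.get(WORKLOAD_URL, timeout=10).status_code == 200
except requests.RequestException:
    data_plane_ok = False

# Control plane: can we still talk to Azure Resource Manager?
try:
    arm = ResourceManagementClient(DefaultAzureCredential(), SUBSCRIPTION_ID)
    next(iter(arm.resource_groups.list()), None)
    control_plane_ok = True
except Exception:  # broad on purpose; this is a coarse triage probe
    control_plane_ok = False

print(f"data plane ok: {data_plane_ok}, control plane ok: {control_plane_ok}")
# Workloads fine while the control plane fails is the pattern from this
# incident: freeze non-essential changes and protect what is already running.
```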
Prioritize restoration: protect customer run paths and keep delivery moving
The next move is to prioritize restoration by protecting customer-facing “run paths,” including:
- Traffic serving
- Payments
- Authentication
- Support
And if CI/CD is impacted:
- Shift critical pipelines to self-hosted or alternate runners
- Queue releases behind a business-approved gate
The point isn’t to keep shipping at all costs. It’s to keep the right fixes and operational changes moving while the platform is degraded.
Communicate and contain: predictable updates and pre-approved templates
Jain also recommends:
- Regular internal updates stating impacted services, workarounds, and next update time
- Activating pre-approved customer communication templates if external impact is likely
When control-plane and identity services are unstable, confusion becomes its own outage. Tight communication reduces time wasted on guesswork.
Longer-term architecture guidance: hybrid, multi-cloud, redundancy, and lean pipelines
Shah’s recommendations focus on reducing blast radius:
- Diversify workloads across cloud service providers or go hybrid
- Add necessary redundancies
- Keep CI/CD pipelines lean and modular
- Think carefully about real-time vs non-real-time scaling strategies for crucial services
- Maintain operational visibility of hidden dependencies and plan mitigations for what can be impacted
The throughline is dependency management. Outages like this expose the places where teams assumed “that service is always there.”
Q&A: Azure outage impact, managed identities, and resilience planning
Q1) What triggered the Azure VM deployment and scaling failures?
A policy change unintentionally applied to Microsoft-managed storage accounts blocked public read access, disrupting downloads of VM extension packages and causing VM provisioning and lifecycle failures.
Q2) Why did Managed Identities for Azure Resources fail after the initial mitigation?
Following the earlier mitigation, a large spike in traffic overwhelmed the managed identities platform service in East US and West US, leading to authentication failures for managed identity token acquisition and resource operations.
Q3) Which enterprise services were impacted by the Managed Identities disruption?
Impacted services included Azure Synapse Analytics, Azure Databricks, Azure Stream Analytics, AKS, Microsoft Copilot Studio, Azure Chaos Studio, Azure Database for PostgreSQL Flexible Servers, Azure Container Apps, Azure Firewall, and Azure AI Video Indexer.

