Router Update Cripples Microsoft’s Online Services

Only a single IP address was supposed to be changed. The command used to do so paralyzed parts of Microsoft’s Wide Area Network via one router. Systems for automatic error correction were affected as well.

Microsoft has explained the cause of last week’s outages of Teams, Exchange Online, Outlook and other services, which lasted several hours. According to the statement, a router update left Azure, Microsoft 365 and the Power Platform temporarily inaccessible to many customers.

Before the disruptions began, Microsoft had advised customers that a planned update could lead to higher latency when accessing Azure, Microsoft 365 and Power BI. As the business day began in Europe, however, it became apparent that the update was disrupting the Microsoft Wide Area Network (WAN) and breaking connections between individual services in Microsoft’s data centers.

It all starts with a new IP address

“We found that a change to the Microsoft Wide Area Network (WAN) affected connectivity between clients on the Internet and Azure, connectivity between regions, and cross-site connectivity via ExpressRoute,” Microsoft has now announced. “As part of a planned change to update the IP address on a WAN router, a command given to the router caused it to send messages to all other routers on the WAN, which caused all routers to recalculate their forwarding tables. During this recalculation, the routers were unable to correctly forward packets traversing them. The command that caused the problem behaves differently on different network devices, and the command had not been validated with our full qualification process on the router on which it was executed.”

Microsoft said the change also affected the systems that automatically manage the WAN and that identify and remove devices that are not operating properly. The system that optimizes traffic flows on the network was affected as well, the company said. This led to further packet loss and required a manual restart of those systems.

To prevent similar incidents, Microsoft is now “blocking the execution of commands that have a large impact on devices.” In addition, every command executed on network devices must now follow the company’s safe change guidelines.