The sales job included a theme that the Internet is so decentralized that anything named cloud simply inherited all the good ‘ilities’ that massive decentralization might deliver. One of the core tenets of that Internet story-telling is about the resiliency that decentralization guarantees — as in: no single outage can cause problems for your cloud-delivered services.
The reality is that story is too often just nonsense. Google, Amazon, and Microsoft control enormous amounts of the infrastructure and operations upon which “cloud” services depend. When any of these three companies has an outage, broad swaths of “cloud-delivered” business services experience outages as well. If you are going to depend on cloud-enabled operations for your success, use care in how you define that success, and how your communicate about service levels with your customers. This is especially problematic for industries where customers have been trained to expect extremely high service levels — global financial services enterprises fall into this category. Factor this risk into your business & technical plans.
Yesterday, Tuesday, July 17th, Google’s services that provide computing storage and data management tools for companies failed. The customers who depended upon those foundational services also failed. Snapchat and Spotify were two high profile examples, but the outages were far more widespread. Google services that also depend upon the storage and data management tools also failed. It appears that Google Cloud Networking was knocked down as the company reported “We are investigating a problem with Google Cloud Global Load balancers returning 502s for many services including AppEngine, Stackdriver, Dialogflow, as well as customer Global Load Balancers.” This has broad impact because all customers attempting to deliver high quality services depend on load balancers. It appears that this networking issue caused downstream outages like those experienced by Breitbart and Drudge Report.
This was the 5th non-trivial outage this year for Google’s AppEngine. Google’s ComputeEngine has had 7 outages this year, 12 for their Cloud Networking, 8 for their Stackdriver, 3 for Google Cloud Console, 3 for Cloud Pub/Sub, 4 for Google Kubernetes Engine, 2 for Cloud Storage, 4 for BigQuery, 2 for their DataStore, 2 for their Cloud Developer Tools, and 1 reported for their Identity & Security services.
To add insult, it appears that the Google Enterprise Support page may have also not working during the outage on the 17th…
Amazon also experienced service failures that were widely reported the day before (although they are not documented on the company’s service health dashboard).
Microsoft had “internal routing” failures that resulted in widespread service outages for over an hour on the 16th as well.
Plan this reality into your cloud-enabled strategies, architectures, designs, implementations, testing, monitoring, reporting, and contracts.
Google Cloud Status Dashboard
Amazon Service Health Dashboard
Google Cloud Has Disruption, Bringing Snapchat, Spotify Down
By Mark Bergen, July 17, 2018
Spotify, Snapchat, and more are down following Google Cloud incident (update: fixed)
Jeff Grubb, JULY 17, 2018
Google Cloud Platform fixes issues that took down Spotify, Snapchat and other popular sites
Chloe Aiello, 07-17-2018
[Update: Resolved] Google Cloud has been experiencing an outage, resulting in widespread problems with several services
By Ryne Hager, 07-17-2018
Google Enterprise Support page Outage Reference