The sales job included a theme that the Internet is so decentralized that anything named cloud simply inherited all the good ‘ilities’ that massive decentralization might deliver. One of the core tenets of that Internet story-telling is about the resiliency that decentralization guarantees — as in: no single outage can cause problems for your cloud-delivered services.
The reality is that story is too often just nonsense. Google, Amazon, and Microsoft control enormous amounts of the infrastructure and operations upon which “cloud” services depend. When any of these three companies has an outage, broad swaths of “cloud-delivered” business services experience outages as well. If you are going to depend on cloud-enabled operations for your success, use care in how you define that success, and how your communicate about service levels with your customers. This is especially problematic for industries where customers have been trained to expect extremely high service levels — global financial services enterprises fall into this category. Factor this risk into your business & technical plans.
Yesterday, Tuesday, July 17th, Google’s services that provide computing storage and data management tools for companies failed. The customers who depended upon those foundational services also failed. Snapchat and Spotify were two high profile examples, but the outages were far more widespread. Google services that also depend upon the storage and data management tools also failed. It appears that Google Cloud Networking was knocked down as the company reported “We are investigating a problem with Google Cloud Global Load balancers returning 502s for many services including AppEngine, Stackdriver, Dialogflow, as well as customer Global Load Balancers.” This has broad impact because all customers attempting to deliver high quality services depend on load balancers. It appears that this networking issue caused downstream outages like those experienced by Breitbart and Drudge Report.
This was the 5th non-trivial outage this year for Google’s AppEngine. Google’s ComputeEngine has had 7 outages this year, 12 for their Cloud Networking, 8 for their Stackdriver, 3 for Google Cloud Console, 3 for Cloud Pub/Sub, 4 for Google Kubernetes Engine, 2 for Cloud Storage, 4 for BigQuery, 2 for their DataStore, 2 for their Cloud Developer Tools, and 1 reported for their Identity & Security services.
To add insult, it appears that the Google Enterprise Support page may have also not working during the outage on the 17th…
Amazon also experienced service failures that were widely reported the day before (although they are not documented on the company’s service health dashboard).
Microsoft had “internal routing” failures that resulted in widespread service outages for over an hour on the 16th as well.
Plan this reality into your cloud-enabled strategies, architectures, designs, implementations, testing, monitoring, reporting, and contracts.
July 31, 2018 UPDATE:
One of my peers asked about the ability of legacy financial services IT shops to deliver ‘cloud-like’ service levels, arguing that all of our global financial services enterprises have material entrenched IT infrastructure that has service level issues as well. It seems like a fair reaction, so here is my response:
My motivation for the essay above was driven by several observations:
(1) There are still individuals in our industry who seem to believe there is some magical power available to ‘cloud’ things that will make existing problems and constraints disappear. These individuals make me tired. I tried to highlight a little of cloud vendor’s muggle nature.
(2) The three main cloud vendors I mention — Amazon, Google and Microsoft — provide what I perceive as selective and self-serving views into their service outages. They have outages, short and not-so-short, that are unreported on all of their outage-reporting interfaces available to me. That behavior seems to be designed into their development and infrastructure management practices as well as their sales/marketing practices. When they fail fast, it is on customer’s time and investments (ours), not only on their own (and, I believe, rapid recovery skills do not justify or compensate for any given outage…). I mentioned Amazon on this score, but I my reading helps me believe that all three exhibit this behavior.
(3) We all have ‘knobs’ or ‘levers’ available to us at our corporations to influence our service levels in ways that might help us distinguish ourselves from our competition. My working understanding is that service levels are just an expression of management will. I get it that those responsible for serious profit/loss decision-making have a fiendishly difficult role. That said, if it were important-enough, our leaders would adjust the available management ‘knobs’ in ways that do/can/would deliver whatever was needed — on our systems or those owned & managed by others. I understand that there are a hornet’s nest of competing priorities and trade-offs that make those decisions tricky. I also know that many of those ‘knobs’ are not available to our management teams in analogous ways for most of our ‘cloud’ vendor candidates.
Some argue that there are the right drivers to replace our systems with ‘cloud-enabled’ vendor services in combination with some amount of our code and configuration. To live out that desire, it seems like most of us would need to re-architect much of our business practices and in parallel the stacks of systems that enable them if we were expecting to deliver competitive sets of features and accompanying service levels in the global financial services enterprise arena. A material chunk of that re-architecting would need to compensate for the outage patterns exhibited by the ecosystem of major and minor players involved — and because this is my blog, to compensate for the new information security risk management challenges that path presents as well.
Google Cloud Status Dashboard
Amazon Service Health Dashboard
Azure Status History
Google Cloud Has Disruption, Bringing Snapchat, Spotify Down
By Mark Bergen, July 17, 2018
Spotify, Snapchat, and more are down following Google Cloud incident (update: fixed)
Jeff Grubb, JULY 17, 2018
Google Cloud Platform fixes issues that took down Spotify, Snapchat and other popular sites
Chloe Aiello, 07-17-2018
[Update: Resolved] Google Cloud has been experiencing an outage, resulting in widespread problems with several services
By Ryne Hager, 07-17-2018
Google Enterprise Support page Outage Reference