I have a sticker on my laptop (from StickerMule) reminding me that “The cloud is just other people’s computers.” There is no cloud magic. If you extend your global Financial Services operations into the cloud, that extension needs to be clearly and verifiably aligned with your risk management practices, your compliance obligations, your contracts, and the assumptions of your various constituencies. That is a tall order. Scan the rest of this short outline, and then remember to critically evaluate the claims of the hypesters & hucksters who sell “cloud” as the solution to virtually any of your challenges.
Amazon reminded all of us of that fact this week, when maintenance on some of their cloud servers cascaded into a much larger service outage lasting more than two hours.
No data breach. No hack. Nothing that suggests hostile intent. Just a reminder that the cloud is a huge, distributed pile of “other people’s computers.” Those computers come with all the hardware and software engineering, operations, and life-cycle management challenges that your staff face in their own data centers. A key difference, though, is that they also operate at fantastic scale, are massively shared, and their architecture & operations may not align with global Financial Services norms and obligations.
Amazon reported that the following services were unavailable for up to two and a half hours on Tuesday morning, 28 February 2017:
- S3 storage
- The S3 console
- Amazon Elastic Compute Cloud (EC2) new instance launches
- Amazon Elastic Block Store (EBS) volumes
- AWS Lambda
This resulted in major customer outages.
Here is how Amazon described the outage:
- “…on the morning of February 28th. The Amazon Simple Storage Service (S3) team was debugging (a billing system) issue…”
- “At 9:37AM PST, an authorized S3 team member using an established playbook executed a command which was intended to remove a small number of servers for one of the S3 subsystems that is used by the S3 billing process.”
- “Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended.”
- “The servers that were inadvertently removed supported two other S3 subsystems.”
- “One of these subsystems, the index subsystem, manages the metadata and location information of all S3 objects in the region. This subsystem is necessary to serve all GET, LIST, PUT, and DELETE requests.”
- “The second subsystem, the placement subsystem, manages allocation of new storage and requires the index subsystem to be functioning properly to correctly operate. The placement subsystem is used during PUT requests to allocate storage for new objects.”
- “Removing a significant portion of the capacity caused each of these systems to require a full restart.”
- “While these subsystems were being restarted, S3 was unable to service requests.”
- “Other AWS services in the US-EAST-1 Region that rely on S3 for storage, including the S3 console, Amazon Elastic Compute Cloud (EC2) new instance launches, Amazon Elastic Block Store (EBS) volumes (when data was needed from a S3 snapshot), and AWS Lambda were also impacted while the S3 APIs were unavailable.”
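The chain of dependencies Amazon describes above is why a single mistyped command could take so much down at once: everything in the region leaned on the S3 index subsystem. One practical defense for a consumer of such a service is to avoid depending on any single endpoint. Below is a minimal sketch of retry-with-failover logic in plain Python. The function names (`primary`, `secondary`, `fetch_with_failover`) are hypothetical stand-ins for real regional S3 calls, not any actual AWS API; this illustrates the pattern only, under the assumption that a transient service error surfaces as an exception.

```python
import time

def fetch_with_failover(key, fetchers, retries=2, backoff=0.1):
    """Try each regional fetcher in order; retry transient failures.

    `fetchers` is an ordered list of callables, e.g. one per region
    (hypothetical stand-ins for GETs against us-east-1, then us-west-2).
    """
    last_error = None
    for attempt in range(retries + 1):
        for fetch in fetchers:
            try:
                return fetch(key)
            except IOError as err:  # treat IOError as a transient service error
                last_error = err
        # All endpoints failed this round; back off exponentially and retry.
        time.sleep(backoff * (2 ** attempt))
    raise last_error

# Hypothetical endpoints: the primary region is down, the secondary serves the object.
def primary(key):
    raise IOError("503 Slow Down: index subsystem restarting")

def secondary(key):
    return b"object-bytes-for-" + key.encode()

print(fetch_with_failover("reports/q1.csv", [primary, secondary]))
# → b'object-bytes-for-reports/q1.csv'
```

The design point is that the failover list, not the individual call, is the unit of reliability: the caller succeeds as long as any one replica of the data is reachable. Whether that is acceptable for a regulated Financial Services workload is exactly the alignment question this outline raises.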
There is no magic in the cloud. It is engineered and operated by people. Alignment between your corporate culture, your corporate compliance obligations, your contractual obligations, and those of your cloud providers is critical to your success in global Financial Services. If those cloud computers, and the activities of the armies of humans who manage them, are not well aligned with your needs and obligations, then you are simply depending on “hope,” one of the most feeble risk management practices. You are warned, again.
What do you think?
- “The embarrassing reason behind Amazon’s huge cloud computing outage this week,” by Brian Fung, The Washington Post, 2 March 2017.
- “Summary of the Amazon S3 Service Disruption in the Northern Virginia (US-EAST-1) Region,” Amazon Web Services.