Apple iOS Vuln via Mail

April 26, 2020

ZecOps announced a collection of iOS vulnerabilities in the iOS Mail app that enable hostile agents to run arbitrary code and to delete messages, and that have been present since at least iOS 6…
So far, this has been described as a set of Out-Of-Bounds write and Heap-Overflow vulnerabilities that are being used against targeted high-value endpoints. My interpretation of their detailed write-up is that this qualifies as a remote, anonymous, arbitrary code execution vulnerability. As such, even if the attack must be targeted and may not be ‘easy’, global financial services organizations face some number of determined adversaries, so we need to take it seriously.

Apple responded by rejecting the idea that this represented an elevated risk for consumers because their security architecture was not violated and they found no evidence of impact to their customers — while engineering a fix that will be rolled out soon. Is it time to factor this elevated risk behavior (reject-but-fix) into our threat models?

The ZecOps headline was:

“The attack’s scope consists of sending a specially crafted email to a victim’s mailbox enabling it to trigger the vulnerability in the context of iOS MobileMail application on iOS 12 or maild on iOS 13. Based on ZecOps Research and Threat Intelligence, we surmise with high confidence that these vulnerabilities – in particular, the remote heap overflow – are widely exploited in the wild in targeted attacks by an advanced threat operator(s).”

For global financial services enterprises, the presence of hundreds of billions, even trillions of dollars in one or another digital form seems to make this risk rise to the level of relevance. This is especially true because of the effectiveness of Apple’s marketing techniques across broad categories of roles expected to populate our organizations — i.e., our staff and leaders often use Apple devices.

On one front, “Apple’s product security and the engineering team delivered a beta patch (to ZecOps) to block these vulnerabilities from further abuse once deployed to GA.”

On another front, Apple publicly rejected the ZecOps claims about finding evidence of the exploit being used, saying the three issues “are insufficient to bypass iPhone and iPad security protections, and we have found no evidence they were used against customers.” If I read this assertion carefully and in the context of potential future legal action or sales headwinds, it does not inspire confidence that the vulnerabilities were not real-and-exploitable-as-described — only that Apple rejects some narrowly-crafted subset of ZecOps’ announcement/analysis and that they still stand behind the effectiveness of some subset of the iOS architecture.

Apple’s full statement:

“Apple takes all reports of security threats seriously. We have thoroughly investigated the researcher’s report and, based on the information provided, have concluded these issues do not pose an immediate risk to our users. The researcher identified three issues in Mail, but alone they are insufficient to bypass iPhone and iPad security protections, and we have found no evidence they were used against customers. These potential issues will be addressed in a software update soon. We value our collaboration with security researchers to help keep our users safe and will be crediting the researcher for their assistance.”

The Apple echo-chamber kicked in to support the rejection in its most comprehensive and positive interpretation…

ZecOps’ summary of their findings includes (quoted):

  • The vulnerability allows remote code execution capabilities and enables an attacker to remotely infect a device by sending emails that consume significant amount of memory
  • The vulnerability does not necessarily require a large email – a regular email which is able to consume enough RAM would be sufficient. There are many ways to achieve such resource exhaustion including RTF, multi-part, and other methods
  • Both vulnerabilities were triggered in-the-wild
  • The vulnerability can be triggered before the entire email is downloaded, hence the email content won’t necessarily remain on the device
  • We are not dismissing the possibility that attackers may have deleted remaining emails following a successful attack
  • Vulnerability trigger on iOS 13: Unassisted (/zero-click) attacks on iOS 13 when Mail application is opened in the background
  • Vulnerability trigger on iOS 12: The attack requires a click on the email. The attack will be triggered before rendering the content. The user won’t notice anything anomalous in the email itself
  • Unassisted attacks on iOS 12 can be triggered (aka zero click) if the attacker controls the mail server
  • The vulnerabilities exist at least since iOS 6 – (issue date: September 2012) – when iPhone 5 was released
  • The earliest triggers we have observed in the wild were on iOS 11.2.2 in January 2018

Like any large-scale software vendor, Apple fixes a lot of bugs and flaws. I am not highlighting that as an issue.  A certain number of bugs & flaws is expected in large-scale development efforts.  I think that it is important to keep in mind that iOS devices are regularly found in use in safety and critical infrastructure operations, increasing the importance of managing the software lifecycle in ways that minimize the number, scope and nature of bugs & flaws that make it into production.

Apple has a history of enthusiastically rejecting the announcement of some interesting and elevated risk vulnerabilities using narrowly crafted language that would be likely to stand up to legal challenge while concurrently rolling out fixes — which often seems like a pretty overt admission of a given vulnerability.
This behavior leaves me thinking that Apple has created a corporate culture that impairs their ability to do effective threat modeling.  From the outside, it increasingly appears that Apple’s iOS trust boundaries are expected to match the corporation’s marketing expressions of their control architecture — ‘the happy path’ where formal iOS isolation boundaries matter only in the ways publicly described to consumers, and where other I/O channels are defined out of what matters… If I am even a little correct, that cultural characteristic needs to be recognized and incorporated into our risk management practices.

Given the scale of their profits, Apple has tremendous resources that could be devoted to attack surface definition, threat modeling, and operational verification of their assumptions about the same. Many types of OOB Write and Heap-Overflow bugs are good targets for discovery by fuzz testing as well. Until recently I would have assumed that by this point in iOS & iPhone/iPad maturation, Apple had automation in place to routinely, regularly & thoroughly fuzz obvious attack vectors like inbound email message flow in a variety of different ways and at great depth.
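To make that concrete, here is a minimal sketch of the general shape of coverage-guided fuzzing applied to inbound message parsing. The target (Python’s stock MIME parser, driven by Google’s atheris fuzzer) is an illustrative stand-in, not Apple’s closed MobileMail code; native parsers would more likely be exercised with libFuzzer or AFL against the real MIME/RTF handling.

```python
# Sketch: coverage-guided fuzzing of an email/MIME parser with atheris.
# The parser under test (Python's email package) is a stand-in target;
# the same harness shape applies to any message-ingestion code path.
import sys

import atheris

with atheris.instrument_imports():
    import email
    from email import policy


def TestOneInput(data: bytes) -> None:
    try:
        # Parse the mutated bytes as a full message, then walk the parts
        # so multipart and similar nested structures are actually exercised.
        msg = email.message_from_bytes(data, policy=policy.default)
        for part in msg.walk():
            part.get_content_type()
    except Exception:
        # Clean parser rejections are fine; fuzzing is hunting for the
        # crashes and hangs that indicate OOB-write/heap-overflow bugs.
        pass


if __name__ == "__main__":
    atheris.Setup(sys.argv, TestOneInput)
    atheris.Fuzz()
```

Seeding the corpus with real multipart and RTF messages, and running the harness continuously, is the “routinely, regularly & thoroughly” part.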

This pattern of behavior has been exhibited long enough and consistently enough that it seems material for global financial services enterprises. So many of our corporations support doing large amounts of our business operations on iDevices. I think that we need to begin to factor this elevated risk behavior into our threat models. What do you think?

REFERENCES:
“You’ve Got (0-click) Mail!” By ZecOps Research Team, 2020-04-20
https://blog.zecops.com/vulnerabilities/youve-got-0-click-mail/

“Apple’s built-in iPhone mail app is vulnerable to hackers, research says.” By Reed Albergotti, 2020-04-23
https://www.washingtonpost.com/technology/2020/04/23/apple-hack-mail-iphone/

“Apple downplays iOS Mail app security flaw, says ‘no evidence’ of exploits — ‘These potential issues will be addressed in a software update soon’” By Jon Porter, 2020-04-24
https://www.theverge.com/2020/4/24/21234163/apple-ios-ipados-mail-app-security-flaw-statement-no-evidence-exploit


Breach May Indicate Quality Management Weaknesses

February 26, 2020

There is a new reason for concern about facial recognition technology, surveillance, and the error & bias inherent in their use.  The quality of the applications that make up these systems may be less well managed than one might assume or hope.

Clearview AI, a startup that has scraped billions of photos from social media platforms to power its facial recognition technology, reported that:

…an intruder “gained unauthorized access” to its list of customers, to the number of user accounts those customers had set up, and to the number of searches its customers have conducted.
…that there was “no compromise of Clearview’s systems or network.”

Tor Ekeland, an attorney for the company, said what I read as the equivalent of ‘trust us & don’t worry about it’:

“Security is Clearview’s top priority, unfortunately, data breaches are part of life in the 21st century. Our servers were never accessed. We patched the flaw, and continue to work to strengthen our security.”

The company sells its services to hundreds of law-enforcement agencies & others.  The New York Times reported that Clearview’s app is being used by police to identify victims of child sexual abuse.

In one of their services, a user uploads a photo of a person, and the application replies with links to Internet-accessible photos on platforms where Clearview scraped them.  In another (not yet a public product), it appears that there are interfaces to augmented reality devices so the user might be able to identify every person they saw.

So, what could go wrong?

Based on the available reporting and their lawyer’s statements, my assumptions include:

  • The company amasses billions of images of human faces along with metadata about each — which includes, but is not limited to, links to the original hosting location on-line.
  • The company sells their services to policing and security-related organizations world-wide.
  • Something went seriously wrong with the way that their application (and/or infrastructure) enforced access control — leading me to believe that the company has ineffective secure coding and/or secure code analysis practices.
  • The company states that we should accept their assertion that breaches of Clearview’s applications are just a part of doing business.

Application quality and management attitude/values matter.

Because Clearview’s decisions about which photos of given human faces are associated with other photos representing the same individual can be used to identify criminal suspects, those decisions carry more or less weight in criminal investigations and the subsequent litigation & imprisonment…  If Clearview AI points an investigator to the wrong individual, the consequences can be extreme.  In that context — because we should not expect or tolerate unfounded or mistaken arrest or imprisonment — weak or otherwise ineffective application architecture, design, or implementation should be strongly resisted.  To me, nothing in Clearview’s public statements about the breach inspires confidence that they have that mandate-for-quality in their company’s DNA (you may read their statements differently).

Ineffective application development (security issues are one facet) can result in almost any kind of flaw — some of which could result in incidental or systemic errors matching photos.  This has happened before — as there have been examples of widely-used face-matching AI implementations being materially less accurate on images associated with a given race or gender.

There are other risks.  When used by some individuals (authorized or not), it seems reasonable to assume that Clearview’s system(s) will be used in ways that result in blackmail, coercion, or other types of attacks/threats.  This is not to imply that the company designed it for those purposes, just that it seems like a good fit.  (We tolerate the sale of hand guns, axes, and steak knives even though they can also play a key role in blackmail, coercion, or other types of attacks/threats.)  In part because of its global access and the ability of a hostile party to remain largely ‘unseen’, attacks that use Clearview’s applications are materially different from those other weapons.

In global financial services enterprises we deal with constant oversight of our risk management practices.  The best teams seem to be organized in ways that enhance the probability of strong and effective attack resistance over time — tolerating the challenges of evolving features, technology, operations, and attacks.  In my experience, it is often relatively easy to identify this type of team…

That is one end of a broad continuum of quality management applicable to any industry.  Some teams exist elsewhere on that continuum, and it is not always easy to peg where that might be for given organizations.  In the public facts and company statements associated with the recent Clearview breach, it does not look like they occupy the location on that continuum that we would hope.

REFERENCES:

“Facial-Recognition Company That Works With Law Enforcement Says Entire Client List Was Stolen.” By Betsy Swan, Feb. 26, 2020
https://www.thedailybeast.com/clearview-ai-facial-recognition-company-that-works-with-law-enforcement-says-entire-client-list-was-stolen

“Clearview AI has billions of our photos. Its entire client list was just stolen.” By Jordan Valinsky, February 26, 2020
https://www.cnn.com/2020/02/26/tech/clearview-ai-hack/index.html

And for some broader background:

“The Secretive Company That Might End Privacy as We Know It.” By Kashmir Hill, Published Jan. 18, 2020 and Updated Feb. 10, 2020
https://www.nytimes.com/2020/01/18/technology/clearview-privacy-facial-recognition.html
and https://www.nytimes.com/2020/02/10/podcasts/the-daily/facial-recognition-surveillance.html

“This man says he’s stockpiling billions of our photos.” By Donie O’Sullivan, Mon February 10, 2020
https://www.cnn.com/2020/02/10/tech/clearview-ai-ceo-hoan-ton-that/index.html


Facts that support a certain level of concern about security and communications products.

February 12, 2020

Read Greg Miller’s excellent history of Crypto AG:

“The intelligence coup of the century — For decades, the CIA read the encrypted communications of allies and adversaries.”
By Greg Miller, Feb. 11, 2020
https://www.washingtonpost.com/graphics/2020/world/national-security/cia-crypto-encryption-machines-espionage/

Some Financial Services organizations used encryption technology from Crypto AG as well.

(Miller’s work is supported by: Documents and research by the George Washington University National Security Archive https://nsarchive.gwu.edu/briefing-book/chile-cyber-vault-intelligence-southern-cone/2020-02-11/cias-minerva-secret
a 1997 outline of this history at https://cryptome.org/jya/nsa-sun.htm
as well as https://wikispooks.com/wiki/Crypto_AG and more)
And think about it in the context of Edward Snowden’s releases:
https://en.wikipedia.org/wiki/Edward_Snowden

What a complicated world.


Open Office Fail

January 11, 2020

Risk management in global financial services enterprises is necessarily a highly collaborative exercise.  Effective managers understand and nurture this foundational characteristic of our mission.

…And then there are the Open Office pushers.  My experience with mainstream open office workplace religion over the last ten years or so has been, well, not something that I would like to go through again.  It seems like managers caught up in the Open Office echo chamber must be driven by a range of motivations that express themselves in an odd mix of xkcd irony and dilbert-speak.

I just read an essay on the effectiveness of Open Office workplace environments.  The authors appear to do a reasonable job collecting and analyzing real & relevant data to support their observation that:

When the firms switched to open offices, face-to-face interactions fell by 70%.
Ethan Bernstein & Ben Waber in The Truth About Open Offices.

Regardless of your opinions on this topic, if you are in the risk management business, you have to understand the value of effective collaboration and the need to protect it from forces that undermine it.  Bernstein & Waber’s work on this topic is well worth your attention.  Read it now: https://hbr.org/2019/11/the-truth-about-open-offices

REFERENCES:

The Truth About Open Offices.
By Ethan Bernstein & Ben Waber
From the November–December 2019 Issue of Harvard Business Review
https://hbr.org/2019/11/the-truth-about-open-offices

 


DeepFakes Also a Threat to Corporate Brands

December 16, 2019

The corporate risk management business just keeps expanding….

Attention to ‘deepfakes’ is often mapped to elections or celebrities.

Misleading videos or audio altered using artificial intelligence are also a threat to corporate brands.

These ‘deepfakes’ can be used for direct theft, to spread information meant to move a company’s stock price, or to undermine a company’s relationship with customers.

Companies involved in businesses that depend on customers’ trust could be harmed by videos or audio highlighting fake trust-violating misdeeds.

“Symantec identified at least three successful audio attacks on companies earlier this year, where scammers impersonated CEOs or chief financial officers’ voices, requesting an urgent transfer of funds. Millions of dollars were stolen from each business, whose names were not disclosed.”

Global financial services enterprises need to build or acquire muscle in this area now.

See Cat Zakrzewski’s recent outline of the topic at: “Businesses should be watching out for deepfakes too, experts warn.”

REFERENCES:
“The Technology 202: Businesses should be watching out for deepfakes too, experts warn.”
By Cat Zakrzewski, https://www.washingtonpost.com/news/powerpost/paloma/the-technology-202/2019/12/13/the-technology-202-businesses-should-be-watching-out-for-deepfakes-too-experts-warn/5df279f1602ff125ce5b2fe7/?tid=ss_mail


Poster Explores Cyber-Security Team Challenges

November 9, 2019

This is an excellent poster exploring some of the challenges facing the cyber security teams in any global financial services enterprise.
It was presented at the Workshop on Usable Security (USEC 2019), Internet Society, San Diego, California, Feb 24, 2019.

The authors’ aims and objectives were:

  • Study the current practices followed by law enforcement and cybercrime investigators.
  • Identify the general characteristics related to the investigation process of cybercrimes.
  • Identify socio-technical challenges currently faced, and areas where new technologies and processes are necessary.
  • Formulate recommendations for designing and building usable security tools and workflows to improve their day-to-day operations.

This poster includes brief summaries of their questions, findings, and conclusions.

If you play any role in managing, leading, funding or reviewing any type of cyber security team (or think you may want to in the future), I strongly recommend adding it to your reading list.

REFERENCE:
“Cybercrime Investigators are Users too! — Understanding the Socio-Technical Challenges Faced by Law Enforcement.”
By Mariam Nouh, Jason R.C. Nurse, Helena Webb, and Michael Goldsmith
https://www.ndss-symposium.org/wp-content/uploads/2019/02/ndss2019posters_paper_23.pdf

 


Capital One Concerns Linked To Overconfidence Bias?

August 2, 2019

Earlier this week, on July 29, FBI agents arrested Paige “erratic” Thompson in connection with the downloading of ~30 GB of Capital One credit application data from a rented cloud data server.

In a statement to the FBI, Capital One reported that an intruder executed a command that retrieved the security credentials for a web application firewall administrator account, used those credentials to list the names of data folders/buckets, and then to copy (sync) them to buckets controlled by the attacker.  The incident appears to have affected approximately 100 million people in the United States and six million in Canada.

If you just want to read a quick summary of the incident, try “For Big Banks, It’s an Endless Fight With Hackers.” or “Capital One says data breach affected 100 million credit card applications.”

I just can’t resist offering some observations, speculation, and opinions on this topic.

Since as early as 2015 the Capital One CIO has hyped their being first to cloud, their cloud journey, and their cloud transformation, and has asserted that their customers’ data was more secure in the cloud than in their private data centers.  Earlier this year the company argued that moving to AWS will “strengthen your security posture” and highlighted their ability to “reduce the impact of compliance on developers” (22:00) — using AWS security services and the network of AWS security partners — software engineers and security engineers “should be one in the same.” (9:34)

I assume that this wasn’t an IT experiment, but an expression of a broader Capital One corporate culture, their values and ethics.  I also assume that there was/is some breakdown in their engineering assumptions about how their cloud infrastructure and its operations worked.  How does this happen?  Given the information available to me today, I wonder about the role of malignant group-think & echo chamber at work or some shared madness gripping too many levels of Capital One management.  Capital One has to have hordes of talented engineers — some of whom had to be sounding alarms about the risks associated with their execution on this ‘cloud first‘ mission (I assume they attempted to communicate that it was leaving them open to accusations of ‘mismanaging customer data’, ‘inaccurate corporate communications,’ excessive risk appetite, and more).  There were lots of elevated risk decisions that managers (at various levels) needed to authorize…

Based on public information, it appears that:

  • The sensitive data was stored in a way that it could be read from the “local instance” in clear text (ineffective or absent encryption).
  • The sensitive data was stored on a cloud version of a file system, not a database (weaker controls, weaker monitoring options).
  • The sensitive data was gathered by Capital One starting in 2005 — which suggests gaps in their data life-cycle management (ineffective or absent data life-cycle management controls).
  • There were no effective alerts or alarms announcing unauthorized access to the sensitive data (ineffective or absent IAM monitoring/alerting/alarming).
  • There were no effective alerts or alarms announcing ‘unexpected’ or out-of-specification traffic patterns (ineffective or absent data communications or data flow monitoring/alerting/alarming; a minimal alarm sketch follows this list).
  • There were no effective alerts or alarms announcing social media, forums, dark web, etc. chatter about threats to Capital One infrastructure/data/operations/etc. (ineffective or absent threat intelligence monitoring & analysis, and follow-on reporting/alerting/alarming).
  • Capital One’s conscious program to “reduce the compliance burden that we put on our developers” (28:23) may have obscured architectural, design, and/or implementation weaknesses from Capital One developers (a lack of security transparency, possibly overconfidence that developers understood their risk management obligations, and possible weaknesses in their secure software program).
  • Capital One ‘wrapped’ a gap in IAM vendor Sailpoint’s platform with custom integrations to AWS identity infrastructure (16:19) (potentially increasing the risk of misunderstanding or omission in this identity & access management ‘plumbing’).
  • There may have been application vulnerabilities that permitted the execution of server side commands (ineffective input validation, scrubbing, etc. and possibly inappropriate application design, and possible weaknesses in their secure code review practices and secure software training).
  • There may have been infrastructure configuration decisions that permitted elevated rights access to local instance meta-data (ineffective configuration engineering and/or implementation).
  • There must be material gaps or weaknesses in Capital One’s architecture risk assessment practices or in how/where they are applied, and/or they must have been incomplete, ineffective, or worse for a long time.
  • And if this was the result of ‘designed-in’ or systemic weaknesses at Capital One, there seems to be room for questions about whether their SEC filings about the effectiveness of their controls are supportable by the facts of their implementation and operational practices.
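On the two monitoring bullets above, here is a minimal sketch of one guardrail that appears to have been absent: a CloudWatch alarm on anomalously large download volume from a sensitive bucket. The bucket name, threshold, and SNS topic are hypothetical, and the alarm assumes S3 request metrics have already been enabled for the bucket.

```python
# Sketch: alarm when a bucket serves an anomalously large download volume,
# the kind of signal a multi-GB unauthorized 'sync' should trip.
# Assumes S3 request metrics are enabled; names/thresholds are hypothetical.
import boto3

cloudwatch = boto3.client("cloudwatch")

cloudwatch.put_metric_alarm(
    AlarmName="s3-bulk-download-sensitive-bucket",      # hypothetical name
    Namespace="AWS/S3",
    MetricName="BytesDownloaded",
    Dimensions=[
        {"Name": "BucketName", "Value": "example-sensitive-bucket"},
        {"Name": "FilterId", "Value": "EntireBucket"},  # request-metrics filter
    ],
    Statistic="Sum",
    Period=300,                       # evaluate in 5-minute windows
    EvaluationPeriods=1,
    Threshold=5 * 1024 ** 3,          # > 5 GiB downloaded in 5 minutes
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="notBreaching",  # quiet buckets should not alarm
    AlarmActions=["arn:aws:sns:us-east-1:111122223333:security-alerts"],
)
```

A real deployment would pair this with CloudTrail data-event analysis; the point is only that a multi-GB unauthorized ‘sync’ is a loud event if anything has been wired up to listen for it.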

In almost any context this is a pretty damning list.  Most of these are areas where global financial services enterprises are supposed to be experts.

Aren’t there also supposed to be internal systems in place to ensure that each financial services enterprise achieves risk-reasonable levels of excellence in each of the areas mentioned in the bullets above?  And where were the regulations & regulators that play a role in assuring that this is the case?

How does an enormous, heavily-regulated financial services enterprise get into a situation like this?  There is a lot of psychological research suggesting that overconfidence is a widespread cognitive bias and I’ve read, for example, that it underpins what is sometimes called ‘organizational hubris,’ which seems like a useful label here.   The McCombs School of Business Ethics Unwrapped program defines ‘overconfidence bias’ as “the tendency people have to be more confident in their own abilities than is objectively reasonable.”  That also seems like a theme applicable to this situation.  Given my incomplete view of the facts, it seems like this may have been primarily a people problem, and only secondarily a technology problem.  There is probably no simple answer…

Is the Capital One case unique?  Could other financial services enterprises be on analogous journeys?

REFERENCES:
“Capital One Data Theft Impacts 106M People.” By Brian Krebs. https://krebsonsecurity.com/2019/07/capital-one-data-theft-impacts-106m-people/
“Why did we pick AWS for Capital One? We believe we can operate more securely in their cloud than in our own data centers.” By Rob Alexander, CIO, Capital One, https://aws.amazon.com/campaigns/cloud-transformation/capital-one/ and https://youtu.be/0E90-ExySb8?t=212
“For Big Banks, It’s an Endless Fight With Hackers.” By Stacy Cowley and Nicole Perlroth, 30 July 2019. https://www.nytimes.com/2019/07/30/business/bank-hacks-capital-one.html
“Capital One says data breach affected 100 million credit card applications.” By Devlin Barrett. https://www.washingtonpost.com/national-security/capital-one-data-breach-compromises-tens-of-millions-of-credit-card-applications-fbi-says/2019/07/29/…
“AWS re:Inforce 2019: Capital One Case Study: Addressing Compliance and Security within AWS (FND219)” https://youtu.be/HJjhfmcrq1s
“Frequently Asked Questions.” https://www.capitalone.com/facts2019/2/
Overconfidence Bias defined: https://ethicsunwrapped.utexas.edu/glossary/overconfidence-bias
Scholarly articles for cognitive bias overconfidence: https://scholar.google.com/scholar?hl=en&as_sdt=1,16&as_vis=1&q=cognitive+bias+overconfidence&scisbd=1
“How to Recognize (and Cure) Your Own Hubris.” By John Baldoni. https://hbr.org/2010/09/how-to-recognize-and-cure-your

 


Facial-Identification Bias and Error – Lesson For AI/ML-Enabled Security Services

February 2, 2019

Six months ago I published a short rant about the potential for material bias and unknown error represented in the many AI/ML-driven security services being pitched to global financial services enterprises. Since that time, and in the most general terms, my experiences in this field have been less-than-positive. The central themes that I hear from proponents of these services seem to be — “you don’t get it, the very nature of AI/ML incorporates constant improvements,” and “you are just resistant to change.” There seems to be little appetite for investigating any of the design, implementation, and operational details needed to understand whether given services would deliver cost- and risk-relevant protections — which should be in the foreground of our efforts to protect trillions of dollars worth of other people’s money. It is not. “Buy in, buster.” Then a quick return to what seems central to our industry’s global workforce — distraction. Ugh.

Because of the scale of our operations and their interconnectedness with global economic activity, financial services risk management professionals need to do the work required to make ‘informed-enough’ decisions.

Recent assessments of leading facial-identification systems have shown that some incorporate material bias and error. In a manner analogous to facial recognition technologies, AI/ML-driven security analysis technology is coded, configured, and trained by humans, and must incorporate real potential for material bias and unknown error.

Expense pressures and an enduring faith in technology have delivered infrastructure complexity and attack surfaces that were unthinkable only a few years ago. Concurrently, hostile activity is more diverse and continues to grow in scale. We need to find new ways to deal with that complexity at scale in an operational environment that is in constant (often undocumented) flux. Saying “yes” to opaque AI/ML-enabled event/threat/vulnerability analysis services might be the right thing to do in some situations. Be prepared, though, for that day when your risk management operations are exposed to legal discovery… Will “I had faith in my vendor” be good-enough to protect your brand? That seems like a sizable risk. Bias and error find their way into much of what we do. Attempting to identify and deal with them and their potential impacts has been part of global financial services risk management for decades. Don’t let AI/ML promoters eliminate that practice from your operations.
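For the ‘identify and deal with them’ practice, here is a minimal sketch of what disaggregated error measurement looks like: compute a detector’s false-positive rate per cohort of a labeled evaluation set rather than trusting one aggregate number. The record layout and cohort labels are hypothetical.

```python
# Sketch: disaggregate a detector's error rates by subgroup instead of
# trusting a single aggregate accuracy number. Records are hypothetical
# (label = ground truth, pred = model output, group = cohort of interest).
from collections import defaultdict

def false_positive_rates(records):
    """Return {group: FP / (FP + TN)} over labeled evaluation records."""
    fp = defaultdict(int)  # benign items wrongly flagged
    tn = defaultdict(int)  # benign items correctly passed
    for r in records:
        if r["label"] == 0:  # ground truth: benign
            if r["pred"] == 1:
                fp[r["group"]] += 1
            else:
                tn[r["group"]] += 1
    return {g: fp[g] / (fp[g] + tn[g])
            for g in set(fp) | set(tn) if fp[g] + tn[g]}

# A detector can look fine in aggregate while being unusable for one cohort:
evaluation = (
    [{"group": "A", "label": 0, "pred": 0}] * 95
    + [{"group": "A", "label": 0, "pred": 1}] * 5
    + [{"group": "B", "label": 0, "pred": 0}] * 70
    + [{"group": "B", "label": 0, "pred": 1}] * 30
)
print(false_positive_rates(evaluation))  # A: 0.05, B: 0.30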

REFERENCES:
“Bias & Error In Security AI/ML.”
https://completosec.wordpress.com/2018/07/14/bias-error-in-security-ai-ml/

“Amazon facial-identification software used by police falls short on tests for accuracy and bias, new research finds.”
https://www.washingtonpost.com/technology/2019/01/25/amazon-facial-identification-software-used-by-police-falls-short-tests-accuracy-bias-new-research-finds/
https://www.washingtonpost.com/technology/amazon-facial-identification-software-used-by-police-falls-short-on-tests-for-accuracy-and-bias-new-research-finds/2019/01/25/fa74fbb5-1079-4cc1-8cb4-23d71bd238e2_story.html
By Drew Harwell, 01-25-2019


Cloud Concentration Brings Operational Risks

July 18, 2018

The sales job included a theme that the Internet is so decentralized that anything named cloud simply inherited all the good ‘ilities’ that massive decentralization might deliver. One of the core tenets of that Internet story-telling is about the resiliency that decentralization guarantees — as in: no single outage can cause problems for your cloud-delivered services.

The reality is that the story is too often just nonsense. Google, Amazon, and Microsoft control enormous amounts of the infrastructure and operations upon which “cloud” services depend. When any of these three companies has an outage, broad swaths of “cloud-delivered” business services experience outages as well. If you are going to depend on cloud-enabled operations for your success, use care in how you define that success, and in how you communicate about service levels with your customers. This is especially problematic for industries where customers have been trained to expect extremely high service levels — global financial services enterprises fall into this category.  Factor this risk into your business & technical plans.

Yesterday, Tuesday, July 17th, Google’s services that provide computing, storage, and data management tools for companies failed. The customers who depended upon those foundational services also failed. Snapchat and Spotify were two high-profile examples, but the outages were far more widespread. Google services that depend upon those storage and data management tools failed as well. It appears that Google Cloud Networking was knocked down as the company reported “We are investigating a problem with Google Cloud Global Load balancers returning 502s for many services including AppEngine, Stackdriver, Dialogflow, as well as customer Global Load Balancers.” This has broad impact because all customers attempting to deliver high-quality services depend on load balancers. It appears that this networking issue caused downstream outages like those experienced by Breitbart and Drudge Report.
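Transient load-balancer 502s like these are exactly what client-side retry with capped, jittered backoff exists to absorb, so planning for this reality partly means writing clients that expect them. A minimal sketch follows, with a hypothetical endpoint; note that no retry loop outlasts an hours-long regional outage, which is the deeper planning point.

```python
# Sketch: absorb transient load-balancer 502/503/504s with capped, jittered
# retries. The endpoint is hypothetical; a real client also needs a fallback
# plan, because no retry loop outlasts a multi-hour outage.
import random
import time

import requests

RETRYABLE = {502, 503, 504}

def get_with_backoff(url, attempts=5, base=0.5, cap=30.0):
    for attempt in range(attempts):
        try:
            resp = requests.get(url, timeout=10)
            if resp.status_code not in RETRYABLE:
                return resp
        except requests.RequestException:
            pass  # connection errors and timeouts: treat as retryable
        # Full jitter: sleep a random slice of an exponentially growing window.
        time.sleep(random.uniform(0, min(cap, base * 2 ** attempt)))
    raise RuntimeError(f"{url} still failing after {attempts} attempts")

# resp = get_with_backoff("https://api.example.com/health")  # hypothetical
```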

This was the 5th non-trivial outage this year for Google’s AppEngine. Google’s ComputeEngine has had 7 outages this year, 12 for their Cloud Networking, 8 for their Stackdriver, 3 for Google Cloud Console, 3 for Cloud Pub/Sub, 4 for Google Kubernetes Engine, 2 for Cloud Storage, 4 for BigQuery, 2 for their DataStore, 2 for their Cloud Developer Tools, and 1 reported for their Identity & Security services.
To add insult, it appears that the Google Enterprise Support page may also not have been working during the outage on the 17th…

Amazon also experienced service failures that were widely reported the day before (although they are not documented on the company’s service health dashboard).
Microsoft had “internal routing” failures that resulted in widespread service outages for over an hour on the 16th as well.

Plan this reality into your cloud-enabled strategies, architectures, designs, implementations, testing, monitoring, reporting, and contracts.

July 31, 2018 UPDATE:

One of my peers asked about the ability of legacy financial services IT shops to deliver ‘cloud-like’ service levels, arguing that all of our global financial services enterprises have material entrenched IT infrastructure that has service level issues as well.  It seems like a fair reaction, so here is my response:
My motivation for the essay above was driven by several observations:
(1) There are still individuals in our industry who seem to believe there is some magical power available to ‘cloud’ things that will make existing problems and constraints disappear. These individuals make me tired. I tried to highlight a little of the cloud vendors’ muggle nature.
(2) The three main cloud vendors I mention — Amazon, Google and Microsoft — provide what I perceive as selective and self-serving views into their service outages. They have outages, short and not-so-short, that go unreported on all of the outage-reporting interfaces available to me. That behavior seems to be designed into their development and infrastructure management practices as well as their sales/marketing practices. When they fail fast, it is on customers’ time and investments (ours), not only on their own (and, I believe, rapid recovery skills do not justify or compensate for any given outage…). I mentioned Amazon on this score, but my reading leads me to believe that all three exhibit this behavior.
(3) We all have ‘knobs’ or ‘levers’ available to us at our corporations to influence our service levels in ways that might help us distinguish ourselves from our competition. My working understanding is that service levels are just an expression of management will. I get it that those responsible for serious profit/loss decision-making have a fiendishly difficult role. That said, if it were important enough, our leaders would adjust the available management ‘knobs’ in ways that do/can/would deliver whatever was needed — on our systems or those owned & managed by others. I understand that there is a hornet’s nest of competing priorities and trade-offs that makes those decisions tricky. I also know that many of those ‘knobs’ are not available to our management teams in analogous ways for most of our ‘cloud’ vendor candidates.

Some argue that there are the right drivers to replace our systems with ‘cloud-enabled’ vendor services in combination with some amount of our code and configuration. To live out that desire, it seems like most of us would need to re-architect much of our business practices and in parallel the stacks of systems that enable them if we were expecting to deliver competitive sets of features and accompanying service levels in the global financial services enterprise arena. A material chunk of that re-architecting would need to compensate for the outage patterns exhibited by the ecosystem of major and minor players involved — and because this is my blog, to compensate for the new information security risk management challenges that path presents as well.

REFERENCES:
Google Cloud Status Dashboard
https://status.cloud.google.com/summary

Amazon Service Health Dashboard
https://status.aws.amazon.com/

Azure Status History
https://azure.microsoft.com/en-us/status/history/
Azure Status
https://azure.microsoft.com/en-us/status/
Office365 Health
https://portal.office.com/servicestatus

Google Cloud Has Disruption, Bringing Snapchat, Spotify Down
https://www.bloomberg.com/news/articles/2018-07-17/google-cloud-has-disruption-bringing-snapchat-spotify-down
By Mark Bergen, July 17, 2018

Spotify, Snapchat, and more are down following Google Cloud incident (update: fixed)
https://venturebeat.com/2018/07/17/discord-snapchat-and-more-are-down-following-google-cloud-incident/
Jeff Grubb, JULY 17, 2018

Google Cloud Platform fixes issues that took down Spotify, Snapchat and other popular sites
https://www.cnbc.com/2018/07/13/google-cloud-platform-reports-issues-snap-and-other-popular-apps-affe.html
Chloe Aiello, 07-17-2018

[Update: Resolved] Google Cloud has been experiencing an outage, resulting in widespread problems with several services
https://www.androidpolice.com/2018/07/17/google-cloud-experiencing-outage-resulting-widespread-problems-several-services/
By Ryne Hager, 07-17-2018

Google Enterprise Support page Outage Reference


Bias & Error In Security AI/ML

July 14, 2018

It is difficult to get through a few minutes today without the arrival of some sort of vendor spam including the use of artificial intelligence and machine learning (AI/ML) to analyze event/threat/vulnerability data and then provide actionable guidance, or to perform/trigger actions themselves.

Global financial services enterprises have extreme risk analysis needs in the face of enormous streams of threat, vulnerability, and event data. While it might seem attractive to hook up with one or more of these AI/ML hypesters, think hard before incorporating these types of systems into your risk analysis pipelines.  At some point they will be exposed to discovery — and at that point, is there risk to your brand?

In a manner analogous to facial recognition technologies, AI/ML-driven security analysis technology is coded, configured, and trained by humans, and must incorporate the potential for material bias and unknown error.

Microsoft recently called for regulation of facial recognition technology and its application.  I don’t know if regulation is the appropriate path for AI/ML-driven security analysis technologies.  I think that we do, though, need to remain aware of the bias and error in our implementations — and protect our employers from unjustifiable liability risks on this front.  Demand transparency and strong evidence of due diligence from your vendors, and test, test, test.
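To make ‘test, test, test’ concrete, here is a minimal sketch of an acceptance gate: score the vendor’s opaque classifier against your own labeled, held-out corpus and gate adoption on the resulting confusion matrix. The vendor_classify() call and the thresholds are hypothetical stand-ins for whatever the product actually exposes.

```python
# Sketch: acceptance-test an opaque vendor detector against a corpus the
# vendor has never seen. vendor_classify() is a stand-in for the real API.
def confusion_matrix(samples, vendor_classify):
    """samples: iterable of (payload, truth), truth in {0 benign, 1 bad}."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for payload, truth in samples:
        verdict = vendor_classify(payload)  # 1 = flagged, 0 = passed
        key = ("t" if verdict == truth else "f") + ("p" if verdict else "n")
        counts[key] += 1
    return counts

def acceptable(c, max_fpr=0.01, min_recall=0.95):
    fpr = c["fp"] / max(1, c["fp"] + c["tn"])
    recall = c["tp"] / max(1, c["tp"] + c["fn"])
    return fpr <= max_fpr and recall >= min_recall

# Gate the purchase on evidence, not the datasheet (names hypothetical):
# counts = confusion_matrix(held_out_corpus, vendor_classify)
# assert acceptable(counts), f"vendor fails acceptance: {counts}"
```

Rerunning this gate on every vendor model update is part of staying aware of the bias and error in the implementation, rather than accepting it on faith.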

References:

“Facial recognition technology: The need for public regulation and corporate responsibility.” Jul 13, 2018, by Brad Smith – President and Chief Legal Officer, Microsoft
https://blogs.microsoft.com/on-the-issues/2018/07/13/facial-recognition-technology-the-need-for-public-regulation-and-corporate-responsibility/

“The Future Computed – Artificial Intelligence and its role in society.”
By Microsoft.
https://blogs.microsoft.com/uploads/2018/02/The-Future-Computed_2.8.18.pdf

“Amazon’s Facial Recognition Wrongly Identifies 28 Lawmakers, A.C.L.U. Says.”
By Natasha Singer, July 26, 2018
https://www.nytimes.com/2018/07/26/technology/amazon-aclu-facial-recognition-congress.html (added/updated 07-27-2018)

