Friday 14th January marked an interesting, yet somewhat unheralded, day in Cloud with this announcement from Google that they are to become “the first major cloud provider to eliminate maintenance windows from their service level agreement.” To paraphrase, the folks in Google’s operations teams are focusing on a target of zero downtime, planned or otherwise, and in doing so seem to be making the statement that “continuous uptime” is something they hope will become a key differentiator in their bid to own the multi-gazillion-dollar SaaS-based productivity apps market. I wonder if they’ll coax Billy Joel into a re-work of his ’80s classic, taking the title of this post, in the hope that the video goes viral on YouTube?
Following this announcement, Cloudave’s Krish Subramanian posted a tantalizing blog entry which skilfully posed the following question:
Although it’s hard to tell whether the question posed was a call to arms or a rhetorical acceptance of the current state of enterprise thinking, Subramanian articulates a very salient point, and one that, in my experience, sits uncomfortably at the front of the minds of many CIOs considering cloud services as truly viable solutions – is it better or worse than my current environment?
I know, before you say it: there is clearly a problem with that question.
As Chris Hoff has implied many times via his Rational Survivability blog, and specifically in his excellent presentation entitled “Cloudifornication”, the question “is the cloud more secure?” can only be answered with the question “more secure than what?“. In a parallel universe, the question “is it better or worse than my current environment?” can only be answered with “how bad is your current environment?”. Quid pro quo.
Diving a little deeper into my thought process, I am not sure whether the key word troubling CIOs is “availability” or “reliability”, or both. Following an incredibly short but detailed twitterburst of activity from the ever-willing (and incredibly smart) members of the Clouderati, I began to wonder if those two words (along with the cringe-inducing notion of the SLA) are actually taking on different contexts as we look to find the right balance between the new breed and the old school of business-enabling, agile, cost-efficient and predictable services.
Let’s take a look at the areas.
I’ll start by taking an unadulterated swipe at the utterly ridiculous notion of the SLA. Irrespective of whether the SLA is external (supplier to customer) or internal (IT to business), the very premise of the traditional SLA makes little practical sense. In the external case, the legalese is (understandably) biased in favor of the supplier, and it’s likely the “guarantee” will not cover some of the things you would like to see “measured” as part of the service. In the internal case, unless IT is an outsourced environment (which I would suggest is really customer to supplier), wouldn’t it be better to simply gather key metrics and make them available as a monthly report? I can’t imagine there is an organization in the world that hasn’t got better things to do than create an over-complex, unrealistic set of objectives that ultimately carry little or no consequence if the targets are not met. Puzzling.
Having had the honor of formulating and managing some extremely large IT service contracts over the years, and of managing operational activities for a very large global organization, I cannot even begin to calculate the number of hours (and therefore $$$) that have gone into reviews by internal and external legal teams – and for what exactly? A guarantee of “the best we will ever perform” as a provider, with a credit / remuneration clause that is in no way comparable to any potential loss of revenue / earnings. No provider on the planet is going to align their SLA terms with any true-to-life “unavailability to loss” scenario. Ideally, we would want the contractual clauses to be on a par with the concept of liquidated damages. Sadly, that’s never going to happen.
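To make that mismatch concrete, here is a minimal sketch of the typical percentage-of-fee service credit model. The numbers are entirely hypothetical – no real provider’s terms – but the shape of the problem is accurate: the remedy scales with the fee, not with the loss.

```python
def sla_credit(monthly_fee: float, credit_pct: float) -> float:
    """Typical SLA remedy: a fixed percentage of the monthly service fee."""
    return monthly_fee * credit_pct / 100


# Hypothetical numbers: a 10% credit on a $10,000/month contract returns
# $1,000 – regardless of whether the outage cost the business $1,000 or
# $1,000,000 in lost revenue.
credit = sla_credit(10_000, 10)
print(credit)  # → 1000.0
```

The credit is capped by the fee, while the loss is capped by nothing – which is precisely why no SLA clause resembles liquidated damages.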
We would be far better off dropping the “A” for “Agreement” and replacing it with an “E” for “Expectation”. That way, we could retain transparency and create freedom of choice without investing in a legal sign-off on something so fundamentally misaligned that it is utterly worthless.
In summary, I don’t see how an SLA, especially an uptime SLA, can be a major differentiator. It’s about performance, not promises.
Continuing with the “on-premise versus off-premise” email scenario described by Subramanian, I believe this area is where a significant amount of confusion (maybe even FUD) lies when discussing the merits and pitfalls of cloud. Consider the organization that steadfastly sticks with its internal deployment of MS Exchange versus the organization that moves its entire workforce email to, say, Google Apps – let’s look at some numbers.
“In 2010, Gmail was available 99.984 percent of the time, for both business and consumer users. 99.984 percent translates to seven minutes of downtime per month over the last year. Seven minutes of downtime compares very favorably with on-premises email, which is subject to much higher rates of interruption that hurt employee productivity. Our calculations suggest that Gmail is 32 times more reliable than the average email system, and 46 times more available than Microsoft Exchange.” (Source: Google)
Confused? Yeah, me too. 32 times more reliable and 46 times more available. But here’s a thought – if you lose your network connectivity from “the premises”, then both systems are pretty much useless as far as their core function of sending and receiving external email is concerned. No network, no SMTP, no email. And what is the difference between “available” and “reliable” in the wording above?
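As a sanity check on Google’s quoted figure, converting an availability percentage into expected downtime is simple arithmetic. A quick sketch (assuming a 30-day month):

```python
def downtime_minutes_per_month(availability_pct: float,
                               minutes_per_month: int = 30 * 24 * 60) -> float:
    """Convert an availability percentage into expected downtime per month."""
    return (1 - availability_pct / 100) * minutes_per_month


# Google's quoted 99.984% availability:
print(round(downtime_minutes_per_month(99.984), 1))  # → 6.9
```

Roughly seven minutes a month, so the quoted number checks out – which is exactly why the available-versus-reliable wording, not the arithmetic, is where the scrutiny belongs.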
Like it or not, global inter-company email is predicated upon what used to be depicted as THE cloud with funky yellow lightning bolts. The public internet. If you can’t connect to it, you’re going to suffer, even if your email is within your four walls. Therefore, I would suggest that the above is an availability discussion, rooted firmly in the same architectural and operational DNA as the “availability of central IT services at the branch office via the company WAN” and is far more likely to be a point-of-failure discussion at the “consumer” end, rather than “provider” end. I am not sure cloud can be pointed to as the only example of where this becomes the key consideration.
Unlike the availability discussion above, I do believe that the question of reliability is much more in the realm of responsibility of the provider and therefore, the enterprise CIO may be *wiser* to focus time and energy, where possible, in comparing the performance metrics of competitors in a given “service space” (whichever aaS) with that of available benchmarks from his or her own environment.
One of the major problems facing cloud providers is the fallout from the loss of service to multiple tenants (customers) when something catastrophic happens. For the bigger players, this usually includes bad press and, in some cases, reputation damage – both of which add to the reluctance of organizations to move to cloud as it is further deemed “unreliable”. What is often forgotten is that this type of catastrophic outage affects private cloud environments (I know, because I live it every day) in a similar way, as multiple business units become multi-tenant configurations and are served via one or more global data centers. Unless there is true “hot standby” in the private cloud (which I have yet to see), the effect of a catastrophic outage is felt across more than a single set of “ringfenced” users.
And then, of course, there is the “traditional data center”, with all its inherent problems and operational headaches. These environments tend to enjoy life “under the radar” today, as outages, although prolonged in some cases, only affect certain parts of the business and certain applications. Although these outages can be very damaging to productivity, they receive nothing like the negative publicity of their cloudy counterparts.
There is an unfortunately macabre analogy that helps bring this thought process together:
The number of US highway deaths in a typical six month period – around 21,000 – roughly equals all commercial jet fatalities worldwide since the dawn of jet aviation over four decades ago. In fact, fewer people have died in commercial airplane accidents in America over the past 60 years than are killed in US automobile accidents in any typical three-month time period. (Source: Boeing Corporation)
You rarely hear of a road crash (the traditional data center) making national news, but a commercial jet crash (the cloud) is guaranteed to make headlines. Perhaps this is simply due to the number of people affected on board the airliner at a single point in time?
So, assuming the network connectivity at the “consumer” end remains available, problems with reliability are likely to be confined to the “provider” end (private or public).
It is (and will be) interesting to see what happens when enterprises struggle with “issues” arising from the movement of application workloads from traditional data centers (where they may or may not be virtualized) to IaaS providers. As organizations move forward, this will almost certainly occur, and it cannot be attributed to a problem with the availability or reliability of either the consumer or the provider. This is a phenomenon I have nicknamed “I2” (Instability & Incompatibility) – more to follow on that in a later post.
Availability may be considered in the same way as the traditional thinking around MTTR (Mean Time To Recover / Repair) metrics.
Reliability may be considered in the same way as the traditional thinking around MTBF (Mean Time Between Failures) metrics.
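For readers who want the textbook relationship between those two metrics, steady-state availability is conventionally expressed as uptime over total time. A minimal sketch, using illustrative numbers rather than any vendor’s figures:

```python
def steady_state_availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Classic steady-state availability: MTBF / (MTBF + MTTR).

    MTBF (reliability) measures how often you fail; MTTR (availability
    impact) measures how long each failure hurts.
    """
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Illustrative numbers: a system that fails every 1,000 hours and takes
# 4 hours to recover.
print(f"{steady_state_availability(1000, 4):.3%}")  # → 99.602%
```

The formula makes the distinction in Google’s quote tangible: you can improve availability by repairing faster (MTTR) without ever becoming more reliable (MTBF), and vice versa – which is why the two multipliers in the quote can legitimately differ.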
Both of the above matter, and will continue to matter, throughout the endless discussion, debate and deliberation around movement to cloud. Their relative importance depends on the context and timeline of your individual discussion; they are not equally weighted considerations at every point in the challenge of the cloud conundrum.
If you’re after a great single pane of glass view on public cloud status and benchmarks, then the folks at Cloudharmony.com have a pretty good dashboard which details many public cloud providers.
My final piece of advice? Be methodical, be thorough, be data-driven – but when you feel you’re picking your way through shark-infested waters, don’t ever let the SLA be your navigational beacon.