1 vs. 100891344545564193334812497256, part I (on Enterprise IT, disorder, QoS & SLA)
The numbers in the title of this post represent combinations of order and disorder in a system of one hundred elements. When all hundred elements are up, the system is in order. If even one element out of the hundred is down, the system is in a disordered state. Of course, more than one element can be down at any one time.
A question: how many combinations of up/down elements in a 100-element system represent an ordered state, and how many a disordered state? (A hint: the answers are presented in the title…)
Sadly enough, disorder outnumbers order, and that is an understatement.
(Actually, the number of ordered states is two and not one. Mathematically speaking, if all 100 elements are down, the system is also said to be in order. Business-wise, though, a system that rests in peace cannot help in generating revenues – so I dropped one of the two possibilities…).
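The counting can be checked with a few lines of Python. This is only a minimal sketch: it assumes each element is independently either up or down, so a system of n elements has 2^n combinations, and the function names are invented for illustration. Under that assumption the count of combinations with at least one element down is 2^100 - 1, while the 30-digit figure in the title equals C(100, 50), the number of combinations in which exactly half of the elements are down. Either way, disorder wins by an astronomical margin.

```python
from math import comb


def total_states(n: int) -> int:
    """Each element is either up or down, so there are 2**n combinations."""
    return 2 ** n


def disordered_states(n: int) -> int:
    """All combinations except the single all-up one.

    Assumption of this sketch: the all-down combination is counted as
    disordered, since it generates no revenue.
    """
    return total_states(n) - 1


def states_with_k_down(n: int, k: int) -> int:
    """Combinations in which exactly k of the n elements are down."""
    return comb(n, k)


n = 100
print("ordered states:        ", 1)
print("disordered states:     ", disordered_states(n))
print("exactly 50 of 100 down:", states_with_k_down(n, 50))
```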
Now, the last thing you should do is panic, as all multi-component systems exist with an inherent chaotic state inside ("Chaos Inside" could be a nice slogan for large IT shops). And as Benoit Mandelbrot, of fractals fame, nicely suggested, there's no point in fighting the chaotic nature of complex systems. One has to learn how to live with it, keeping the noise of chaos at the lowest possible level. It's like good and evil – we have to live with them both and do our best.
The point I'd like to argue, after this intro, is that today's dynamic Enterprise IT shops are by nature un-QoS-able and un-SLA-able. If there's an Enterprise in which distributed business processes demonstrate 99.99% or 99.999% availability, along with excellent performance, then I'd say it's the music of chance (unless a special, proprietary "counter-chaos infrastructure" has been built for a very specific, limited and controlled set of business processes).
I remember trying to explain to my engineering-oriented executives why IT is so unstable in comparison with the network elements they were familiar with. "Inside the switch there is also software and it's never down", they claimed.
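To put those availability figures in perspective, here is a rough sketch of the arithmetic. It rests on assumptions that are not part of the argument above: a 365-day year, and, for the chained case, a process that is up only when every one of its components is up, with components failing independently. The component count and per-component availability are made-up illustrative numbers.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assumes a non-leap year


def downtime_minutes_per_year(availability: float) -> float:
    """Yearly downtime allowed by a given availability (0.9999 = 99.99%)."""
    return (1.0 - availability) * MINUTES_PER_YEAR


def serial_availability(component_availability: float, components: int) -> float:
    """Availability of a chain that needs all of its components up.

    Assumes independent failures and identical components, which real
    data centers rarely offer; treat the result as an optimistic estimate.
    """
    return component_availability ** components


for target in (0.9999, 0.99999):
    print(f"{target:.3%} allows {downtime_minutes_per_year(target):.1f} min/year down")

# Hypothetical process built from 20 components, each at 99.99%.
chain = serial_availability(0.9999, 20)
print(f"20 chained components at 99.99% each: {chain:.3%}")
```

Under these assumptions, 99.99% leaves under an hour of downtime a year and 99.999% only about five minutes, while a process chained from twenty "four nines" components already drops to roughly 99.8%.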
Here's what I told them:
Hardware vendors differ from Enterprise IT shops in two major dimensions: testing and changes. Hardware vendors perform extensive product testing as well as interoperability testing (i.e. the certification matrix). Early this year I visited the EMC headquarters in Boston and had the opportunity to see their QA facilities. They invest US$200 million a year, so I've been told, on interoperability testing alone, and I would double that figure for the whole quality chain. That's a good amount of money invested every year to minimize the chaotic and disordered state of their products.
In their testing, hardware vendors and ISVs set up a fixed number of controlled, static environments in which all elements and their configurations are well known and supervised. The tests check each system's sub-components, their integration, interoperability and performance.
It is the static, walled-garden nature of the configurations that enables the issuance of certifications.
This pattern of static, well-known configurations hasn't changed in years. Some years ago it was shared by Enterprise IT shops as well, but no longer.
The internet enabled a proliferation of customer touch-points and much greater variation in the company's portfolio of products & services. Consequently, a holistic view of the customer has become a necessity. Where previously (real) customers were "forwarded" to local-application business functionality, they are now forwarded to cross-application business processes that enable homogeneity of the (digital) customer (as well as of other business entities). These cross-application business processes are the complex systems of today's Enterprises – not (just) the business applications that support them.
Having said that, the next intuitive question one should ask is "what could possibly go wrong in the underlying components of these business processes?" and the answer is, of course – everything.
Some of the reasons for things to go wrong with business processes are:
a. Business processes are not formally captured. They are automated, true, but their formal description is kept mostly in Word documents or in people's heads. Try getting, in your own enterprise, a report with an up-to-date inventory of the Enterprise's business processes. Things that do not formally exist are very hard to manage and control.
b. Business processes are neither formally mapped to their sub-components, nor to each other. Without formal mapping there is hardly a way to tell the impact of a change in a data center element on the complex of business processes.
c. Even if such formal mapping were in place, there is no way that I'm aware of to assess the performance (QoS) impact of a data center element change on the business processes related to it. When a change is introduced to a business application (SAP, Siebel, Amdocs etc.), load tests usually measure the impact of the change on the direct users of the business application (i.e. clients, batch processes etc.). Load testing does not, and probably cannot easily, assess the impact of a functional change on external, cross-application business processes that access the business application's APIs.
d. Unlike a hardware system or an ISV system, where changes by the Enterprise are highly constrained and limited to one or two places, changes to the underlying components of a business process can occur at all levels and in many different ways. As a single business process incorporates business applications, servers, databases, storage devices, network cards & routers, storage directors etc., it actually becomes a mini-IT. Cross-component associations (a.k.a. links) can be added, replaced or removed. Changes to the components themselves happen regularly.
e. QA is not, and will never be, as optimal as it is in hardware or ISV companies. Time To Market pressure makes a "Green pass" to production (i.e. no QA) more popular than ever; budgetary pressure to cut IT costs will always translate into cuts in testing rather than cuts in functionality, as testing is perceived as a kind of insurance, while functionality is the make-or-break of the company. Also, QA is an internal IT affair; functionality is an external IT affair. Again, political forces influence IT reality.
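Points b and d can be made concrete with a small sketch of what a formal mapping would make possible. It assumes a hypothetical dependency map in which every business process or component lists the items it directly depends on; all the names in the map are invented for illustration and do not describe any real product or data center. Given such a map, finding the processes touched by a change becomes a simple reachability question.

```python
from collections import deque

# Hypothetical map: each key depends directly on the listed items.
DEPENDS_ON = {
    "order-to-cash": ["crm-app", "billing-app"],
    "customer-onboarding": ["crm-app"],
    "crm-app": ["db-server-1", "san-array-1"],
    "billing-app": ["db-server-2", "san-array-1"],
    "db-server-1": ["core-router"],
    "db-server-2": ["core-router"],
}
BUSINESS_PROCESSES = {"order-to-cash", "customer-onboarding"}


def impacted_processes(changed, depends_on, processes):
    """Return the business processes that depend, directly or not, on `changed`."""
    # Invert the map so we can walk from a component up to its dependents.
    dependents = {}
    for item, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, set()).add(item)

    # Breadth-first walk over everything that sits above the changed element.
    seen = {changed}
    queue = deque([changed])
    while queue:
        current = queue.popleft()
        for parent in dependents.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen & processes


print(sorted(impacted_processes("san-array-1", DEPENDS_ON, BUSINESS_PROCESSES)))
print(sorted(impacted_processes("db-server-2", DEPENDS_ON, BUSINESS_PROCESSES)))
```

A map like this only tells which processes a change touches. It says nothing about how their performance changes, which is exactly the gap described in point c.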
So business processes are the new focus of Enterprises, and yet they are as fragile as one could imagine.
This alone demonstrates the problems in guaranteeing an SLA or in committing to QoS. I would like, nevertheless, to elaborate a bit further on Web Services, which are the natural building blocks of today's business processes. It is also in the Web Services sphere that vendors supporting or promoting the WS standards specifically mention SLAs. We'll see in the next post whether web services can bring some salvation to the problems raised here.