Wednesday, August 31, 2005

1 vs. 100891344545564193334812497256, part II (on Enterprise IT, Disorder, QoS & SLA)

This is the second post on Enterprise IT, Disorder, QoS and SLA.

If you thought things were complicated, wait till you encounter Web Services.
Web Services make things worse.

Below are Exhibits A and B, taken from an IBM article on the subject, "Use SLAs in a Web services context, Part 1: Guarantee your Web service with a SLA".

Exhibit A:

Figure 1. Architecture for a Web service covered by a SLA

What's described above is a process in which an application asks the service broker for registered web services that, besides providing the needed functionality, also meet the client application's SLA requirements. All these negotiations and bindings occur in real time, i.e. a conditional dynamic binding of a web service.
But what I see here is the creation of a complex system on the fly. No counter-chaos methods applied, no QA, no interoperability checks, nothing. Pure plug and play, based on declarations. (By the way, the availability figures discussed in the article are all below 99.9%, i.e. 3 nines, or roughly 526 minutes of outage per year. In my previous position at Orange, 3 nines were no longer accepted by the business!)
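To make the point concrete, here is a minimal sketch of what such a lookup boils down to. It is not taken from the IBM article: the names `ServiceOffer`, `ServiceBroker`, `find`, and the sample services are all hypothetical, and I assume the broker compares only the availability figure each provider declares.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class ServiceOffer:
    """A registered web service plus the availability its provider DECLARES (hypothetical model)."""
    name: str
    function: str
    declared_availability: float  # e.g. 0.999 for "3 nines"


class ServiceBroker:
    """Hypothetical broker: matches on function and declared availability, nothing else."""

    def __init__(self, offers: Iterable[ServiceOffer]) -> None:
        self._offers = list(offers)

    def find(self, function: str, min_availability: float) -> Optional[ServiceOffer]:
        # Keep only the offers that claim to satisfy the client's SLA requirement.
        candidates = [
            offer for offer in self._offers
            if offer.function == function
            and offer.declared_availability >= min_availability
        ]
        # Pick the highest declared figure; None if nobody qualifies.
        return max(candidates, key=lambda o: o.declared_availability, default=None)


broker = ServiceBroker([
    ServiceOffer("quotes-a", "stock-quote", 0.995),
    ServiceOffer("quotes-b", "stock-quote", 0.999),
])

# The client binds at run time to whatever the broker returns.
chosen = broker.find("stock-quote", min_availability=0.999)
```

Note what the sketch never does: it never tests the service, never checks interoperability, never verifies the declared number. The binding rests on a declaration and nothing more.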

And now for Exhibit B: reasons for not complying with the promised SLA, taken from the same article. All remarks in parentheses are the article author's! I just bolded some of the words.

a. Failures

Hardware failure (note that faulty hardware is rare)
Telecommunication failure (for example, a provider accidentally cuts a fiber line)
Software bugs/flaws
Monitoring/measurement system failure

b. Network issues not within direct control of service provider

Backbone peering point issues (for example, UUnet has a router in California go down, denying Internet services to the entire West Coast)
DNS issues not within the direct control of the service provider

c. Denial of Service

Client negligence/willful misconduct
Network floods, hacks, and attacks
Acts of God, war, strikes, unavailability of telecommunications, inability to get supplies or equipment needed for the provision of the SLA

d. Scheduled Maintenance

Hardware upgrades
Software upgrades

While all these failures can happen, they are not considered a violation of the SLA! But hey, all the reasons I marked in bold seem highly familiar to me. I'd say they cover almost all the failures occurring in enterprises on a daily basis.

And even if we concentrate just on those "legal failures", i.e. the failures to which the Web Services SLA proposal is ready to commit, the commitment relates to minutes of downtime over a period of one year. Let's take 4 nines: roughly 50 minutes per year (52.56, to be exact). We can assume that applications binding to a 99.99% service are mission-critical. But wait a (50) minute(s)! Nothing in the SLA guarantees an even distribution of these 50 minutes across the whole year. All 50 minutes of downtime can happen right now, in one go! Will your mission-critical financial application, in charge of millions of dollars in transactions, accept 50 minutes of downtime? I doubt it (but it's legal...)
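The arithmetic is easy to check. Below is a small sketch under two assumptions of mine: a 365-day year, and an SLA that only caps the total downtime per year. The function names are made up for illustration.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, assuming a 365-day year


def downtime_budget_minutes(availability: float) -> float:
    """Minutes of downtime per year that a given availability level allows."""
    return (1.0 - availability) * MINUTES_PER_YEAR


def meets_annual_sla(outages_minutes, availability: float) -> bool:
    """True if the TOTAL downtime fits the yearly budget.

    Only the sum is checked; how the outages are spread over the year
    plays no role, which is exactly the problem.
    """
    return sum(outages_minutes) <= downtime_budget_minutes(availability)


three_nines = downtime_budget_minutes(0.999)   # about 525.6 minutes
four_nines = downtime_budget_minutes(0.9999)   # about 52.56 minutes

# Fifty one-minute blips spread over the year...
spread_out = meets_annual_sla([1] * 50, 0.9999)
# ...and one solid 50-minute blackout are equally "compliant".
one_blackout = meets_annual_sla([50], 0.9999)
```

Both cases come out compliant, although only one of them is survivable for a latency-sensitive financial application. An SLA that wanted to tell them apart would also need a cap on the length of a single outage, which the proposal discussed here does not mention.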

And to be really on the safe side of "SLA violators getting clean in court", the article honestly reveals that a web services SLA is unable to guarantee performance, i.e. QoS: "While SLAs focus on maximum upload availability and guaranteed bandwidths (I didn't understand how, mk), SLAs cannot guarantee consistent response times for latency-sensitive Web service applications."

I'm confused! Here's an SLA that absolves itself of all performance problems, as well as of all the availability issues known to me. I'll go check my outsourcing agreement now; I have probably missed something there too...

I'd sum up by saying that life is tough, and Enterprise IT proves it. Web Services pretend to solve all Enterprise IT issues by providing a lot of WS-* standards. Don't buy it at face value! Be suspicious. I'd borrow what Adam Bosworth, Google's VP of Engineering, said about XQuery and apply it to probably most of the WS-* stack: "Anything that takes four years isn't worth doing".

Short recap of the problem: business processes are becoming complex, disordered systems in their own right. Enterprise IT lacks the infrastructure and the machinery to minimize the noise of chaos inherent in any complex system of that scale. Hence, any commitment to SLA & QoS for business processes is doubtful.

In the meantime, enterprises can only settle for the now-proliferating post-mortem SLA monitoring tools for business processes. Guaranteed QoS & SLA remains a dream.

