Wednesday, August 31, 2005

1 vs. 100891344545564193334812497256, part II (on Enterprise IT, Disorder, QoS & SLA)

This is the second post on Enterprise IT, Disorder, QoS and SLA.

If you thought things were complicated, wait till you encounter Web Services.
Web Services are making things worse.

Enclosed are Exhibits A and B, taken from an IBM article on the subject, named Use SLAs in a Web services context, Part 1: Guarantee your Web service with a SLA.


Exhibit A:

Figure 1. Architecture for a Web service covered by a SLA



What's described here is a process in which an application consults the service broker about registered web services which, besides providing the needed functionality, also meet the client application's SLA requirements. All these negotiations and bindings occur in real-time, i.e. a conditional, dynamic binding of a web service.
But what I see here is the creation of a complex system on-the-fly. No counter-chaos methods applied, no QA, no interoperability checks – nothing. Pure plug and play, based on declarations (btw, the availability figures discussed in the article are all below 99.9% [3 nines, or roughly 500 minutes of outage per year]. In my previous position at Orange, 3 nines were no longer accepted by the business!)
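To make the "conditional dynamic binding" concrete, here's a little sketch of my own (not from the IBM article) of what a broker lookup that filters candidate services by their declared SLA might look like. All names, endpoints and figures are invented:

```python
# Hypothetical sketch of SLA-aware dynamic binding: the client asks a broker
# for all registered implementations of a service and picks one whose
# *declared* SLA meets its requirements. Nothing here verifies the claims.

from dataclasses import dataclass

@dataclass
class ServiceOffer:
    endpoint: str
    declared_availability: float   # e.g. 0.999, as promised in the provider's SLA
    declared_latency_ms: int       # promised average response time

# A stand-in for the service broker / registry (UDDI-like); in reality this
# would be a remote query, not an in-memory list.
REGISTRY = {
    "CreditCheck": [
        ServiceOffer("http://providerA.example/credit", 0.999, 300),
        ServiceOffer("http://providerB.example/credit", 0.995, 120),
    ]
}

def bind(service_name, min_availability, max_latency_ms):
    """Return the first offer whose declared SLA satisfies the client's needs."""
    for offer in REGISTRY.get(service_name, []):
        if (offer.declared_availability >= min_availability
                and offer.declared_latency_ms <= max_latency_ms):
            return offer
    raise LookupError("no registered service meets the requested SLA")

# The binding succeeds or fails purely on declarations -- which is exactly the
# point: no interoperability checks, no QA, just declared figures.
chosen = bind("CreditCheck", min_availability=0.999, max_latency_ms=500)
print("binding to", chosen.endpoint)
```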

And now for Exhibit B: reasons for not complying with the promised SLA, taken from the same article. All remarks inside parentheses are the article author's! I just bolded some of the words.

a. Failures

Hardware failure (note that faulty hardware is rare)
Telecommunication failure (for example, a provider accidentally cuts a fiber line)
Software bugs/flaws
Monitoring/measurement system failure

b. Network issues not within direct control of service provider

Backbone peering point issues (for example, UUnet has a router in California go down, denying Internet services to the entire West Coast)
DNS issues not within the direct control of the service provider

c. Denial of Service

Client negligence/willful misconduct
Network floods, hacks, and attacks
Acts of God, war, strikes, unavailability of telecommunications, inability to get supplies or equipment needed for the provision of the SLA

d. Scheduled Maintenance

Hardware upgrades
Software upgrades
Backups

While all these failures can happen, they are not considered a violation of the SLA! But, hey – all the reasons I marked in bold seem highly familiar to me. I'd say they cover almost all possible failures occurring in Enterprises on a daily basis.

And even if we concentrate just on those "legal failures", i.e. failures to which the Web Services SLA proposal is ready to commit, the commitment relates to minutes of downtime across a period of one year. Let's take 4 nines – roughly 50 minutes per year. We can assume that applications binding to a 99.99% service are mission-critical. But wait a (50) minute(s)! Nothing in the SLA guarantees an even distribution of these 50 minutes across the whole year. These 50 minutes of downtime can happen right now, for the next 50 minutes! Will your mission-critical financial application, in charge of millions of dollars in transactions, accept 50 minutes of downtime? I doubt it (but it's legal...)
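For reference, here's the back-of-the-envelope arithmetic behind the "nines" (a quick sketch of my own, rounding to whole minutes):

```python
# Yearly downtime budget implied by an availability figure ("the nines").
# Note that the SLA says nothing about how these minutes are distributed.

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600

for availability in (0.999, 0.9999, 0.99999):
    downtime = MINUTES_PER_YEAR * (1 - availability)
    print(f"{availability:.5f} -> ~{downtime:.0f} minutes of allowed downtime per year")

# 0.99900 -> ~526 minutes  (the "3 nines" figures discussed in the article)
# 0.99990 -> ~53 minutes   (the "4 nines" example above)
# 0.99999 -> ~5 minutes
```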

And to be really on the safe side of "SLA violators getting clean in court", the article honestly reveals the Web Services SLA's inability to guarantee performance, i.e. QoS: "While SLAs focus on maximum upload availability and guaranteed bandwidths (I didn't understand how, mk), SLAs cannot guarantee consistent response times for latency-sensitive Web service applications."

I'm confused! Here's an SLA agreement that clears itself of all performance problems, as well as of all availability issues known to me. I'll go now to check my outsourcing agreement; I've probably missed something there too...

I'd sum up by saying that life is tough, and Enterprise IT proves it. Web Services pretend to solve all Enterprise IT issues by providing a lot of WS-* standards. Don't buy it at face value! Be suspicious. I'd borrow what Adam Bosworth, Google's VP of Engineering, said about XQuery and apply it to probably most of the WS-stack: "Anything that takes four years isn't worth doing".

Short recap of the problem: business processes are becoming complex, disordered systems in their own right. Enterprise IT lacks the infrastructures and the machines needed to minimize the noise of chaos inherent to any complex system of that scale. Hence, any commitment to Business Process SLA & QoS is doubtful.

In the meantime, Enterprises can only settle for the now-proliferating post-mortem SLA monitoring tools for business processes. Guaranteed QoS & SLA is still to be dreamed of.


Tuesday, August 30, 2005

1 vs. 100891344545564193334812497256, part I (on Enterprise IT, disorder, QoS & SLA)

The numbers in the title of this post represent combinations of order and disorder in a one-hundred-element system. When all hundred elements are up, the system is in order. If one element out of the hundred is down, the system is in a disordered state. Of course, more than one element can be down at any one time.

A question: how many combinations of elements up/down in a 100-element system represent an ordered state, and how many a disordered state? (A hint: the answers are presented in the title…)

Sadly enough, disorder outnumbers order, and that is an understatement.
(Actually, the number of ordered states is two, not one. Mathematically speaking, if all 100 elements are down, the system is said to be in order. Business-wise, though, the fact that the system rests in peace cannot help in generating revenues – so I took 1 off the two possibilities…).
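A small aside of my own, for readers who like to check the arithmetic: a couple of lines of Python give both the full count of disordered states (every up/down combination minus the two ordered ones) and the 30-digit number in the title, which matches the count of states with exactly half the elements down. Either way, disorder wins by an absurd margin.

```python
import math

total_states = 2 ** 100                 # every up/down combination of 100 elements
ordered      = 2                        # all up, or all down
half_down    = math.comb(100, 50)       # states with exactly 50 elements down

print(total_states - ordered)  # 1267650600228229401496703205374 disordered states
print(half_down)               # 100891344545564193334812497256  (the number in the title)
```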

Now, the last thing you should do is panic, as all multi-component systems exist with an inherent chaotic state inside ("Chaos Inside" could be a nice slogan for large IT shops). And as Benoit Mandelbrot, of fractals fame, nicely suggested, there's no point in fighting against the chaotic nature of complex systems. One has to learn how to live with it, minimizing the noise of chaos to the lowest possible level. It's like good and evil – we have to live with them both, and do our best.

The point I'd like to argue, after this intro, is that nowadays dynamic Enterprise ITs are by nature un-QoS-able and un-SLA-able. If there's an Enterprise in which distributed business processes demonstrate 99.99% or 99.999% availability, along with excellent performance, then I'd say it's the music of chance (unless a special, proprietary "counter-chaos infrastructure" has been built for a very specific, limited and controlled set of business processes).

I remember trying to explain to my Engineering-oriented executives why IT is so unstable in comparison with the network elements they were familiar with. "Inside the switch there is also software, and it's never down", they claimed.

Here's what I told them:

Hardware vendors differ from Enterprise IT shops in two major dimensions: testing & changes. Hardware vendors perform extensive product testing as well as interoperability testing (i.e. the certification matrix). Early this year I visited the EMC headquarters in Boston and had the opportunity to see their QA facilities. They invest 200 million US$ a year, so I've been told, just on interoperability testing, and I would double that figure for the whole quality chain. That's a good amount of money invested every year to minimize the chaotic and disordered state of their products.

What hardware vendors and ISVs do in their testing is set up a fixed number of controlled, static environments, in which all elements and their configurations are well known and supervised. The tests check each of the system's sub-components, their integration, interoperability and performance.
It is the static, walled-garden nature of the configurations that enables the issuance of certifications.

This pattern of static, well-known configurations hasn't changed in years. It was shared, some years ago, by Enterprise IT shops as well – but not any longer.

The internet enabled a proliferation of customer touch-points and much greater variation in the company's portfolio of products & services. Consequently, a holistic view of the customer has become a necessity. If previously (real) customers were "forwarded" to local-application business functionality, they are now forwarded to cross-application business processes that enable (digital) customer homogeneity (as well as that of other business entities). These cross-application business processes are the complex systems of today's Enterprises – not (just) the business applications that support them.

Having said that, the next intuitive question one should ask is "what could possibly go wrong in the underlying components of these business processes?" and the answer is, of course – everything.

Some of the reasons for things to go wrong with business processes are:

a. Business processes are not formally captured. They are automated, true, but their formal description is kept mostly in Word documents or in people's heads. Try getting, in your own enterprise, a report with an up-to-date inventory of the Enterprise's business processes. Things that do not formally exist are very hard to manage and control.

b. Business processes are neither formally mapped to their sub-components nor to each other. Without formal mapping there is hardly a way to tell the impact of a change in a data center element on the complex of business processes (see the sketch after this list).

c. Even if such formal mapping were in place, there is no way that I'm aware of to assess the performance (QoS) impact of a data center element change on the business processes related to it. When a change is introduced to a business application (SAP, Siebel, Amdocs etc.), load testing usually measures the impact of the change on the direct users of the business application (i.e. clients, batch processes etc.). Load testing does not, and probably cannot, easily assess the impact of a functional change on external, cross-application business processes that access the business application's APIs.

d. Unlike a hardware system or an ISV system, where changes by the Enterprise are highly constrained and limited to one or two places, changes to the underlying components of a business process can occur at all levels and in many different ways. As a single business process incorporates business applications, servers, databases, storage devices, network cards & routers, storage directors etc., it actually becomes a mini-IT. Cross-component associations (aka links) can be added, replaced, removed. Changes to the components themselves happen regularly.

e. QA is not, and will never be, as optimal as it is in hardware or ISV companies. Time To Market pressure makes a "green pass" to production (i.e. no QA) more popular than ever; budgetary pressure on cutting IT costs will always translate into cutting testing rather than cutting functionality, as testing is perceived as a kind of insurance, while functionality is the make-or-break of the company. Also, QA is an internal IT affair; functionality is an external IT affair. Again, political forces influence IT reality.
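To illustrate what the formal mapping of item b would buy us, here's a minimal sketch of my own (every name in it is invented) of the kind of impact analysis such a mapping enables:

```python
# A toy business-process-to-component map and the impact analysis it enables.
# Without such a (constantly maintained) mapping, the question "which business
# processes are hit if this disk array goes down?" has no reliable answer.

DEPENDS_ON = {
    "ActivateSubscriber": ["CRM_App", "Billing_App", "Provisioning_App"],
    "MonthlyInvoicing":   ["Billing_App", "DWH_App"],
    "CRM_App":            ["CRM_DB", "AppServer_1"],
    "Billing_App":        ["Billing_DB", "AppServer_2"],
    "Provisioning_App":   ["AppServer_2"],
    "DWH_App":            ["DWH_DB"],
    "Billing_DB":         ["DiskArray_7"],
    "DWH_DB":             ["DiskArray_7"],
    "CRM_DB":             ["DiskArray_3"],
}

def impacted_processes(failed_component):
    """Return the business processes whose dependency chain reaches the failed element."""
    def reaches(node):
        children = DEPENDS_ON.get(node, [])
        return failed_component in children or any(reaches(c) for c in children)
    return [p for p in ("ActivateSubscriber", "MonthlyInvoicing") if reaches(p)]

print(impacted_processes("DiskArray_7"))  # ['ActivateSubscriber', 'MonthlyInvoicing']
```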

So business processes are the new focus of Enterprises, and yet they are as fragile as one could imagine.

This alone demonstrates the problems in guaranteeing SLA or in committing to QoS. I would like, nevertheless, to elaborate a bit further on Web Services, which are the natural building blocks of today's business processes. It is also in the Web Services sphere that vendors supporting/promoting the WS standards specifically mention SLA. We'll see, in the post to follow, whether web services can bring some salvation to the problems raised here.


Monday, August 29, 2005

The Rise of The Machines: a new approach to Enterprise Architecture

There's already a considerable number of Enterprise Architecture Frameworks around, accompanied by an even greater number of Enterprise Architecture definitions. Here's one of them: "EA is a tool to find potential savings hidden in organizations". And yet more definitions and frameworks are conceived almost every quarter or so. Last month, I was asked if I would be ready to summarize a brand-new Framework from a leading analyst firm.

It reminded me of a similar endeavor I engaged in some five years ago, trying to get hold of TOGAF and other federal government agencies' books, schemes and diagrams.

There's something in these guides which does not appeal to me. Probably it's their organized, methodological nature, and the illusion they foster that once you follow their formulas, draw all the diagrams and establish all the proposed committees and procedures – life's going to be better. It won't.

Another thing I've never got along with is the deliverables of these endeavors, which usually sum up to a bunch of reference models and a massive amount of trees turned into paper. The Technical Reference Model is a great example I always looked at in amazement and admiration; another, more popular one, is the Systems Architecture Reference, which maps all systems, their inter-relations etc. Most of these references are the outcome of a mapping activity, and they simply do not fit the dynamic reality we live in. If Enterprise IT nowadays is organically similar to the internet, you can understand the futility of a one-time mapping activity. To capture Enterprise IT blueprints, there could be two options: to have Google bots constantly crawling the Enterprise, or to build a new kind of architecture for the software development and IT operation life-cycle. A unified life-cycle architecture!

So I am not a great fan of architecture committees, guidelines, procedures and mappings – these are all post-mortem, human attempts to take a snapshot of an ever-changing reality. The only thing I do believe in (I'm extreme, I'm aware of that…) is the Machines.

I'd argue that humans are incapable of coping with today's Enterprise IT complexity by using guides and procedures. They can conceive and build, though, the Machines that will be able to manage things around.

I'd like, therefore, to suggest the following definition for Enterprise Architecture:

Enterprise Architecture is an infrastructure and a set of Machines constructed in order to manage a chaotic, dynamic, unpredictable, complex, organic, error-prone, frustrating Enterprise IT, which has to support an ever-increasing, dynamic portfolio of products and services, through constant "ASAP, Now, Right-Away" modifications of business processes.

It's with this kind of Enterprise Architecture that Companies would (hopefully) be able to achieve competitive advantage, increased revenues and operational efficiency. I hope I didn't miss any buzzword.

Do note: procedures and policies should be part of the infrastructures! No papers please! Do not expect humans to read Enterprise Architecture papers, nor to follow them (I hope, though, that they are reading this post :) ).

In my next posts I promise to elaborate.



Fresh Enterprise Architecture documents
waiting to be circulated among IT people



Worried IT people taking the law into their hands


Friday, August 26, 2005

The socio-politics of SOA

For years I have earned my living creating P2P solutions for integration problems. P2P stands for both point-to-point and peer-to-peer. The technical architecture used back then was point-to-point; the responsibility for the solution was peer-to-peer.

Point-to-point integration has a strong tactical essence. Two programmers from two different departments find themselves in an ad-hoc team that has to realize, ASAP (it's always ASAP), a business process that spans the two departments' information systems. P#1 and P#2 figure out what has to be done: P#1 has to code 1-2 functions and P#2 has to introduce into his/her application some calls to those newly added functions of P#1's. Probably the DBA gets involved too, as some data replication might be needed. Once they're through, QA goes in.
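As a caricature of the pattern (hypothetical code, obviously not any real system): P#1 exposes a function straight out of his department's application, and P#2 calls it directly – no contract, no registry, just two programmers who know each other:

```python
# --- P#1's side (department A's application) ------------------------------
def get_customer_balance(customer_id):
    """Ad-hoc function P#1 added for this one integration; department B calls it."""
    # In reality this would read department A's database directly.
    return {"C-1001": 42.50}.get(customer_id, 0.0)

# --- P#2's side (department B's application) ------------------------------
def prepare_welcome_letter(customer_id):
    # P#2 calls P#1's code directly -- a point-to-point link that nobody
    # outside these two programmers knows exists.
    balance = get_customer_balance(customer_id)
    return f"Dear customer {customer_id}, your current balance is {balance}."

print(prepare_welcome_letter("C-1001"))
```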

The QA team has many questions, and they might encounter many issues during their testing. But P#1 & P#2 are always around; they assume responsibility for their bilateral ad-hoc solution and are always helping when needed. When the code goes into production, our two good Ps are still there as 3rd-level support. In general, production problems are not politically beneficial. P#1 and P#2, being constantly aware of their public image, do the best they can to support their solution. That's the peer-to-peer, or programmer-to-programmer, nature of the point-to-point architecture.

P2P architecture is an Enterprise Architecture nightmare. Links of all kinds are created in an ad-hoc and uncontrolled fashion among pieces of code. These links grow organically and prosper, just like URLs, without central control. Hubs of links are created around major functions and eventually Enterprise IT applications turn into a single organic, complex system, as A.L. Barabasi interestingly describes in his book LINKED: The New Science of Networks. And there are no bots to perform a post-mortem link analysis like there are over the internet, so no one actually knows what is linked to what.

Still, in this mess, each one is aware of his/her 1st degree of links (just like in LinkedIn) – and that has its moral and business merits as previously described.

Service Oriented Architecture helps a lot in the removal of disordered links and the reestablishment of Enterprise Order. (Are P2P links the equivalent of the disordered del.icio.us tags, while Enterprise SOA represents yet another ontological desire to have one decisive, in-control view of the world (Enterprise)? Are we seeing again the same patterns we've discussed in previous posts? I'd say we are. I'd say that management and control in a distributed world – and IT has become highly distributed in the past 5 years – will always face the same issues of chaos vs. order.)

But one thing no one tells you about SOA is that it removes the chaotic links along with the simple, intuitive, human (bottom-up) responsibility. Programmers are no longer peers. The direct and human relations that used to exist between two point-to-pointers are gone. Once a service gets published in the services registry, anyone with the right permissions can get a description of the service and invoke it in a standard, well-known manner.

But see what is happening to our faithful, passionate and responsible P#1 & P#2. Previously, P#1 knew whoever touched his code; he was, most probably, part of the project as well. He was the one who explained, trained and gave permissions to use his "precious". But now the SOA wall is erected between P#1 and the rest of the world. Under an Enterprise SOA, P#1 becomes insignificant, losing his political basis. A service that is actually hiding his code can now be used without any intervention on his part, as the registry holds whatever information the world needs in order to utilize the service. And once P#1 and P#2 understand their new political situation, they automatically become careless. As if the service mediates not just the implementation behind its WSDL, but also the humans responsible for it.

Suddenly, testing and production support of composite applications (i.e. applications that orchestrate existing [web] services) become highly difficult. The invisibility introduced by the service façade makes the QA team less capable of pin-pointing the exact code that's causing the problem. When they discover a bug, P#1 and P#2 are really hard to track down. Sometimes they discover that P#1 and P#2 are no longer in the company. The same goes for production issues.

I am a great believer in SOA; I have designed an SOA Hub that serves 250 applications with hundreds of services and millions of service invocations a day. I am telling my stories so that Enterprise SOA acceptance will become smoother and easier. I believe the difficult times SOA is facing in Enterprises are partially because of all those things vendors are not telling their customers.

What I've described here is something we suffered from greatly in the first year of SOA implementation. It took us some time to figure out why the QA teams were complaining about SOA, and why operation teams were highly dissatisfied with their problem-tracking tools. But once we understood that P2P is radically different from SOA – not just in its technical aspects, but also in its socio-political aspects – we developed special training packages that touched specifically on these issues, and we provided a new set of life-cycle administration tools that incorporated the knowledge we would normally ask P#1 & P#2 for, thus allowing QA and operation teams to accomplish their own tasks without chasing down our good old Ps.
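To give a feel for the kind of knowledge those tools had to capture, here's a sketch with invented field names (not our actual schema): every published service carries, alongside its WSDL, the human and operational metadata that used to live only in P#1's head:

```python
# A sketch of the per-service metadata a life-cycle tool might keep, so that QA
# and operations can answer "whose code is behind this service, and where do I
# look when it misbehaves?" without hunting for the original programmer.
# Field names and values are illustrative only.

SERVICE_CATALOG = {
    "CustomerBalanceService": {
        "wsdl":            "http://esb.internal.example/CustomerBalance?wsdl",
        "owning_team":     "Billing Development",
        "current_owner":   "p1@company.example",     # updated when people move on
        "backing_app":     "Billing_App v7.2",
        "log_location":    "/var/log/esb/customer_balance/",
        "known_consumers": ["CRM_App", "SelfCarePortal"],
        "test_fixture":    "qa/fixtures/customer_balance.xml",
    },
}

def who_do_i_call(service_name):
    entry = SERVICE_CATALOG[service_name]
    return f"{entry['owning_team']} ({entry['current_owner']}), logs at {entry['log_location']}"

print(who_do_i_call("CustomerBalanceService"))
```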

3 Comments

By Blogger gonen, at 8:51 AM  

hi muli . Nice blog . I will make time to read it , I'm sure i will be able to learn a lot .

amihay gonen

By Anonymous Andrew S. Townley, at 10:00 PM  

Hi Muli,

I've been reading several of your posts after stumbling across your blog (from where, I haven't a clue at this stage). Anyway, I'm also involved in an SOA project at the moment, so I know where you're coming from. Ours isn't as big as yours, but the potential's there.

What I wanted to point out is you're dead right about the social dynamics of SOA not being the same as they were in the past. However, I think it's not too surprising when you think about the driving force behind SOA: the service.

Services cannot be written like "internal integration" applications anymore. In reality, they're "shrinkwrapped software" (see Joel on Software's Five Worlds essay: http://www.joelonsoftware.com/articles/FiveWorlds.html). This implies exactly the sort of changes you're talking about, but most people haven't realized this yet. However, it affects us (SI) and it affects organizations building and deploying services internally into an SOA.

What I think we'll find is that eventually, more and more people are going to realize services are shrinkwrapped, not add-hoc. When this happens, everyone's life is going to be better, but it'll mean big changes to the way P1 & P2 do their jobs--exactly as you've described in your article.

Anyway, I've enjoyed reading your blog.

Cheers,

ast

By Blogger Muli Koppel, at 4:54 PM  

Hi Andrew

Thanks for your comment

cheers
muli


Thursday, August 25, 2005

On the Utility of the GRID, part II

In the previous post on this matter, I defined Utility Computing not as a technology but rather as a "goes-without-saying", trivial, merciless business requirement for today's real-time, adaptive, partially or entirely virtual Enterprises. I defined GRID as the technical solution for the Utility Computing business requirement. Also, in a very simplistic yet deliberate manner, I described the essence of the GRID as an automation layer for a scaling-out procedure. Scale-out architecture preceded the GRID by years; it's the automation of the scaling-out that's new.

But don't let me trick you so easily. Our "simple" automation procedure, if well realized, brings the best IT Management system an Enterprise could possibly dream of. And that's because this simple automation process can work ONLY when the following exist:

1. An up-to-the-second updated inventory of all available nodes. In order to perform an automatic scale-out, our simple process must know which nodes are available and which are already in use.

2. This inventory cannot be just a list of available nodes; it MUST be a CMDB (Configuration Management DataBase). Our simple process cannot scale out to just any available node on the inventory's free list. The candidate node must meet the business application's system configuration requirements! That said, configuration information must be meticulously managed so that no erroneous scale-out happens.

3. Our simple process MUST be able to install, on-the-fly, whatever software is required for the operation of the business application; it must also connect the new node to the relevant disks (SAN/NAS/JBODs), network & storage switches, load balancers and so on.

4. In order to correctly perform the above, our simple process must be aware of all relevant Enterprise policies, as well as vendors' restrictions. (Looks like our simple process is actually an automation of all the sys admins' provisioning checklists…)

Hang on a second: what invoked our simple process in the 1st place?

5. Our simple process MUST be tightly integrated with a monitoring system that invokes it when the business application is in need of a scale-out.

6. When a fault occurs in a sub-component of the Business Application (say, a disk array), there's no way the monitoring system will be able to link the fault to the business application – unless it has access to the Enterprise data center world topology and the capability to perform impact analysis.

And so on.

What has been furtively described here is the dream of any Enterprise: complete life-cycle data center management, where data center objects (hardware & software), policies and users are all linked together to yield Utility Computing IT infrastructures.
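To make points 1-6 a bit more tangible, here's a minimal sketch of the checks the "simple" automation has to run before it dares touch a node. Everything in it is a toy stand-in of my own – the CMDB is a dict, the policy is a single rule, provisioning is a print – and the monitoring trigger of points 5-6 is left out:

```python
# What the "simple" scale-out automation has to verify before acting. The point
# is the order of the checks, not the implementation.

CMDB = {
    "node-17": {"state": "free",   "cpu": 2, "ram_gb": 8,  "os": "linux"},
    "node-18": {"state": "free",   "cpu": 2, "ram_gb": 16, "os": "linux"},
    "node-19": {"state": "in_use", "cpu": 4, "ram_gb": 32, "os": "linux"},
}

APP = {"name": "BillingRater", "min_ram_gb": 16, "os": "linux",
       "software_stack": ["jvm", "rating-engine"], "allowed_zones": ["node-18", "node-19"]}

def scale_out(app, cmdb):
    # 1 + 2: a free node whose recorded configuration satisfies the application
    candidates = [n for n, c in cmdb.items()
                  if c["state"] == "free"
                  and c["ram_gb"] >= app["min_ram_gb"]
                  and c["os"] == app["os"]]
    # 4: enterprise / vendor placement policy (a single toy rule here)
    candidates = [n for n in candidates if n in app["allowed_zones"]]
    if not candidates:
        raise RuntimeError("no eligible node -- scale-out refused")

    node = candidates[0]
    # 3: install the stack and wire the node to storage / network (simulated)
    print(f"installing {app['software_stack']} on {node} and connecting it to storage")
    # keep the CMDB truthful, otherwise the next decision is based on garbage
    cmdb[node]["state"] = "in_use"
    return node

print("scaled out to", scale_out(APP, CMDB))
```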

This is the basis for the removal of all reasons for downtime except application bugs. As all provisioning processes are automated by the GRID layer, misconfigurations (reason 3) can no longer happen, and users can no longer abuse security holes in systems and run unauthorized programs (reason 4).

If you're interested in all the prerequisites for an Enterprise Grid solution, do have a look at this excellent Enterprise Grid Reference Architecture, by the Enterprise Grid Alliance.



Wednesday, August 24, 2005

OntHEology

There is a theological aspect to ontologies that I'd like to share with you.
I call it ontheology.

I remember the disappointment I had after attending, for the 1st time, a W3C lecture on the semantic web and ontologies. I expected to get a clear explanation of how a MACHINE can understand the meaning of words by using ontologies. I thought I'd see something like a dictionary entry surrounded by all sorts of meta-tags that make it machine-comprehensible. After all, that's what Tim Berners-Lee's vision of the semantic web is all about.

But eventually what we saw was a demonstration of syllogism triplets, i.e. if sentence A and sentence B, then sentence C.

But how can the MACHINES understand the meaning of the words in sentences A, B or C?

Puzzled and frustrated, I went to my car, accompanied by Mr. Grossbard. We both tried to figure out what we had missed. Then we fell into an interesting conversation about the nature of meaning, when all of a sudden I understood how foolish we had been.

If you try to explain to someone the meaning of any word, how'd you do that? By using other words! You'd then use yet more words to explain the words used to explain the original word. And so it goes, on and on and on – forever.

Probably not forever; probably, if you go up high enough in the ontology tree, you'll encounter the essence of all Words. Ontologists call this original word Thing, or Root. Have a look at Protégé 2000 – an open-source ontology editor. The first class in the ontology is always ":Thing". All other classes (i.e. Words) are derived from ":Thing".
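A toy illustration of "going up the tree" (mine, not Protégé's actual format):

```python
# A toy class hierarchy of the kind an ontology editor shows: every class has a
# parent, and walking upward from any class always ends at the root, ":Thing".

PARENT = {
    "Poodle":      "Dog",
    "Dog":         "Mammal",
    "Mammal":      "Animal",
    "Animal":      "LivingThing",
    "LivingThing": ":Thing",
}

def path_to_root(cls):
    path = [cls]
    while path[-1] != ":Thing":
        path.append(PARENT[path[-1]])
    return path

print(" -> ".join(path_to_root("Poodle")))
# Poodle -> Dog -> Mammal -> Animal -> LivingThing -> :Thing
```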

If we merge all the world's ontologies to create the ultimate knowledge-base of human kind, which word will be :Thing (or the closest to :Thing)?

"In the beginning was the Word, and the Word was with God, and the Word was God", book of John, 1:1.

So :Thing is God.

(that's a syllogism à la Shirky :) ).

And from here, an interesting thought:

The 1st linguistic act of Adam in the Garden of Eden was to name the beasts (Genesis 2:19). It is said in the Bible that the names given by Adam were the animals' souls. This was well before Babel, where God introduced arbitrariness between a Word (signifier) and its Subject (signified). In Eden, Words were the Subject. After Eden, Subjects were gone for good and we became subordinated to a world of words pointing at each other.

In an allusion to my earlier post on Web 2.0 and Ontologies, I'd say that ontologists aspire to the recreation of Eden, or the re-establishment of "Word Order". They assume reality is objective, i.e. that there's an animal out there and now it's time to name it – like Adam.

Clay Shirky, meanwhile, represents the post-Babylonian chaos, the negation of an objective reality and the acceptance of a life in a world of Words (or tags…). Still, there's a very interesting twist to Clay Shirky's view (as I interpret it): contrary to natural intuition, the post-Babylonian era is not conceived as a disaster or as a lesser-degree reality. On the contrary – it seems that man has finally found his real home and his real love: playing with words.

1 Comment

By Blogger Udi h Bauman, at 12:21 AM  

Brilliant!
If God is the objective ontology author, & evolution the decentralized subjective mechanism of social interactions, that uses genetic words & tiny decentralized information algorithms, then our play of words probably isn't just a game, but a part of the mechanism of words/memes/tags/ideas that aggregate the low-level subjective decentralized ontologies into a higher-level conscious intelligence, moving steadily toward higher order & in my very subjective humble view, also toward morality.

But of course, let's stick to the ground with the beautiful SOA & Grid, our current tasks in the ants trail.


Tuesday, August 23, 2005

On the Utility of the GRID

There's considerable dissatisfaction and confusion around the actual, practical meaning of the GRID and Utility Computing concepts. Are Grid and Utility interchangeable, partially overlapping or different? Add to that soup the on-demand concept, and you're beaten up.
I was recently contracted by a Utility-computing startup and used the word GRID in some of my presentations. The CEO insisted on replacing all the word's occurrences with Utility, claiming the confusion around GRID is bad for business. I assume that a GRID-computing startup would have had the same reaction towards the word Utility.

It's time, then, for a little definitions exercise.

Most of today's Enterprises have a virtual existence besides their physical one. Some of today's Enterprises are represented solely in the virtual sphere. Changes in the virtual sphere are not only occurring every passing minute – they are also identified, classified and digested at light speed. In order to keep up, Enterprises have had to reengineer almost every aspect of their existence: customer touch-points, business processes, application packages, web services, real-time event processing, business QoS and so on. Enterprises are in the process of becoming real-time, adaptive, organic systems.

Under this paradigm, IT is the business. More precisely: IT functionality is the business. IT infrastructure is nothing but plumbing (this hasn't changed from the past; the current change is in the business value of IT functionality). The last thing CEOs want to hear is that they are losing money because of plumbing problems. CEOs can understand Billing functionality limitations or difficulties in implementing a complex churn-prevention business process, but they wouldn't understand or accept plumbing issues. Put differently, the trivial requirement in the eyes of the CEO is that IT infrastructures function just like any other infrastructure on the premises (electricity, water, air conditioning).

One possible definition for Utility Computing is therefore: "[almost always] uninterrupted supply of computing resources".

Do note: we have just defined Utility Computing as a business requirement and not as a technical solution! IT infrastructures in today's Enterprises must function like any other utility infrastructure.

I assume we all agree on this requirement. Let's give it a nice technical solution…

Following are common reasons for downtime. Knowing these reasons will allow us to build an architecture that would prevent them from happening.

Common reasons for downtime:

1. Hardware failures
2. Computing resources shortage in peak utilizations
3. Misconfiguration
4. Humans and Security
5. Application bugs

While there are many reasons for downtime, the most notorious are hardware failures and sudden shortages of computing resources. Actually, these two are responsible for less than 20% of all outages, while misconfiguration (missing or incompatible elements, wrong setup, forgotten monitoring/backup agents etc.), humans & security (i.e. unauthorized access to critical resources) and application bugs are, as a matter of fact, the primary causes of most downtime (these figures are based on experience and are backed up by Forrester Research reports).

So why are hardware failures and sudden lack of resources burned into our consciousness as the prime, trivial suspects? Because infrastructure problems are relatively easy to solve [add more hardware = a known, brainless solution to an unknown problem; correct an application bug = an unknown solution to an unknown problem] and there's a concrete scapegoat to hang (a small group of sys admins vs. hundreds of programmers).

Providing solutions for downtime reasons 1 and 2 will not, by itself, yield the desired utility IT infrastructures. Still, a complete solution is built in phases; removing reasons 1 & 2 from the stack is progress. Which brings us to the definition of GRID.

GRID is a technical architecture providing a solution for downtime reasons 1 and 2. It replaces existing technical solutions, which are mainly failover and manual scale-up. GRID comes in two flavors: the SSI flavor and the scale-out one.

Let's elaborate.

Most enterprise business applications were architected to scale up, i.e. to run on a single physical SMP server. If more computing power is needed, then extra computing power is added to the server, until the server reaches its maximum capacity. When this happens, the application can be migrated to a higher-capacity server. These alternatives are mostly manual, with the exception of mainframe-like UNIX boxes (Sun F15, as an example) that support dynamic resource reallocation.

Coping with potential hardware failures is done by employing failover techniques to a standby node (whether active or passive). The problem with failover is that it's time-consuming, specifically for very large database servers (it's the mounting of the file systems on the standby node that takes time, as well as the automatic recovery of the database, etc.). A 2-4 TB database server can wait a good 10 minutes or more for a failover to complete.

Manual reallocation of resources and the time it takes to perform a failover are clearly inadequate for the utility business requirement.

While clusters have no alternative under the scale-up architecture, coping with an unexpected peak had a simple solution: over-capacity planning. If X computing power is what's needed for an application, then X+20%++ was actually purchased and configured. This solution is expensive and not cost-effective. Moreover, experience shows that the application ends up over-utilizing the extra capacity (usually as a result of a bug introduced in one of the endless application changes reflecting the dynamics of the real-time business).

A more suitable application architecture for today's requirements is the scale-out one. Under this architecture, the different modules of the application can have multiple instances running concurrently on different nodes, dividing the total workload among them. Theoretically, an application can span as many nodes as needed. No failover clusters are required: if one instance fails, another running instance replaces it. And a direct economic advantage: commodity hardware can be used as a pool of 2-way servers, reaching computing power similar to that of the 8-, 16- or 32-way servers of the old scale-up architecture.

A scale-out GRID provides the necessary automation for the provisioning of new application instances on an ad-hoc allocated server.
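Here's a sketch of my own of what that automation boils down to (thresholds, node names and the "load" below are all invented): a monitoring loop watches utilization across the application's instances and takes a node from the free pool when the average runs hot:

```python
import random

# Toy model of the automation a GRID layer adds on top of a scale-out
# application: watch the instances, and when the pool runs hot, provision one
# more instance on a node taken ad-hoc from the free pool.

free_nodes = ["node-21", "node-22", "node-23"]
instances  = {"node-20": 0.0}   # node -> current utilization (0.0 .. 1.0)

SCALE_OUT_THRESHOLD = 0.75      # average utilization that triggers a new instance

def provision_instance():
    node = free_nodes.pop(0)            # ad-hoc allocation from the pool
    instances[node] = 0.0
    print(f"scale-out: started a new application instance on {node}")

for tick in range(10):
    # pretend the monitoring system reports fresh utilization figures
    for node in instances:
        instances[node] = random.uniform(0.5, 1.0)
    average = sum(instances.values()) / len(instances)
    if average > SCALE_OUT_THRESHOLD and free_nodes:
        provision_instance()
```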

An SSI GRID can be viewed as an interim, backward-compatible GRID solution for scale-up applications. The application "believes" that it runs on a single server, while actually it runs on multiple servers. Put differently, SSI is a resource virtualization solution.

That's it for today: we have defined Utility Computing as a business requirement and GRID computing as the automation of resource provisioning in a scale-out architecture. Of course, it's too short a definition, so what I'd like to discuss next is how this same automation of the scale-out architecture solves downtime reasons 3 (misconfiguration) and 4 (human errors & security), providing therefore the necessary elements for a utility, nonstop IT infrastructure.

The sequel to this post is On the Utility of the GRID, part II.

1 Comment

By Anonymous David Faibish, at 9:04 PM  

Thanks for articulating the "downtime" definition. It's helpful with the challenges I have currently in my company.


Monday, August 22, 2005

Comments on "PHP, Perl and Python on the wane?"

The article PHP, Perl and Python on the wane? brings indications of a decline in Enterprise utilization of the 3 Ps of lamP: PHP, Python, Perl. Still, most of the article discusses PHP and not Perl/Python. The focus on PHP is evident, as Oracle and IBM have invested some millions in PHP just recently.

A note on Perl & Python: these two scripting languages are sophisticated; they have the power and capabilities of modern 3rd-gen languages like Java and C (though many of these capabilities are implemented underneath in C – but who cares). They are "open source" languages with very strong communities – meaning they are the outcome of a highly collaborative effort, fitting the new model of the internet (yes, Web 2.0 again). Their only disadvantage is that they are "scripting languages", which makes them neither here nor there. Not here – they require a higher level of investment and have a longer learning curve than bash or csh; not there – being considered scripting languages, they don't have the seriousness and the "establishment" backing of Java, C and the other "official" languages.

Personally, I used to program billing and rating, as well as many infra programs, with Perl. What I'm hearing, though, is that Python is ten times better, faster and cooler than Perl. The ability to have a fast working prototype with Python is – so I'm told – amazing. Also, many startups are now focusing on providing Python/Perl development and run-time platforms. ActiveGrid is a relatively well-known one, and they just launched their grid LAMP application server with Python as the 1st supported programming language (Perl is next, Java – last – if I'm not wrong).

If the Web 2.0 paradigm catches on, I don't see how Python and Perl could possibly be left aside. On the contrary – these languages are the emblems of this era of community and collaboration. It will take some time for Enterprises to digest the revolution that is happening. Nowadays, Enterprises (and CIOs…) are still looking in apprehension at this new era (with the open-source movement being its socio-economic, consumer-driven flagship), which undermines the old principles of progress & assurance: we let the BIGS advance the technology and we buy (quality?) products from them, as long as they can provide job security ("No CIO was ever fired for buying something from IBM…"). In the new world, the best products are recognized, as well as produced, by the community, i.e. by the techies and the geeks; traditional sales & marketing efforts are less and less the power behind progress and recognition.

Let us return to our sheep: Java, as well as the rest of the standard-bodies languages, has a lesser chance of surviving than Perl & Python. There is a lot of "image" (what will others say if…) in the usage of Java and J2EE. But images are trendy and prone to change (indeed, just like Python and Perl, but still, it's their collaborative/community nature that counts here…).

2 Comments

By Anonymous Eyal Milrad, at 1:02 AM  

Though I totally agree that Python and PHP are becoming a dominant factor for web programming, mainly because of its supporting communities, I don't believe that java is prone to early death.
A good example is the story behind the birth of J2EE 3.0.
Anyone who worked with J2EE 2.0 (Or to be more precise, anyone who had to use the EJB 3.0 Entity Beans) would probably admit that it was a disaster. Actually, it was so awful that it raised real doubts about the chances that EJB can be used for writing complicated server applications as JSP did with web servers.
However, apparently, java community proved it has strong presence.
A community project named Hibernate made the uses of persistent objects as simple as using POJOs (Plain Old Java Objects), programmers could write fully server applications with WebLogic, WebSphere and JBOSS (another java open-source success story) with just plugging the Hibernate libs and spare the hassle of using EJBs.
In matter of fact, it was so successful, that J2EE 3.0 was totally revised (A Committee work), to reflect the new ideas presented in Hibernate.
This project as others (Spring, Tomcat, JBoss and many more) demonstrated the real power of Java and J2EE; communities can and do creates out-of-the-box solutions, while JCP who's supported by BEA, Sun, IBM, Borland and hundreds more members, design the next generation of J2EE by picking up the best ideas (though sometimes not for the user's best interest).
Java has great future, as well as PHP and Python, and for the same reasons, all three
are the result of the Open Source movement, and therefore are backed by both communities and industry (as opposed to Free Software Movement that sometimes is being confused with Open Source, but that's for different article).

By Blogger Muli Koppel, at 3:54 PM  

Hi Eyal,

Your analysis of the java-related communities is righteous! And as you correctly noted, they replace the official standard bodies around java and j2ee. I heard they replace J2ee altogether, though. Anyhow, I'll add Java to my shopping cart...


On Web 2.0 and Ontologies

I have just searched for "python" on Technorati. Suddenly, pictures from flickr bearing the tag "python" were presented to the right of my search results, along with lines from other blogs. Though I knew about this feature of Technorati/Feedster, it was the 1st time I really saw it in action.


Web 2.0, with del.icio.us, last.fm, flickr and more, is wrapping us. The change is so fundamental; the Internet is actually becoming a river of digital consciousness, just as I wrote a couple of years ago in an article about Microsoft collaboration practices. Actually, what Microsoft tried to do in the collaboration area was something similar to Web 2.0, but it lacked something fundamental that I feel inside my veins: it lacked real people.

Today we see how Web 2.0 forms in front of our eyes. I am amazed, and to that I add a bit of puzzlement. Ontologies are under serious attack; serious – because the attacks are managed by smart people with real arguments. Clay Shirky is one of them, but I'll give an unexpected example: Adam Bosworth, Google VP of Engineering (he can be heard at IT Conversations). And there are more.
I'll sum up my intuitive understanding of their arguments: Ontology is reasonable when reality is objective. When objective – bring a "reality domain expert" and he/she will describe reality (in triplets or quadruples – who cares…). And that's the paradigm of Web 1.0: we put stuff on the net and people view it, passive to the objective digital world that was previously uploaded to the net by some "experts". But Web 2.0 proved the ontological approach to be ridiculo.us! How many different tags (i.e. classifications) does a single real entity (URI) have? All tags bearing different meanings to different people in different contexts. Web 2.0 is the post-modernism to Web 1.0's modernism: there's no absolute truth; my interpretation of a URI is no better and no worse than yours. In that case, an objective description of reality in the form of an ontology is pointless.

Bosworth discusses very briefly RSS 1.0, which was based on RDF. He claimed it failed because it's complex and people don't think in syllogisms (that's from Clay…). RSS 1.0, he said, was the result of a standards committee, while RSS 2.0 was the outcome of a single man's initiative (like Joshua Shachter's del.icio.us). The differences are clear-cut: standards (IEEE etc.) are formalizations of an objective world – objective but highly compromised! Web 2.0 has no standards! Standards and Web 2.0 are almost antipodes. No standards and no compromises.

I'm not taking a position here!!! I'm not saying it's good or bad. But post-modernism brought a lot of good things along with a lot of bad things, such as the potential "eradication" (that's probably extreme…) of solid moral ground. Web 2.0 has the potential to bring over huge moral problems – I am not aware of anyone who tackles this issue. When every point of view is legit, no point of view is valuable!
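A tiny illustration of the point about tags (users and tags invented by me): the same URI accumulates whatever classifications its readers bring to it, and no single one of them is "the" correct one:

```python
# One URI, many classifications: a del.icio.us-style folksonomy keeps every
# user's tags side by side instead of forcing a single, "correct" category.

from collections import Counter

tags_for_uri = {
    "http://example.org/some-article": [
        ("alice", "python"), ("alice", "scripting"),
        ("bob",   "snakes"), ("bob",   "biology"),
        ("carol", "monty-python"), ("carol", "humour"),
        ("dave",  "python"), ("dave",  "web2.0"),
    ]
}

counts = Counter(tag for _, tag in tags_for_uri["http://example.org/some-article"])
print(counts.most_common())
# [('python', 2), ('scripting', 1), ('snakes', 1), ('biology', 1), ...]
# The aggregate view emerges from individual, context-bound choices -- there is
# no single ontology class that all four users would have agreed on.
```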

Also, a good question is what the future of standards will be. We, architects, are in a strange period where standards are treated with contempt – and I agree they are highly problematic, complicated and useless... Bosworth, for instance, gives his pitch against the WS-* – they indeed are the most glorious yet ridiculous bunch of standards. How many of them exist? No one counts. And yes, they are as complex as RDF.


5 Comments

By Blogger Udi h Bauman, at 6:48 PM  

Very nice & informative post, thanks.

I don't see the morality problems, because I see the bottom-up subjective approach as an embodiment of democracy, in its ultimate form, like Popper's open society. Eventually, when the voice of everyone is heard & aggregated, the voice of the collective intelligence is heard, which I believe is likely to be more moral than a dictated morality.

Standards are being replaced by ad-hoc standards. See for example microformats.org, which do a great job in identifying & defining them.

The ultimate Web2.0 will be the semantic web. Tags provide some semantics, but OWL, SWRL & reasoners will truly turn the Web into a digital consciousness, & bring amazing collaboration between its nodes (humans, agents & services).

By Blogger Udi h Bauman, at 6:51 PM  

More explicitly, I don't see the ontology & Web2.0 approaches as opposites but rather as complementary. I can see personal OWL ontologies, from which you get diverse opinions, but aggregated knowledge, information & services.

By Blogger Muli Koppel, at 7:46 PM  

Hi Udi,

Thanks for the microformats. I'll look into this.

Regarding the ultimate web 2.0: you should confront yourself with the difficulties Clay Shirky's putting vehemently in his article " The Semantic Web, Syllogism, and Worldview " (http://www.shirky.com/writings/semantic_syllogism.html).

But let's tackle something we're both doing a lot: tagging in del.icio.us. What kind of AI do you need to grasp their meaning? To understand they all refer to the same URI? This is an extremely difficult problem. Most of the Web 2.0 collaborative infrastructure suppliers explicitly claim they will not deal with meaning – they are leaving it to the users.
Semantic misunderstanding is something fundamental to humans. Austin's book on Pragmatics points brilliantly to the problem. I'm citing here from a short overview of the "problem" as described in "What is Pragmatics?" (http://www.gxnu.edu.cn/Personal/szliu/definition.html):
"The ability to comprehend and produce a communicative act is referred to as pragmatic competence (Kasper, 1997) which often includes one's knowledge about the social distance, social status between the speakers involved, the cultural knowledge such as politeness, and the linguistic knowledge explicit and implicit".

The challenges facing the Machines in their attempt to understand the above contextual considerations based on facts listed in ontologies are serious enough. Add to that a Babylonian tagging, where almost every individual has its own ontology and you get a mission impossible.

I know you can solve it, though... :)

By Blogger Nikhil, at 12:04 PM  

The question is going to be who needs to understand the semantic web and how. With strikeiron and other services leading to the exposure of core capabilities to other machines (as WS) composible solutions are going to become the norm and the semantic web is going to become instrumental in describing that from a shared service perspective. From a usable, customer interaction perspective, its going to be more Web 2.0 - where the end results are shared. I already know companies that are driving to SWRL and shared ontologies - that will be enablers at the shared service level. The composition of this and the using of this is going to be the customer facing end of the web.

Tim O'Reilly provides an interesting article http://www.oreillynet.com/pub/a/oreilly/tim/news/2005/09/30/what-is-web-20.html?page=1

The implications of this from a business model perspective are going to be phenomenal. I also think that reuse is going to be the CNC machine of the IT world..

Cheers,
Nikhil

By Blogger Muli Koppel, at 4:52 PM  

Hi Nikhil

thanks for your comment. Yes, the semweb is inevitable. And I invite you to have a look at my post on Faustian deals and magical clipboards - Ray Ozzie's live clipboard wouldn't work without a shared ontology. And yet, if I understand correctly, ontologies, like standards, are about to be micro-formatted, meaning the definition of what's there will stem from one or two subjects that agree first, and the others would come next. That's different from the current situation, where committees define a "Customer" for years.

cheers
muli


Saturday, August 20, 2005

Architecture To Go

I have decided to launch Architecture To Go Ltd after ~5 years of being the Chief Architect of Orange Israel - a mobile gsm/3G Hutchison Whampoa LTD company.

"Who's afraid of Enterprise Architecture?" was the recurrent theme of the first couple of years. Eventually, though, Enterprise Architecture proved itself to be well rewarding, business and socially-wise.

This Blog was conceived with the following ambition: to make technology and Enterprise Architecture much more accessible and understandable to IT people than they are today.

The Blog will cover the following topics:

EAI, SOA, Information Integration, GRID & Utility Computing, Linux migrations, Service Assurance & SLA, Data Center life-cycle management and more.

Hope you'll find it useful (and enjoyable).


Architecture To Go
