Tuesday, August 23, 2005

On the Utility of the GRID

There's considerable dissatisfaction and confusion around the actual, practical meaning of the GRID and Utility computing concepts. Are Grid and Utility interchangeable, partially overlapping or altogether different? Add the on-demand concept to that soup, and you're beaten up.
I was recently contracted by a Utility-computing startup and used the word GRID in some of my presentations. The CEO insisted on replacing all occurrences of the word with Utility, claiming the confusion around GRID is bad for business. I assume that a GRID-computing startup would have had the same reaction towards the word Utility.

It's time, then, for a little definitions exercise.

Most of today's Enterprises have a virtual existence alongside their physical one; some are represented solely in the virtual sphere. Changes in the virtual sphere not only occur every passing minute – they are also identified, classified and digested at light speed. To stay tuned, Enterprises have had to reengineer almost every aspect of their existence: customer touch-points, business processes, application packages, web services, real-time event processing, business QoS and so on. Enterprises are in the process of becoming real-time, adaptive, organic systems.

Under this paradigm, IT is the business. More precisely: IT functionality is the business. IT infrastructure is nothing but plumbing (this hasn't changed from the past; what has changed is the business value of IT functionality). The last thing CEOs want to hear is that they are losing money because of plumbing problems. CEOs can understand Billing functionality limitations or difficulties in implementing a complex churn-prevention business process, but they will neither understand nor accept plumbing issues. Put differently, the trivial requirement in the eyes of the CEO is that IT infrastructure functions just like any other infrastructure on the premises (electricity, water, air conditioning).

One possible definition for Utility Computing is therefore: "[almost always] uninterrupted supply of computing resources".

Do note: we have just defined Utility Computing as a business requirement and not as a technical solution! IT infrastructures in today's Enterprises must function like any other utility infrastructure.

I assume we all agree on this requirement. Let's give it a nice technical solution…

Following are common reasons for downtime. Knowing these reasons will allow us to build an architecture that would prevent them from happening.

Common reasons for downtime:

1. Hardware failures
2. Computing resources shortage in peak utilizations
3. Misconfiguration
4. Humans and Security
5. Application bugs

While there are many reasons for downtime, the most notorious are hardware failures and sudden shortages in computing resources. In fact, these two are responsible for less than 20% of all outages, while misconfiguration (missing or incompatible elements, wrong setup, forgotten monitoring/backup agents etc.), humans and security (i.e. unauthorized access to critical resources) and application bugs are the primary causes of most downtime (these figures are based on experience and are backed up by Forrester Research reports).

So why are hardware failures and sudden lack of resources burned into our consciousness as the prime, trivial, suspects? Because infrastructure problems are relatively easy to solve [adding more hardware is a known, brainless solution to an unknown problem; correcting an application bug is an unknown solution to an unknown problem] and because there's a concrete scapegoat to hang (a small group of sys admins vs. hundreds of programmers).

Providing solutions for downtime reasons 1 and 2 will not, on its own, yield the desired IT utility infrastructure. Still, a complete solution is built in phases, and removing reasons 1 & 2 from the stack is progress. Which brings us to the definition of GRID.

GRID is a technical architecture providing a solution for downtime reasons 1 and 2. It replaces the existing technical solutions, which are mainly failover and manual scale-up. GRID comes in two flavors: the SSI flavor and the Scale-out one.

Let's elaborate.

Most enterprise business applications were architected to scale up, i.e. to run on a single physical SMP server. If more computing power is needed, extra computing power is added to the server until the server reaches its maximum capacity. When this happens, the application can be migrated to a higher-capacity server. These alternatives are mostly manual, with the exception of mainframe-like UNIX boxes (the Sun F15 as an example) that support dynamic resource reallocation.

Coping with potential hardware failures is done by employing failover techniques to a standby node (whether active or passive). The problem with failover is that it's time consuming, especially for very large database servers (it's the mounting of the file systems on the standby node that takes time, as well as the automatic recovery of the database). A 2-4 TB database server can wait a good 10 minutes or more for a failover to complete.

Manual reallocation of resources and the time it takes to perform a failover are clearly incompatible with the utility business requirement.

While failover clusters have no alternative under the scale-up architecture, coping with an unexpected peak had a simple solution: over-capacity planning. If X computing power is what's needed for an application, then X+20% or more was actually purchased and configured. This solution is expensive and not cost-effective. Moreover, experience shows that applications ended up over-utilizing the extra capacity anyway (usually as a result of a bug introduced in one of the endless application changes reflecting the dynamics of the real-time business).
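The economics of over-capacity planning can be sketched in a few lines of Python. The numbers below are purely illustrative (a hypothetical "+20%" margin used two hours a day), not measurements from any real deployment:

```python
# Hypothetical numbers illustrating the cost of over-capacity planning:
# capacity is bought for the worst case but sits idle most of the day.
baseline = 100           # units of computing power the application needs
headroom = 0.20          # the "+20%" (or more) safety margin bought up front
peak_hours_per_day = 2   # hours per day the extra capacity is actually used

overprovisioned = baseline * (1 + headroom)
idle_units = (overprovisioned - baseline) * (24 - peak_hours_per_day) / 24
print(f"capacity purchased: {overprovisioned:.0f} units")
print(f"average idle extra capacity: {idle_units:.1f} units")
```

With these assumptions, over 90% of the extra capacity is paid for but never used – which is exactly why a shared pool of resources is so attractive.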

A more suitable application architecture for today's requirements is the scale-out one. Under this architecture, the different modules of the application can have multiple instances running concurrently on different nodes, dividing the total workload among them. Theoretically, an application can span as many nodes as needed. No failover clusters are required: if one instance fails, the other running instances take over its share of the work. And there's a direct economic advantage: commodity hardware can be used as a pool of 2-way servers, reaching computing power similar to the 8, 16 or 32-way servers of the old scale-up architecture.
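The failure behavior described above can be sketched as a toy model in Python (node names and the round-robin routing are illustrative; a real scale-out tier would sit behind a load balancer or GRID middleware):

```python
import itertools

class ScaleOutPool:
    """Toy sketch of a scale-out application tier: requests are spread
    across the live instances, and a failed instance is simply skipped."""

    def __init__(self, instances):
        self.live = set(instances)

    def fail(self, instance):
        # No failover dance: the failed instance just leaves the pool.
        self.live.discard(instance)

    def route(self, requests):
        # Round-robin the requests over whatever instances are alive.
        assignment = {}
        for req, node in zip(requests, itertools.cycle(sorted(self.live))):
            assignment.setdefault(node, []).append(req)
        return assignment

pool = ScaleOutPool(["node-a", "node-b", "node-c"])
pool.fail("node-b")          # hardware failure: no 10-minute failover wait
print(pool.route(range(6)))  # the remaining nodes absorb the workload
```

The point of the sketch is the `fail` method: losing a node changes nothing but the set of routing targets, which is why no standby server or file-system remount is involved.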

A scale-out GRID provides the necessary automation for the provisioning of new application instances on an ad-hoc allocated server.
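A minimal sketch of that provisioning automation, assuming a shared pool of spare servers and a simple per-instance load ceiling (all names and thresholds below are hypothetical):

```python
def rebalance(instances, free_servers, total_load, max_load_per_instance=100):
    """Grow the instance list until every instance is under its load ceiling,
    drawing servers from the shared pool for as long as any remain."""
    while free_servers and total_load / len(instances) > max_load_per_instance:
        # Ad-hoc allocation: take a spare server and provision an instance on it.
        instances.append(free_servers.pop())
    return instances

nodes = ["app-1"]
pool = ["spare-2", "spare-1"]
rebalance(nodes, pool, total_load=250)
print(nodes)  # new instances were provisioned automatically
```

In a real GRID product this loop would be driven by monitoring data and would also release servers back to the pool when load drops; the sketch only shows the growth direction.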

An SSI GRID can be viewed as an interim, backward-compatible GRID solution for scale-up applications. The application "believes" that it runs on a single server, while it actually runs on multiple servers. Put differently, SSI is a resource virtualization solution.

That's it for today: we have defined Utility Computing as a business requirement and GRID computing as the automation of resource provisioning in a scale-out architecture. Of course, this is too short of a definition, so what I'd like to discuss next is how this same automation of the scale-out architecture solves downtime reasons 3 (misconfiguration) and 4 (humans and security), providing the necessary elements for a utility, nonstop, IT infrastructure.

The sequel to this post is On the Utility of the GRID, part II.


Anonymous David Faibish said...

Thanks for articulating the "downtime" definition. It's helpful with the challenges I have currently in my company.

9:04 PM  
