Always-On: Re-engineering the Read/Write Web and the Enterprise (with del.icio.us examples)
Last week del.icio.us went down for several days, creating a domino effect of failures all across the web, as many sites (mine included) were mashing up the del.icio.us APIs for information retrieval. This was an acute reminder of the inherent, hidden fragility of SOA implementations – Enterprise or WWW alike. But this time, I am going to offer a solution.
This will be a longer post than usual, but I hope it will be rewarding, as I am going to present a most unusual requirement for an SOA implementation – one which, if met properly, can change the way you think about SOA and bring salvation not only to Enterprises, but also to the not-yet-matured Read/Write Web.
In early 2001, I designed the first version of Orange's bespoke SOA framework. I was obsessed with availability and reliability, as the SOA hub was about to become the central execution and routing engine of a Telco company, meaning downtime was not an option. I remember a meeting I had with Orange's EVP of Technologies, who made it clear to me that if this SOA stuff was not going to be as reliable as an Ericsson switch, he wouldn't approve it. And to remove any shadow of a doubt, he explained in great detail what he meant by an "Ericsson switch": when a voice call is created in the telecom switches, there are always two switches involved - a master and a slave. The slave keeps constant track of the conversation on the master, so if the master fails, the conversation continues uninterrupted on the slave. "That's the availability and reliability I would like to have from your SOA hub," he concluded.
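The switch pairing he described can be sketched in a few lines. This is a deliberately minimal illustration of the master/slave pattern, not real switch software; all names (`Switch`, `update_call`, `route`) are invented for the example.

```python
# Minimal sketch of the master/slave pattern: the slave continuously
# mirrors the master's call state, so if the master dies mid-call,
# the slave can continue the conversation without interruption.

class Switch:
    def __init__(self, name):
        self.name = name
        self.call_state = {}   # call_id -> latest state snapshot
        self.alive = True

    def update_call(self, call_id, state, slave=None):
        self.call_state[call_id] = state
        if slave is not None:
            slave.call_state[call_id] = state   # constant replication

def route(master, slave):
    """Return whichever switch can currently serve the call."""
    return master if master.alive else slave

master, slave = Switch("master"), Switch("slave")
master.update_call("call-42", {"leg": "connected"}, slave=slave)
master.alive = False                       # master fails mid-conversation
survivor = route(master, slave)
print(survivor.name, survivor.call_state["call-42"])
```

The essential point is that replication happens on every state change, not at failover time: by the time the master dies, the slave already knows everything it needs.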
I was fascinated by this engineering-oriented manager: to him, IT was an eerie, money-sucking dark force, and yet he got the essence of SOA in a matter of seconds, drawing the correct analogy to his familiar landscape of telecom switches and IP routers.
So we engineered our bespoke SOA hub to yield exactly that fabulous availability and reliability, making it suitable for Telco-grade operations. But this is not the unusual requirement I mentioned at the beginning, so you'll need to bear with me for a bit longer.
When we launched, our bullet-proof SOA hub was indeed as reliable and as available as designed. All across the Enterprise, departments started to use Services and to build mash-up applications, watching time-to-market shrink and productivity grow.
But one day everything crashed. The engineering guys, and all the rest of my admirers, had their del.icio.us day.
Everything crashed, but it was not the SOA hub that failed. One of the systems providing a most popular Service stopped responding, for whatever reason. Without getting into too many details, this failure created a chain of other failures, and we ended up with a crashed IT landscape.
Yet, everybody was looking accusingly at us, the SOA framework providers.
"Guys, wake up," I said, "one of the systems was down - it was not the SOA hub! You don't really expect me to guarantee that the service provider is up and running - that's impossible! I'm just the plumber, the BUS. If you take a bus to meet a friend, the bus driver cannot guarantee that your friend will actually be waiting for you when you get off at the station."
"Man," they said, "you introduced this Services façade claiming that we no longer need to mess with any system besides our own. Well, we bought into your story, and now you're coming and telling us you cannot commit? That's not going to happen: either you commit that whenever we access a Service - it's there - or get out of our way and let us build our programs the way we used to before you came along with your SOA stuff."
Although this dialogue was never actually spoken aloud, it became clear to me that by providing an Enterprise SOA I was expected to assume an Enterprise responsibility. Naturally, that is beyond the scope of any SOA framework. No SOA supplier - us included :) - can guarantee the availability of the Service providers. But that was the unusual requirement I was facing: provide an SOA framework in which the Services are always on. How do we do that?
Some comments I got from my colleagues at the time, as well as from my current customers and from ISVs (IBM…) to whom I presented this challenge:
"Well, Providers shouldn't be down! Make them highly available!"
"Use clusters!", "Use Oracle RAC!"
"You have lousy systems architecture if your mission-critical applications fail!"
"The Mainframe never fails; we put all our stuff on the mainframe."
And on and on it goes.
All the advice was provider-oriented and optimistic. Provider-oriented, because it claimed something had to be done on the provider's side in order to guarantee an always-on Service; optimistic, because it assumed that once the provider was fortified it would always be on.
My experience has taught me that pessimistic, even paranoid, designs are better. As a rule, I believe systems should be allowed to rest! There are upgrades (of the OS, the DB, the App Server); there are bugs, human errors, missing procedures and disasters of all kinds. So we could invest some 5-10 million dollars per provider and still make it only theoretically bullet-proof, because human errors and disasters can always happen. Differently put, pessimistic planning is not a bad idea.
In the search for the always-on Service solution, our focus changed. If we were to accept an axiom under which applications were allowed to R.I.P., then applications could no longer play a critical role in the architecture: applications had to become OPTIONAL to Service execution. The only way to cope with such a requirement was to shift our focus from Applications to Information. We concluded that we had to protect the Information, not the Applications.
We then looked differently at our Services: no longer as facades for processes invoking application functionality (APIs), but rather as Information Retrievers or Information Modifiers. We realized that the Services that were Information Retrievers were the most popular and the least tolerant of failures, because they were typically part of a synchronous transaction. In contrast, the Services that were Information Modifiers were less popular and highly tolerant of failures: their consumers could get along pretty well with a later execution [as a result of a provider downtime]. This analysis formed the basis for our solution: the Information required by Information Retrievers had to be protected in an always-on manner.
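This classification can be sketched with a toy dispatcher. Everything here is hypothetical (the service names, the queue): the point is only that a Modifier call can be parked for later replay when its provider is down, while a Retriever call has no such luxury.

```python
from collections import deque

# Toy dispatcher illustrating the Retriever/Modifier split:
# Information Retrievers are synchronous and failure-intolerant,
# Information Modifiers may be deferred until the provider returns.

RETRIEVERS = {"get_bookmarks"}    # read-only, must answer now
MODIFIERS = {"add_bookmark"}      # write, tolerant of later execution

pending_writes = deque()          # replay queue for deferred modifiers

def invoke(service, payload, provider_up):
    if service in MODIFIERS and not provider_up:
        pending_writes.append((service, payload))   # execute later
        return {"status": "queued"}
    if service in RETRIEVERS and not provider_up:
        # This is exactly the gap an always-on information store must fill.
        raise RuntimeError("retriever has no fallback")
    return {"status": "executed", "service": service}

result = invoke("add_bookmark", {"url": "http://del.icio.us"}, provider_up=False)
print(result)
```

Note how the asymmetry falls out naturally: the queue absorbs write-side downtime for free, but nothing in this sketch yet protects the read side.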
A year later, our Enterprise SOA framework had an additional construct - let's call it "Google". Information Retrievers were no longer redirected to the applications that created the information, but rather to our "Google", which was kept up to date in [near] real time and was also designed like an Ericsson switch. :)
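A minimal sketch of that construct, with invented names throughout: every successful write also refreshes a replicated read store, and retrievers are redirected there instead of to the originating application.

```python
# Sketch of the always-on read store: writes go to the application of
# record, and each write is also pushed, near real time, into the
# replicated store; all retrievals are served from that store, so the
# application is free to "rest" without breaking a single read.

class ReadStore:
    """Stands in for the replicated, Ericsson-switch-grade store."""
    def __init__(self):
        self.data = {}
    def put(self, key, value):
        self.data[key] = value
    def get(self, key):
        return self.data.get(key)

class Application:
    def __init__(self, store):
        self.records = {}   # the application's own repository
        self.up = True
        self.store = store
    def write(self, key, value):
        if not self.up:
            raise RuntimeError("application is down")
        self.records[key] = value
        self.store.put(key, value)   # near-real-time propagation

store = ReadStore()
app = Application(store)
app.write("user/tags", ["soa", "web2.0"])
app.up = False                 # the provider is allowed to rest
print(store.get("user/tags"))  # retrieval still succeeds
```

The design choice mirrors the Ericsson switch: replication happens at write time, so the store never has to fetch anything from a possibly-dead application at read time.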
I am keeping some more professional secrets to myself… I hope, though, that the general idea is reasonably clear.
I think that 2005 proved this approach right. The focus of the entire industry has been shifting from Applications to Information. We all agree today that there is no need to visit a Web site (an application) in order to get the Information we want or need: through Information syndication, all the data I need comes to me. But the del.icio.us failure is a warning sign for this remix generation (a term coined by Vinod Khosla).
The del.icio.us failure proved (at least) two things:
1. That the value of Information is subjective - what one considers noise, another considers gold. The del.icio.us downtime was undoubtedly painful for some individuals, while others couldn't care less.
2. That we have to take the most pessimistic approach to Information protection, and that we have to assume responsibility in a global manner. We cannot leave this responsibility to the web-site owners. Web sites, like applications, should be allowed to rest, but the Information they hold must be always on.
It's time, then, to Google. "Google is the only globally scalable distributed system that can handle all information in all languages all the time". This sentence is taken from an absolutely fascinating presentation by Mr. Steele, titled “Steele on Intelligence - What can we know, how? Reflections on the near future. Google versus the CIA—Five Year Outlook”.
I am joining this observation and suggesting that Web 2.0 applications should be built around the concepts of Information Modifiers and Information Retrievers. The following del.icio.us re-engineering exemplifies it:
1. Del.icio.us will provide an Information Modifier API, in charge of creating, modifying and deleting the information inside the del.icio.us repositories.
2. Once the information is modified, it will be captured, analyzed and stored by Google (either pushed via a Google API or pulled via a Google appliance).
3. Del.icio.us will provide an Information Retriever API that is hosted by Google and retrieves the del.icio.us information from Google.
4. The del.icio.us Information Retriever API will give the Googled information the required look and feel, and any other needed aspects of Information presentation.
If del.icio.us is down, then no Information can be modified. This in itself might be unpleasant, but it is certainly not as catastrophic as not being able to retrieve the Information that is already there - and for that we've got Google.
For Google's never down.
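The re-engineered flow above can be condensed into a sketch. To be clear, the function names and data structures are entirely invented: neither del.icio.us nor Google exposed such APIs; this only illustrates the proposed split.

```python
# Invented sketch of the proposed re-engineering: 'delicious_db' plays
# del.icio.us's own repository (the system of record, allowed to rest);
# 'google_index' plays the always-on copy that serves every read.

delicious_db = {}     # system of record, writable only while up
google_index = {}     # always-on read copy, kept in near real time
delicious_up = True

def modifier_api(user, url, tags):
    """Create/modify a bookmark; only works while del.icio.us is up."""
    if not delicious_up:
        raise RuntimeError("del.icio.us is down: writes must wait")
    delicious_db[(user, url)] = tags
    google_index[(user, url)] = tags   # push the change to the index

def retriever_api(user, url):
    """Serve reads from the always-on index, never from del.icio.us."""
    return google_index.get((user, url))

modifier_api("alice", "http://example.com", ["soa"])
delicious_up = False                                 # the del.icio.us day
print(retriever_api("alice", "http://example.com"))  # still answers
```

During the outage, writes fail loudly (unpleasant, but survivable) while every read keeps working: exactly the trade-off the post argues for.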