Friday, March 31, 2006

Organizational Architecture for the Real-Time Enterprise

I was known to be a "Reorg" freak, pulling people out of their daily jobs and giving them ad-hoc assignments based on their talent. In these unexpected attacks on my decent, hard-working employees, I was entirely disrespectful of their titles, official expertise or departments. Whoever had the required combination of skills and personality was a candidate for reassignment. Thus, for a risky, cross-organizational and political project, I pulled out a DBA whom I knew to be both highly human-oriented and a "take no prisoners" kinda guy. Sometimes I engaged bright-minded and motivated geeks in covert missions, so that their managers, who reported to me, wouldn't be able to use it as an excuse for delays in delivery.

That was an "early adopter", beta version of what I call the flat, ESDR organization. To better understand this concept, and for guidelines on how to implement it in your own organization, please keep on reading.

Endless Unpredictability

Paradoxically, what is mostly missing in the real-Time Enterprise is nothing else but Time. The more real Time is, the more it is absent. The rapid changes in technology, business and trends have created a context in which Enterprises have to cope with an endless flow of unpredictable events. This lethal combination of endless and unpredictable is what kills Time and what puts Enterprises in peril. It is possible, of course, to ignore events, but that implies missing a potential business opportunity or a critical fault that has to be taken care of ASAP. Enterprises, therefore, must transform themselves into real-time management and control machines, capable of processing endless, unpredictable events and reacting to them in no time.

What is ESDR?

This transformation requires a different strategy, structure and architecture in every Enterprise dimension: organization, processes, applications, infrastructures and so on. What applies to the Enterprise as a whole applies no less to each of its components.

The architectural principle underlying the no-Time Enterprise is ESDR, which stands for Events, Services and Dynamic reallocation of [general purpose] Resources. The real-time Enterprise is all about on-the-spot capturing of Events and reaction through Services, to which Resources are dynamically allocated. The entire process is policy-driven (or business rules driven) and it runs for ever and ever. Currently, the only domain that has full-blown slideware of ESDR is that of Utility Computing, but that's another story.

The ESDR architecture is radically different from that of the current Enterprise. Until recently, IT organizations were managed without any IT Governance solutions, meaning that most CIOs and other IT managers couldn't have a real-time, detailed, and open view of their own business. Yet having this kind of Information is only the first step; in itself the Information is a dead horse (NSA had the Information before the 9/11 attacks!). So in the real-time Enterprise the Information must be available, but it also has to go through constant processing, alerting, and reaction. And the reaction is not allowed to fail on technical issues like availability and performance. Hence the DR in the ESDR architecture: the ability to dynamically allocate and reallocate resources at run-time.
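The ESDR cycle described above (capture an Event, react through a Service, and reallocate Resources according to policy) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the event kinds, the "billing" service, the node names and the shape of the policy table do not come from any real product.

```python
from dataclasses import dataclass, field


@dataclass
class Event:
    kind: str  # hypothetical event names, e.g. "load_peak"
    payload: dict = field(default_factory=dict)


@dataclass
class Service:
    name: str
    resources: list = field(default_factory=list)


class ResourcePool:
    """A pool of interchangeable, general-purpose resources."""

    def __init__(self, names):
        self.idle = list(names)

    def allocate(self, service, count):
        # Move up to `count` idle resources to the service at run-time.
        granted = self.idle[:count]
        self.idle = self.idle[count:]
        service.resources.extend(granted)
        return granted

    def release(self, service, count):
        # Return up to `count` resources from the service to the pool.
        returned = service.resources[:count]
        service.resources = service.resources[count:]
        self.idle.extend(returned)
        return returned


# Policy (business rules): event kind -> (service name, resource delta).
# These rules are invented for the example.
POLICY = {
    "load_peak": ("billing", +2),
    "load_drop": ("billing", -1),
}


def handle(event, services, pool, policy=POLICY):
    """Capture one event and react through the service the policy names."""
    rule = policy.get(event.kind)
    if rule is None:
        return None  # no rule matches: the event is ignored
    service_name, delta = rule
    service = services[service_name]
    if delta > 0:
        pool.allocate(service, delta)
    else:
        pool.release(service, -delta)
    return service


services = {"billing": Service("billing")}
pool = ResourcePool(["node-1", "node-2", "node-3"])
for ev in [Event("load_peak"), Event("load_drop"), Event("unknown")]:
    handle(ev, services, pool)
```

In a real Enterprise the loop never ends and the policy is far richer, but the shape is the same: nothing is statically bound, and the same pool serves whichever Service the latest Event calls for.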

General Purpose Resources

In the past, every Resource had a function to which it was statically bound. For instance, once a Sun server had been designated as the Billing database server, it remained so until its end of life. And if repurposing was considered – well, it was considered thoroughly: nothing that a real-time Enterprise can wait for.

For the ESDR to be well optimized and cost-effective, it is preferable to have general-purpose resources. A single-purpose resource is more likely to sit idle (which is bad) than a general-purpose one. For example, in the past servers were capable of running only the OS of their vendor. Sun boxes could only run Solaris, IBM Mainframes only z/OS, and so forth. That's an example of a single-purpose resource. If a utilization peak or any kind of failure hits an application running on a Windows server, it is impossible to reuse the idle Mainframe partition in order to recover that application.

Most organizations are structured in the same single-purpose/single-function manner. A storage manager working in the storage team is (conceptually) bound to his team for good. He's a single-purpose resource. If the DBA team is short of manpower for a certain project – they wouldn't look for "spare resources" in the storage team.

The single-purpose inefficiency is bad for dynamic reallocation, and therefore most of the vendors have reengineered their resources as general-purpose ones. A Sun AMD 64-bit machine is capable of running Solaris 10, Linux and Microsoft Windows. Same for the new IBM server families, which are capable of running IBM proprietary OSes as well as Windows and Linux.

As you will see, the real-time Enterprise is adopting the same general-purpose rule for the new staff member's profile. The more multi-functional and general-purpose a staff member is – the better. She must be an expert in one domain – say storage administration or Java programming – but she's also expected to be good enough in Unix administration or in Python. Because when she's idle, she can then be easily repurposed.
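The general-purpose staff profile can be sketched as a small data structure: one domain of expertise plus a set of "good enough" skills, with a lookup for idle people who can be repurposed. This is only an illustration; the names, skills and the `find_spare` helper are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class StaffMember:
    name: str
    expertise: str  # the one domain she is an expert in
    also_good_at: set = field(default_factory=set)  # "good enough" skills
    idle: bool = True

    def can_do(self, skill):
        return skill == self.expertise or skill in self.also_good_at


def find_spare(staff, skill):
    """Return idle staff who can take on `skill`, experts first."""
    candidates = [s for s in staff if s.idle and s.can_do(skill)]
    # sorted() is stable and False sorts before True, so experts come first.
    return sorted(candidates, key=lambda s: s.expertise != skill)


staff = [
    StaffMember("Dana", "storage", {"unix", "python"}),
    StaffMember("Eli", "database", idle=False),
    StaffMember("Noa", "storage"),  # single-purpose: storage only
    StaffMember("Ron", "unix", {"database"}),
]

# The DBA team is short of manpower: who is idle and able to help?
spare = find_spare(staff, "database")
names = [s.name for s in spare]
```

Here Ron, a Unix expert who is also good enough with databases, is the only spare resource. Noa, the single-purpose storage administrator, stays idle no matter how short the DBA team is.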

A Case Study

The following is a case study of a real infrastructure group that could no longer process and react to the endless, unpredictable flow of events (i.e. requests for infra additions, system changes etc.).

The Enterprise was suffering from repeating infrastructure failures and downtimes.

Customers were complaining about delays in delivery time, about the quality of the deliverables, and about bad customer experience in their interaction with the members of the infra group.

The employees of the infra group complained (in their turn) about insufferable customers, who were always coming at the last minute with an ASAP request; about arrogant customers who treated them like technicians rather than engineers, refusing to provide meaningful details about their projects; and finally about "ongoing" – a never-ending list of tasks that eliminated any possibility of doing something new and exciting.

Briefly, an Enterprise Classic (Or, "When the CIO is giving a call to the nearest IT Outsourcing shop").

The Infra Group Structure & Staff Profile

This group had the traditional, function-oriented infra organization structure: a group manager, department managers and team leaders, with each team representing a single function, such as "Unix system", "Storage", "Databases" and so forth. Each team was a silo with minimal interactions with the other infra teams – except for crisis time.

Team leaders were professional geeks, who had been promoted because they were so damn good at what they were doing. But as you can imagine, they were not necessarily as excellent at task management…

The department managers, each in charge of several teams, were mostly veterans in their domain: no longer geeks but highly experienced.

Interaction with the Infra Group

As said, each team was functioning as a silo. This was not the outcome of a structural design – it simply happened this way. Therefore, when project managers from across the entire Enterprise needed an infra solution, they had to deconstruct the solution into its components and open a request with each of the infra teams. For instance, if a PM needed a database server, she would file a request with the Unix team, asking for a server; with the storage team, asking for storage; and with the database team, asking for a database. She would then coordinate the infra teams to have her solution ready on time.

This is a classical point-to-point integration: PM-->db, PM-->storage, PM-->unix – a logical way of integrating with resources in the pre-SOA, point-to-point era.

Some more findings that no one should be surprised about:

1. The human interaction between the infra teams and the rest of the Enterprise happened through emails and corridor chats. There was no methodical way to open requests and to track them.

2. After a while the infra group did place a front-end system through which PMs and whoever else could open requests to each of the teams. Finally, the actual amount of work became visible, and it was indeed an endless list of to-do tasks.

But a fundamental management issue remained unsolved. Because of the silo nature of the teams, their tasks were not inter-linked, nor did they point to the external project that created them. There was, therefore, no way to pin down the tasks that were part of a certain project; there was no way to track the progress of an external project inside the Infra group, or to prioritize it.

3. Most of the work was done from memory. Even though many of the tasks repeated themselves (creating a database, installing a server, allocating storage etc.), the teams were not using the pre-automation minimal quality guarantee, also known as the checklist. As experienced as they were, they couldn't beat the devil who's been lurking in the details.

4. The lack of management was visible in other domains as well. No consolidated or up-to-date asset management existed; no impact analysis of potential changes or of run-time failures was possible, and so forth.
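The missing link described under finding 2 (tasks that do not point back to the project that created them) is easy to picture in code. The following Python sketch assumes a hypothetical `project` field on every task; the team names, task summaries and project identifiers are invented for illustration and do not describe the group's actual front-end system.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Task:
    team: str       # "unix", "storage", "database", ...
    summary: str
    project: str    # hypothetical ID of the external project behind the request
    done: bool = False


def progress_by_project(tasks):
    """Return {project: (done, total)} across all infra teams."""
    totals = defaultdict(lambda: [0, 0])
    for t in tasks:
        totals[t.project][1] += 1
        if t.done:
            totals[t.project][0] += 1
    return {p: tuple(v) for p, v in totals.items()}


tasks = [
    Task("unix", "install server", "billing-db", done=True),
    Task("storage", "allocate storage", "billing-db"),
    Task("database", "create database", "billing-db"),
    Task("unix", "patch kernel", "crm-upgrade", done=True),
]
progress = progress_by_project(tasks)
```

With that single field in place, the progress of an external project can be read across the silos (one of three tasks done for "billing-db" in this toy data), which is exactly what the group could not do.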

Final Note

A complete lack of governance and an inability to manage & control yielded the inevitable consequences described earlier: exhausted employees, dissatisfied customers, and unplanned outages of Enterprise production systems.

Which is why the flat, ESDR restructuring of this group was so critical.

In the next post I will explain what the flat thing is, and describe the new ESDR structure of that infra group.

