Enterprise Logging (The Recursive Enterprise, Part II)
This is the 2nd post on the subject of the Recursive Enterprise.
In the previous post I described a fractalized, recursive Enterprise: a Matryoshka (Babushka) doll of a Service inside a Service inside a Service. I ended the post wondering how we could possibly pinpoint the location of a potential problem in such a recursive-yet-distributed business process architecture, and suggested that Enterprise Logging could be something worth looking at.
So, just before discussing Enterprise Logging and weighing its pros & cons against the challenge we face, I suggest we take a tour of the current state of Logging in the Enterprise (which is totally different from Enterprise Logging…). I'll keep it short and dry.
1. Most 3rd party elements (software packages, middleware and appliances) log their state & status.
2. On the other hand, most in-house, bespoke applications suffer from a serious shortage of logs.
3. Not all logs are created equal. Actually, logs are annoyingly resistant to any standardization attempt (not that there are that many logging standards out there, but still, there are some around). Logs differ in their payload format, their content, their semantics, their distribution channel and so forth.
4. Functional, business applications almost always log exceptions and/or functional misbehaviors. They log neither state (operational or functional) nor status (KPIs [Key Performance Indicators], KQIs [Key Quality Indicators]). The bitter truth is that most applications do not even bother to collect this information.
5. Logs are mostly ignored until a failure occurs.
I'd say this is enough to make one thing clear: in the current Enterprise Architecture there's no way we can methodically pinpoint a problem in an SOA/GRID Business Process. So along with SOA & GRID, a serious, exhaustive retouch of logging must take place, or otherwise those magnificent Services will become black holes.
(Note: actually, this state of logging also prevents non-SOA Enterprises from properly handling production faults or any other issue related to distributed systems. As stated many times already, SOA simply aggravates the situation. It's the difference between difficult-but-somehow-possible and impossible.)
So let's do a quick retouch:
All in-house, bespoke applications (and Enterprise Services fall squarely into this category) must report on specific categories, in a specific way, at a specific time. Put differently, we must have some kind of Logging standard. A surprisingly good logging standard that covers format, content and situation is the IBM Common Base Event model, launched as part of IBM's Autonomic Computing initiative. IBM figured out (quickly?) that if they want a framework that understands what's going on so it can fix things, there's one thing they can no longer avoid: standardizing the logging of their entire stack. I strongly recommend reading (and using) this standard.
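To make "format, content and situation" a bit more tangible, here is a minimal sketch of what a standardized event record could look like. The field names, the situation list and the 0..100 severity scale are my own illustrative assumptions, loosely inspired by the idea of a common event model; this is not the actual Common Base Event schema, so read the standard itself for the real thing.

```python
from dataclasses import dataclass, asdict
from datetime import datetime, timezone
import json

# Hypothetical situation categories; NOT the official Common Base Event list.
SITUATIONS = {"START", "STOP", "CONNECT", "REQUEST", "REPORT", "FAILURE"}


@dataclass(frozen=True)
class CommonEvent:
    """One standardized log record (illustrative field names only)."""
    creation_time: str       # ISO 8601 timestamp in UTC
    source_component: str    # the component the event is about
    reporter_component: str  # the component that reported it
    situation: str           # one of SITUATIONS
    severity: int            # 0 (informational) .. 100 (fatal), arbitrary scale
    message: str


def make_event(source, situation, severity, message, reporter=None, now=None):
    """Build an event, rejecting values that fall outside the standard."""
    if situation not in SITUATIONS:
        raise ValueError(f"unknown situation: {situation!r}")
    if not 0 <= severity <= 100:
        raise ValueError(f"severity out of range: {severity}")
    timestamp = (now or datetime.now(timezone.utc)).isoformat()
    return CommonEvent(timestamp, source, reporter or source,
                       situation, severity, message)


def to_json(event):
    """Serialize with sorted keys so every producer emits the same layout."""
    return json.dumps(asdict(event), sort_keys=True)
```

The point of the validation in make_event is that a standard only helps if non-conforming records are rejected at the source, rather than discovered later by whoever reads the log.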
The Common Base Event model, or something like it, could (and should) be introduced into newly built applications; but what about the legacy ones (the other thousand and one log formats, contents etc.)? They clearly should be translated as well, or the Common Base Event model would be just another log format. Remember: languages were created to generate chaos; we aspire to minimize chaos, hence the logical attempt to revert to a unified, pre-Babel language.
OK. We'll employ whatever technique it takes to transform all logs into the Common Base Event model. But then what?
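One such technique, sketched very roughly: a small set of per-format translators that each turn a legacy line into the same common record. Both legacy formats below are made up for illustration, as are the field names and the severity mapping, and the sketch assumes legacy timestamps are already in UTC.

```python
import re
from datetime import datetime, timezone

# Two invented legacy formats, for illustration only.
SYSLOG_LIKE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) (?P<host>\S+) "
    r"(?P<level>[A-Z]+): (?P<msg>.*)$")
PIPE_DELIMITED = re.compile(
    r"^(?P<epoch>\d+)\|(?P<app>[^|]+)\|(?P<code>\d)\|(?P<msg>.*)$")

# Hypothetical mapping from textual levels to a common numeric severity.
LEVEL_TO_SEVERITY = {"INFO": 10, "WARN": 30, "ERROR": 50, "FATAL": 70}


def from_syslog_like(line):
    m = SYSLOG_LIKE.match(line)
    if m is None:
        return None
    # Assumption: the legacy timestamp is in UTC.
    ts = datetime.strptime(m["ts"], "%Y-%m-%d %H:%M:%S").replace(
        tzinfo=timezone.utc)
    return {"creation_time": ts.isoformat(), "source": m["host"],
            "severity": LEVEL_TO_SEVERITY.get(m["level"], 10),
            "message": m["msg"]}


def from_pipe_delimited(line):
    m = PIPE_DELIMITED.match(line)
    if m is None:
        return None
    ts = datetime.fromtimestamp(int(m["epoch"]), tz=timezone.utc)
    # Assumption: the single-digit code scales linearly to severity.
    return {"creation_time": ts.isoformat(), "source": m["app"],
            "severity": int(m["code"]) * 10, "message": m["msg"]}


TRANSLATORS = [from_syslog_like, from_pipe_delimited]


def translate(line):
    """Try each translator in turn; keep unparseable lines instead of dropping them."""
    line = line.strip()
    for translator in TRANSLATORS:
        event = translator(line)
        if event is not None:
            return event
    return {"creation_time": None, "source": "unknown",
            "severity": 0, "message": line}
```

Supporting a new legacy format then means adding one function to TRANSLATORS, without touching anything downstream of the common record.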
When loaded into a data warehouse, this standardized, enterprise-wide log provides a magnificent panoramic view of the entire Enterprise landscape. Imagine that any programmer, operator or administrator can log in to one system, through the same UI, and see historical as well as real-time events from whatever the object of interest is: a service, an application, a router or a database, all in one. This infrastructure also lays the foundations for correlation, data mining, prediction, prognosis, capacity planning and many more enterprise architecture efforts. (You can have a look at the Microsoft case study documenting this Enterprise Logging System.)
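As a toy illustration of the "one system, one query" idea, the sketch below loads already-standardized events into an in-memory SQLite table and asks one question across every kind of source. A real warehouse would obviously be something much bigger; the table layout, column names and severity threshold here are assumptions of mine.

```python
import sqlite3


def open_warehouse():
    """Create an in-memory stand-in for the enterprise log warehouse."""
    db = sqlite3.connect(":memory:")
    db.execute("""
        CREATE TABLE events (
            creation_time TEXT    NOT NULL,  -- ISO 8601, UTC
            source        TEXT    NOT NULL,  -- service, router, database...
            severity      INTEGER NOT NULL,
            message       TEXT    NOT NULL
        )""")
    db.execute(
        "CREATE INDEX idx_events_source_time ON events (source, creation_time)")
    return db


def load(db, events):
    """Insert an iterable of dicts keyed by the column names."""
    db.executemany(
        "INSERT INTO events VALUES "
        "(:creation_time, :source, :severity, :message)", events)
    db.commit()


def recent_problems(db, since, min_severity=50):
    """Every serious event since a point in time, whatever its source."""
    cursor = db.execute(
        "SELECT creation_time, source, message FROM events "
        "WHERE creation_time >= ? AND severity >= ? "
        "ORDER BY creation_time", (since, min_severity))
    return cursor.fetchall()
```

Comparing timestamps as text works here only because every record uses the same ISO 8601 layout and the same time zone, which is exactly the kind of thing the standard is there to guarantee.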
This warehouse, if based on [near] real-time events, can serve for manual pinpointing of [near] real-time problems in a complex, recursive business process, given knowledge of its topology. This knowledge may be partially documented or captured in the heads of some IT people, but essentially it does not exist. When Enterprises start using dynamic service binding, topological knowledge will have to be generated automatically. This feature will be part of the new Enterprise Architecture Framework, where objects will be created, configured, monitored & controlled not by humans but by machines.
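Assuming for a moment that the topology is available, the pinpointing itself can be sketched as a walk down the Service-inside-a-Service tree: of all the services currently reporting failures, the interesting ones are those whose own direct dependencies look healthy. The topology shape (a dict from service to the services it calls) and all service names below are hypothetical.

```python
def pinpoint(topology, failing, root):
    """Return failing services under `root` whose direct dependencies are healthy.

    topology: dict mapping a service name to the list of services it calls.
    failing:  set of service names with recent failure events in the warehouse.
    """
    suspects = []
    seen = set()  # guards against cycles and repeated visits

    def visit(service):
        if service in seen:
            return
        seen.add(service)
        children = topology.get(service, [])
        for child in children:
            visit(child)
        # A failing service with a failing dependency is probably a victim,
        # not the origin, so only report it when its dependencies are healthy.
        if service in failing and not any(c in failing for c in children):
            suspects.append(service)

    visit(root)
    return suspects


# Hypothetical business process: order -> billing -> payment-gw, order -> shipping
topology = {
    "order": ["billing", "shipping"],
    "billing": ["payment-gw"],
    "shipping": [],
    "payment-gw": [],
}
failing = {"order", "billing", "payment-gw"}
print(pinpoint(topology, failing, "order"))  # ['payment-gw']
```

This only looks one level down when deciding whether a service is a victim, so a silent intermediate service will leave more than one suspect in the list. That is still a far shorter list than "everything that logged an error".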