The expanded incident lifecycle
So you have had a system failure that caused your business some disruption, or perhaps some loss of sales or profit. You may have been fortunate and had a support contract in place that allowed you to restore the service, or perhaps no contract was in place and you had to source a supplier quickly. Once the service is restored, we need to work out the downtime.
Why? Well, for two main reasons:
1) You may have a support contract that has performance metrics in terms of response and fix times
2) We need to understand how the incident breaks down to look at areas where we could shorten the duration
So how do we measure the downtime?
Well, every incident follows what ITIL refers to as “The expanded incident lifecycle”. Each incident goes through several stages, and key measurements are taken between them (these can then be used to reduce the downtime if future outages occur).
1) Incident – The point that the user notices the service failure
2) Detection – The point that the service failure is reported
3) Diagnosis – The point at which someone starts working on the issue
4) Repair – The point at which the cause of the incident has been fixed
5) Recovery – The point at which normal service is technically provided
6) Restoration – The point at which the users commence using the service
So how can we help ourselves shorten the downtime?
Well, these are the key measurements:
a) “Detection elapse time” is the difference between 2 & 1. This is how long it takes the user to report the fault once they have noticed it
b) “Response time” is the difference between 3 & 2. This is how long it takes whoever is fixing it to start working on the problem. For some contracts this may be a contractual key performance indicator
c) “Repair time” is the difference between 5 & 3. This is how long it takes to get the service back into an operational state. Once again, this may be contractual
d) “Recovery time” is the difference between 6 & 5. This is how long it takes to get users back onto the system once it has been fixed
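As a rough illustration, the four measurements above can be worked out from the six stage timestamps. The sketch below uses Python; the function name, the dictionary keys and the example times are all invented for illustration and are not taken from ITIL or from any particular service desk tool.

```python
from datetime import datetime

# Hypothetical helper: takes the timestamps of stages 1 to 6 and returns
# the four measurements a) to d) as defined in the list above.
def lifecycle_measurements(incident, detection, diagnosis, repair,
                           recovery, restoration):
    # Note: the repair timestamp (stage 4) is recorded for completeness,
    # but none of the four measurements as defined above uses it.
    return {
        "detection_elapse_time": detection - incident,  # a) 2 minus 1
        "response_time": diagnosis - detection,         # b) 3 minus 2
        "repair_time": recovery - diagnosis,            # c) 5 minus 3
        "recovery_time": restoration - recovery,        # d) 6 minus 5
    }

# Made-up example outage
example = lifecycle_measurements(
    incident=datetime(2009, 11, 3, 9, 0),
    detection=datetime(2009, 11, 3, 9, 20),
    diagnosis=datetime(2009, 11, 3, 10, 0),
    repair=datetime(2009, 11, 3, 11, 30),
    recovery=datetime(2009, 11, 3, 11, 45),
    restoration=datetime(2009, 11, 3, 12, 15),
)

for name, value in example.items():
    print(name, value)
```

Recording the six timestamps for every incident, even in a simple spreadsheet, is enough to produce these figures and compare them from one outage to the next.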
So really, when looking at the time a service is down, you can analyse the “incident lifecycle” in two ways:
A & D focus on your operation, as these are under your control. The more efficient you are at reporting the incident and communicating to your users that the service is back, the shorter your incident downtime.
B & C focus on the technical resource and how they work on your incident (including contractual obligations).
In the next chapter of this topic we will drill down into these points in a little more detail.
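To make the A & D versus B & C split concrete, here is a small sketch in Python with invented figures. It adds A and D together as the part under your own control, and B and C as the part that depends on the technical resource. The variable names and durations are hypothetical.

```python
from datetime import timedelta

# Invented durations for the four measurements (A to D)
detection_elapse_time = timedelta(minutes=20)   # A
response_time = timedelta(minutes=40)           # B
repair_time = timedelta(hours=1, minutes=45)    # C
recovery_time = timedelta(minutes=30)           # D

# A + D: reporting the fault and getting users back on the system
own_operation = detection_elapse_time + recovery_time
# B + C: how the technical resource responds to and repairs the fault
technical_resource = response_time + repair_time
# Together the four measurements span stage 1 to stage 6
total_downtime = own_operation + technical_resource

# Dividing one timedelta by another gives a plain number
own_share = own_operation / total_downtime

print("Own operation:", own_operation)
print("Technical resource:", technical_resource)
print("Total downtime:", total_downtime)
print(f"Share under your control: {own_share:.0%}")
```

With these made-up figures, 50 minutes of a 3 hour 15 minute outage sits with your own operation, which is time you can work on without renegotiating any contract.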
3rd Nov 2009