A sense of control when it all goes wrong
If one thing is guaranteed in any computer system, it is that at some point something will go wrong. This could be as mundane as a monitor failing, something a little more stressful such as a power supply giving up, or a bigger issue such as a virus attack or a major data failure. This is where having a good “problem manager” is such an asset. So what is problem management? In the ITIL model it has two key areas: “proactive” and “reactive”. Proactive problem management is about understanding why you have issues and what they are, then identifying the trends that appear, with the intention of getting rid of them. Reactive problem management is more about dealing with the big failures as they occur, understanding why they happened, and creating (and managing) the remedial actions to stop them happening again.
A small part that is quite under-played in the books is what is simply titled “major incident management”. You will notice that the name includes the word “incident”; it should be treated as an extension of the incident process. For a small business, unavailability of a computer system or loss of data could mean a significant loss of profits (compare this with a large organisation, where I have seen a full office of 1200 unable to work for 2 days, with the senior managers taking the attitude that people would naturally catch up over the next week or so without the need for overtime!). If your staff levels are low and you depend on single machines carrying out dedicated processes, the loss of a key piece of kit could be devastating. This is where the problem manager carrying out the “major incident manager” role is key.
As soon as the fault (normally associated with a “service breach” or loss of service) is identified, the problem manager takes control of the incident.
The first thing to ascertain is the impact: what is the loss of service preventing you from doing, and how long do you have to recover it? At this point the problem manager is not interested in the possibility that the service will not be recovered; it is about understanding what the problem is preventing you from doing and the risk to your business. These facts allow the problem manager to make informed decisions and map out the path of recovery. Sometimes they may focus on recovering only part of the service, enough to get you through the initial milestone; on other occasions a more planned approach may be possible. Throughout, the problem manager checks with the key suppliers to ensure they are working on the problem and that any contractual obligations are being met, whilst keeping a steady eye on any key deadlines.
Rolled into this, the problem manager is the ideal person to co-ordinate any communications to the wider business users who may be impacted.
The problem manager’s work does not stop once the service is restored. The next focus is to understand what went wrong and to stop it happening again (that will be the topic for the next instalment of this blog). But to close this article I would like to bring it back to the context of a small to medium-sized business.
To be a good problem manager, you do not have to be technical (says he who has just offended any problem manager from a technical background!). Prior to moving into this industry I was a supermarket manager, a food logistics planner and then an IT systems trainer. A good problem manager needs to have the following skills:
• A good knowledge of the business’s operations, so they can think “what’s going to be impacted next?”
• The confidence to ask dumb questions, not be fazed by senior managers, and definitely not believe the techies when they get into in-depth jargon and explanations of why they can’t fix it
• The ability to take command of a situation and to direct and plan the restoration (especially if it does not go according to plan)
The role of a problem manager could be seen as a luxury in any organisation (and it actually took us 6 months to prove to the senior manager in a FTSE 100 company that they would benefit from one) but in the event of a major system failure, you will be glad that you created the role.
If you would like more information on this topic, email us requesting "PM1 - Managing major system failures".
16th March 2009