Quick approach to carrying out a Risk Review
I thought I would start with a little story about a problem my team had to deal with last month. The service in question was what is known as a “data warehouse”. In simple terms this is a big stack of disk where data such as sales, customer and stock information is dumped each day in a set of predefined tables. Tools are then used to query the data so the company can make informed decisions very quickly.
A great service, and one that has become very critical. Critical to the business, but not so critical that the business would pay for it: when the Service Delivery Manager and the service owner said “you need to spend money to get the level of resilience you want”, the business replied “yeh OK, maybe next year…” So the service was ticking along quite nicely when a key component (the system board on the staging server) went bang. Now, this service had broken most of the rules of good service management, as follows:
1) The server had been taken off support because a project to replace it was under way. Unfortunately the new server was delayed and no one had pushed the decision to put the old server back onto support.
2) The old server was so old that spares were not available. Worse still, it ran bespoke hardware and a bespoke operating system, so an off-the-shelf box could not replace it.
3) All of the data needed for the new server (which was about half a day's work away from being ready) was on the old server. It had been copied 6 months earlier, but the changes made since then had not been recorded and were not saved anywhere else.
So here we were with a chassis that would not power up and a set of disks that we thought were OK and held the data we required, but which could not be read. On the other side we had a new server ready to accept the data.
So where is this story going? Well, in two directions really. The first is to introduce one of the concepts of IT Continuity, and the second, following that, is to discuss how we got the service back.
So what’s the concept? Well, it is really broken into two parts: the first is understanding the risk and the probability of it happening, and the second is having a set of actions to deal with it.
So, reflecting back on our SME: it has a number of desktop computers, a central server to store the data, a broadband or ISDN internet connection and a single critical application which sits on a separate server that all of the PCs connect to. For each of these services we have to ask three questions:
1) What could go wrong
2) What would be the impact of it going wrong
3) What is the likelihood of it going wrong
By reviewing your services this way you will get a feel for what is both critical and likely. From this you will quickly identify your risk areas. If you would like to see an example of a risk review based upon the questions above to give you a starting point, please email us quoting “Risk Review Example”.
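If you prefer to see the idea written down, the three questions can be sketched as a very small risk register. This is only an illustration, not a formal method: the 1 to 5 scales, the "impact times likelihood" score, the names `Risk` and `rank_risks`, and all of the example entries and ratings are assumptions made up for this sketch.

```python
from dataclasses import dataclass


@dataclass
class Risk:
    service: str               # which service the risk belongs to
    what_could_go_wrong: str   # question 1
    impact: int                # question 2: 1 (minor) to 5 (business stops), assumed scale
    likelihood: int            # question 3: 1 (rare) to 5 (expected), assumed scale

    @property
    def score(self) -> int:
        # Assumed scoring rule: higher means "both critical and likely".
        return self.impact * self.likelihood


def rank_risks(risks):
    """Return the risks ordered from highest score to lowest."""
    return sorted(risks, key=lambda r: r.score, reverse=True)


# Hypothetical entries for the example SME; the ratings are invented.
register = [
    Risk("Desktop PCs", "A single PC fails", impact=1, likelihood=4),
    Risk("Central data server", "Disk failure loses shared data", impact=4, likelihood=3),
    Risk("Internet connection", "Broadband line drops", impact=3, likelihood=3),
    Risk("Critical application server", "Unsupported hardware fails, no spares", impact=5, likelihood=3),
]

for risk in rank_risks(register):
    print(f"{risk.score:>2}  {risk.service}: {risk.what_could_go_wrong}")
```

The items at the top of the printed list are your risk areas. A spreadsheet with the same four columns would do the job just as well; the point is simply to answer all three questions for every service and compare the results.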
In the next article on this topic we will start to explore how these risks can be mitigated or removed, but I did say I would explain how we got the service back.
The starting point was to get a range of technical people into a team and brainstorm the problem above. These were our internal data warehouse specialists, members of other technical resolver teams and our problem manager to drive the discussion. What we ended up with was simple and worked!
The disks were taken to a data recovery company (due to the bespoke nature of the infrastructure we needed a specialist organisation to read them) who extracted the data files that were needed. They did this by "bodging" the server well enough to bring it to life, but not to a level that would sustain a service. The data files were then saved to media in a form that could be transferred to the new server. This did involve me getting in a car and doing a 6-hour round trip plus waiting time, but as this was the only copy of the data that existed we did not want to put the disks into the hands of a courier...
