The on-call IT tech is jolted awake from a terrible dream – his heart pounding. Lightning crashes overhead as he glances at the clock – 2:59 a.m. The server isn’t down; it was just a dream.
3:00 a.m. The IT on-call pager goes off. This could mean any number of things: a fire, a break-in, a failed air-conditioner in the server room, or even a main business server crash.
3:25 a.m. The on-call IT tech arrives at the site and evaluates the situation. There is no fire, no evidence of a break-in, and the server room temperature reads a cool 18°C. A quick check of the servers shows that most of them are at a login screen. After checking two or three machines, it is obvious that the room lost power at some point. The UPS units verify a failure; all three massive battery units are showing failures and heavy load percentages.
3:40 a.m. The on-call IT tech calls the lead technician and department manager and informs them of the situation; both are on their way to the site. They leave instructions to check the main business application servers; one of them holds the company’s customer database, payroll, and accounting system, and the other is the company’s messaging server.
3:55 a.m. The on-call IT tech discovers that the RAID array for the business database server is not coming back online. The messaging server has rebooted but the messaging application is returning errors when it starts up. The tech realizes that the messaging server was performing incremental backups during the time of the outage. The on-call IT tech decides to leave that to the lead technician when he arrives.
4:00 a.m. The lead tech and manager arrive. Assessments of the other servers are made. The lead tech begins working with the messaging server. The on-call tech works with the failed RAID array. The firmware shows the array has failed; the controller only recognizes three of the ten drives. After a complete power down and restart of the server and drive enclosure, the firmware shows the drives are back online; however, the array is still shown as ‘Failed’.
4:30 a.m. The on-call tech calls the RAID array manufacturer’s technical support. The choices in the firmware menu are vague, and the tech wants to know if forcing the drives online will bring the array back. The manufacturer’s technical support says that the array will come back; however, there is a slight possibility that the data on the volume may be corrupted. Technical support then asks how recent the latest backup is. The tech responds that the backup is one week old and that is unacceptable; the company cannot lose a week of transactions. The tech hesitates in deciding what to do next…
Business system disasters like this happen every day. Despite the redundancy in backup systems or storage array systems, failures occur. Some failures can be hardware related, others can be due to software, and still others are the result of human error or natural disaster.
More and more businesses rely on their corporate server structure and document storage volumes. Some businesses rely completely on their database system, which may hold financial data, job tracking data, or customer contact data. Other businesses rely wholly on their messaging database, which makes it a critical business system. Some telephone systems actually convert voice messages to email notifications, thereby using the email-messaging server as part of the communication system. Today’s systems are also storage systems for all of the documents that users create.
Common Scenarios of Server Data Disasters
Ontrack Data Recovery has been the undisputed leader in the industry with the most technologically advanced data recovery solutions available. We have been serving customers globally for nearly 20 years with offices, cleanrooms, engineers, and employees located around the world. During that time, we have seen many data loss situations ranging from commonplace to unique. Here is a sampling of specific types of disasters accompanied by actual engineering notes from recent Remote Data Recovery jobs. (Evaluation time represents the time it takes to evaluate the problem, make necessary file system changes to access data, and report on all of the directories and files that can be recovered.)
Causes of Partition/Volume/File System Corruption Disasters
Causes of Specific File Error Disasters
Possible Causes of Hardware Related Disasters
Causes of Software Related Disasters
Causes of User Error Disasters
Causes of Operating System Related Disasters
From legacy systems and post-mainframe storage devices to the latest high-end SANs, Ontrack Data Recovery works on them all. More important is the validity of the recovered data: the data must be usable to the client when we have completed the recovery.
Server Recovery Tips
Data disasters will happen; accepting that reality is the first step in preparing a comprehensive disaster plan. Time is always against an IT team when a disaster strikes, so the details of a disaster plan are critical for success.
Here are some suggestions from Ontrack Data Recovery engineers of what not to do:
Ontrack Data Recovery should be part of your disaster planning and your key personnel should be aware of our recovery capabilities. During an outage, it is common to have multiple recovery efforts going on at the same time. This makes sense because the goal is to get the company back to its data. The key to success is to get Ontrack Data Recovery involved as soon as possible.
One client early last year gave Ontrack Data Recovery this challenge: “We have a backup restoration going on right now and we need the data available as soon as possible. If you want the job, you have to beat the tape.” Recovery engineers worked the entire weekend to recover the more than 2 TB of data and make it available before the start of the work week.
Summary and Conclusion
The fictional, true-to-life IT scenarios at the beginning of this article illustrate the situations and decisions that IT staff must make. Businesses and institutions like yours, without access to their data, run the risk of losing millions in revenue every day. The fact is, today’s systems are relied on more than ever for consistent and available data.
Ontrack Data Recovery recognizes the importance of the speed and quality of recovery, especially on large servers. As your partner, we are continually researching new data recovery tools, improving our existing data recovery software tools, and expanding our recovery capabilities to meet your needs for immediate recovery of lost data, including data on large server systems. Successful disaster planning includes having Ontrack’s emergency number (1-800-872-2599) near your computer systems.