Preventing Data Loss from a RAID or Server Crash

April 14

CBL Data Recovery finds preventable mistakes at the core of data loss from RAID and server crashes

Each day CBL Data Recovery deals with multi-drive servers and RAID arrays that have crashed, causing panic, crippling data loss and potential financial loss for the companies that are dependent on those files. According to a recent IDC report, while worldwide server shipments were up 2.9% in 2014, the number of midrange servers jumped 21.2%, driven by companies’ goals of consolidating IT infrastructure and taking advantage of the new virtualization environments available. Translation: more and more small and mid-sized businesses are putting most of their day-to-day business needs [eggs] into a single server [basket]. Virtualized environments give a business the look and feel of having several different servers, functions and operating systems kept separate while actually running on the same hardware. You can have the company file share on one server, accounting and CRM on another, and online order tracking on another – yet all these services are running from one physical machine.

The small and mid-sized business segment is also the group that, unfortunately, neglects its IT investment the most. For the most part, small and mid-sized businesses will not have a dedicated IT department. At best, there might be a couple of tech-savvy employees who also have other day-to-day responsibilities to take care of. Perhaps it’s the lean market times we live in, perhaps it’s over-confidence in their own abilities, or perhaps it’s the apparent simplicity of running a server that contributes to the decision not to invest in maintenance or bring in a managed service provider or IT specialist on a regular basis.

A story we hear frequently at CBL from both IT groups and managed service providers is that a new small or mid-sized client called them after a server crash, and the client has no idea how or why the failure happened. Often the situation has gone so far awry, after multiple attempts by the client to rebuild a failed RAID array or restore incomplete or old backups, that full recovery is impossible. Let’s take a look at some of the hard truths behind these situations.

RAIDs Will Fail but Data Disasters Can Be Averted

All servers of any size, configuration, or RAID level will encounter some sort of hard drive failure during their lifetime – it is inevitable. Reacting to a single drive failure or offline drive needs to be done swiftly to avoid turning a temporary hiccup into a data disaster. The majority of mid-range servers on the market are designed as RAID 5 and many of these arrays will contain between 3 and 8 hard drives. A RAID 5 setup can sustain a single unit going offline, but not two.
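The one-drive limit follows from how RAID 5 parity works: each stripe stores one parity block that is the XOR of its data blocks. The sketch below is a toy illustration only. The block contents and the helper name xor_blocks are made up for the example, and a real controller also rotates parity across drives and works on much larger blocks.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# Three data blocks of one stripe (illustrative contents), each on its own drive,
# plus the parity block that RAID 5 stores on a fourth drive.
data = [b"ACCT", b"CRM!", b"FILE"]
parity = xor_blocks(data)

# One drive lost (the one holding data[1]): XOR of the surviving data blocks
# and the parity block gives the missing block back.
rebuilt = xor_blocks([data[0], data[2], parity])

# Two drives lost (data[1] and data[2]): the survivors only yield the XOR of
# the two missing blocks, which is not enough to recover either one.
mixed = xor_blocks([data[0], parity])
```

Here rebuilt equals the lost block, while mixed is only the combination of the two lost blocks. That is why a second failure during a degraded period turns a hiccup into a recovery job.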

Modern servers can usually be configured to email alerts to one or more system administrators, making them aware of the system status and potential drive issues, so there is almost no excuse for not knowing about and dealing with hard drive failures when they occur.
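What such an alert checks depends on the platform. As one hedged example, assume a Linux server using md software RAID, where the kernel reports array health in /proc/mdstat. Hardware RAID controllers use their own vendor tools instead and are not covered here. The sketch parses that report and lists arrays with missing members. The function name degraded_arrays and the SAMPLE text are hand-written for illustration, not taken from a real machine.

```python
import re

# Matches the member-status field in /proc/mdstat, e.g. "[4/3] [UUU_]":
# 4 members expected, 3 active, "_" marks the missing slot.
STATUS = re.compile(r"\[(\d+)/(\d+)\]\s+\[([U_]+)\]")

def degraded_arrays(mdstat_text):
    """Return {array_name: missing_member_count} for arrays that are not fully up."""
    degraded = {}
    current = None
    for line in mdstat_text.splitlines():
        header = re.match(r"^(md\S*)\s*:", line)
        if header:
            current = header.group(1)
            continue
        status = STATUS.search(line)
        if status and current:
            expected, active = int(status.group(1)), int(status.group(2))
            if active < expected:
                degraded[current] = expected - active
    return degraded

# Hand-written sample in the /proc/mdstat layout (an assumption for illustration).
SAMPLE = """\
Personalities : [raid6] [raid5] [raid4]
md0 : active raid5 sdd1[3](F) sdc1[2] sdb1[1] sda1[0]
      2929890816 blocks super 1.2 level 5, 512k chunk, algorithm 2 [4/3] [UUU_]

unused devices: <none>
"""

print(degraded_arrays(SAMPLE))
```

On a real host the text would come from reading /proc/mdstat, and a non-empty result would be handed to whatever mail or paging system the business already uses.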

Multi-drive servers and RAIDs can also be configured with ‘hot spare’ hard drives which, if set up correctly, will automatically take the place of a failed hard drive in the array. Even though disaster is averted, that doesn’t mean replacing the failed unit can wait. In fact, leaving an offline hard drive that was previously a member of the array physically installed can create all kinds of data integrity issues should the system reboot and that drive come back online and be accepted as a member again.

When that first drive fails, ordering a replacement hard drive and installing it should be done ASAP. Giving the server a complete check-up at that point can’t hurt either.

Upkeep Key to Uptime

While servers are designed to run 24/7, like anything else they still require a bit of maintenance from time to time. Keeping current with OS improvements, firmware updates, and security patches will help prevent system crashes and extend the life of a system. Sometimes these updates require a little downtime or a reboot, so it should be understood that having a server does not mean 100% uptime, but rather something closer to 99%. Accept this fact and schedule the time IT staff or a managed service provider needs to carry out system maintenance and confirm it is complete.
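To put the 99% figure in concrete terms, the arithmetic can be sketched as below. The function name is illustrative, and the calculation assumes a 365-day year.

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years

def downtime_hours_per_year(uptime_percent):
    """Hours per year a server may be offline at the given uptime percentage."""
    return HOURS_PER_YEAR * (100 - uptime_percent) / 100

budget = downtime_hours_per_year(99)   # about 87.6 hours per year
per_month = budget / 12                # about 7.3 hours per month
```

Seen this way, 99% uptime leaves roughly seven hours a month for patching, firmware updates and hardware checks, which is usually enough if the windows are planned ahead.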

Let’s remember that servers don’t just have software to be maintained. IT staff or managed service providers should also make sure that the physical components are all in working condition. They can check the CPU cooling fans, system cooling fans, redundant power supplies, and hard drives by opening up the server case and visually inspecting what is going on inside. The uninterruptible power supply (UPS) along with your all-important backups can be checked at the same time.

Backing up Only Part of Disaster Response

If the most common piece of advice anyone ever gives in IT is to “backup, backup, backup” perhaps the least common piece of advice is to “verify your backup”. Locating the most recent server backup and restoring the files to a secondary server, or even a temporary file share, should not be a complex process. Frequently CBL deals with businesses and IT groups who have tried to address their failed server or RAID situation ineffectively, discovering afterwards that their last known good backup was either non-existent, failed, or out of date. Having a disaster response plan is essential, and should include simulating and recovering from disasters.
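One way to verify a backup is to restore it to a temporary location and compare file checksums against a known-good reference. The sketch below is an assumption about how such a check might look, not a description of any particular backup product. It compares a restored directory tree with a source tree, which only makes sense if the source has not changed since the backup was taken; a manifest of checksums saved at backup time would serve the same purpose. All function names are illustrative.

```python
import hashlib
from pathlib import Path

def file_sha256(path, chunk_size=1 << 20):
    """SHA-256 of a file, read in chunks so large files do not fill memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def tree_digests(root):
    """Map each file's path (relative to root) to the SHA-256 of its contents."""
    root = Path(root)
    return {
        path.relative_to(root).as_posix(): file_sha256(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }

def verify_restore(source_root, restored_root):
    """Return (missing, mismatched) lists of relative paths in the restored tree."""
    source = tree_digests(source_root)
    restored = tree_digests(restored_root)
    missing = sorted(set(source) - set(restored))
    mismatched = sorted(
        path for path in source
        if path in restored and source[path] != restored[path]
    )
    return missing, mismatched
```

Two empty lists mean every source file came back with identical contents. Anything else is worth knowing before the original server is wiped.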

Understanding the full status of a backup allows for a better decision-making process when it comes to addressing a server failure. If one assumes the backup is complete, or if by design the backup can only be restored to the same server, IT staff or the managed service provider must do whatever is required to “get the server up and running”. This can include replacing and rebuilding hard drives, initializing the RAID, and reinstalling the operating system. Discovering that the important data is not actually in the “assumed good” backup, after the server has been wiped clean and overwritten, can make one feel sick.

If a backup is tested and found to be inaccessible or incomplete, how the IT staff or managed service provider approaches the server rebuild must change. ‘Trying this and trying that’ without keeping a keen eye on the company’s important data is simply not good enough. Only experienced IT staff who understand what their actions are actually doing behind the scenes should be working on the failed system. Does it make sense to re-seat the drives? Can the unit be brought back to degraded mode? Does replacing the RAID controller card import the existing RAID setup or allow for manual input? Will replacing that same RAID controller card result in the automatic initialization of the environment? How long will all these rebuild procedures take? These are very basic questions which need to be fully considered, and the answers can completely change the direction IT staff take when handling a server failure.

Finally, if all data can be restored to a temporary space, then replacing the media and re-installing the system on the server becomes a less stressful experience. A backup plan that allows restoration to a secondary server should be treated as a critical design feature in any company’s server environment.

While the best-laid plans can go awry, addressing many of the potential failure and recovery situations at the server design phase, having the necessary equipment in place, and putting time into regular maintenance can significantly help in avoiding data loss due to server crashes. By investing a small amount of time and money, mainly in design and good planning, small and mid-sized businesses can happily work for years with a midrange server or RAID array.

Category: data loss prevention

Tags: data center, data loss prevention, failed raid, raid array failure, verified backup
