GNU ddrescue – Saving a Failing, Corrupt RAID Array

Overview

If you’re reading this, you’re probably having an awful day, trying to save as much data as you can from a failing hard drive or RAID array. I hope this helps. The usual candidates for cloning a dying disk are:

  1. dd: The original, direct cloning utility. Simple, old-school, and limited.
  2. dd-rescue/dd_rescue: Written by Kurt Garloff; still active and viable.
  3. ddrescue: GNU ddrescue, the bee’s knees of data recovery.

GNU ddrescue is the only utility I can recommend for emergency data extraction, for the following reasons:

  1. Gracefully handles failed blocks, unreadable sectors, and other errors, and keeps going until the job is done.
  2. Able to resume where it left off, run in reverse, and skip past failed blocks/sectors.
  3. Provides a reasonable time estimate and progress bar.

I am running ddrescue from a bootable ISO of the wonderful, useful, life-saving SystemRescueCD.

Setting

A completely avoidable scenario, had the recommendations been followed and proper backups been in place.

I was handed a nine-year-old PowerEdge T320 with a 4-drive RAID5 array (I know, right?), where upgrades and replacements had been rejected for at least five years and backups were refused (a “freebie” USB HDD with Veeam Free was installed to CYA) until it was too late. Who gets to do the cleanup? That’s right, me.

I connected a second 4-drive RAID array (RAID10) as the target, on the same PERC RAID card. I had to borrow a power supply from another server because of the power connector required for the custom SAS>SATA splitter. This at least let everything run at the highest throughput the RAID card could handle.

The amount of data to recover was 5TB. After this was all over, the total amount of unrecoverable data that caused this absolute nightmare ended up totaling 4KB, across 8 failed 512B sectors. Insignificant, but enough to turn this process from inconvenient to painful.

Macrium Reflect, Norton Ghost 3.0 (old-school), Clonezilla, Acronis TrueImage: absolutely everything failed out with I/O errors or unreadable sectors. Even with “ignore invalid sectors” checked, every attempt still halted. Enter GNU ddrescue to save the day.

The “Save the day” Code Block

/dev/sda is the source (unhealthy) RAID array.
/dev/sdb is a healthy target RAID array I added onto the server, using the same RAID card with a second SAS>SATA connector.

In this case, I am mirroring bare disk to disk. ddrescue can also rescue the source into an image file and later restore from that image to a disk; a sketch of that route follows below.
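For reference, the image-based route looks roughly like the two commands below. The mount point, image name, and mapfile names here are placeholders; adjust them to wherever your image will live.
ddrescue -n /dev/sda /mnt/target/sda.img sda-image.map    (rescue the failing array into an image file)
ddrescue -f /mnt/target/sda.img /dev/sdb restore.map    (later: write the rescued image out to a healthy disk)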

Identify your disks. Any of the below can get the info for you.
fdisk -l
hwinfo --short
lsblk -o name,label,size,fstype,model
Drive: Going FROM: /dev/sda TO: /dev/sdb
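Before letting anything write to /dev/sdb, one last sanity check that source and target are what you expect (sizes and models should line up with the failing and healthy arrays) doesn’t hurt:
lsblk -d -o name,size,model,serial /dev/sda /dev/sdb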
-f 		Allow (over)writing the output device
-n 		Skip the scraping/splitting phase (massive speed improvement for unhealthy drives)
-O 		Reopen the input after every read error
-K #M 		When encountering an error, skip forward # megabytes. Extremely important if your rescue slows to a crawl after hitting an error.
-R		Reverse mode. Read from the end of the drive towards the beginning.
recovery.log	The mapfile. Allows the transfer to be stopped and resumed at any point.


ddrescue -f -n -O -K 1M /dev/sda /dev/sdb recovery.log (First half)
ddrescue -f -n -O -K 1M -R /dev/sda /dev/sdb recovery.log (Reverse - Second Half)
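Because progress is tracked in recovery.log (the mapfile), you can stop either command and re-run it unchanged at any time; ddrescue picks up where it left off. The ddrescuelog tool that ships alongside GNU ddrescue can read the same mapfile and summarize how much has been rescued and how much is still marked bad:
ddrescuelog -t recovery.log    (show totals for rescued, non-tried, and bad areas)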

There is also an option to force failed reads to be retried a number of times. In my experience, retries were not able to get past these unreadable sectors; an example follows below anyway.
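For completeness, that option is -r (retry passes), often combined with -d for direct disc access so the kernel cache doesn’t get in the way. In my case it only burned time, but on a marginal drive a final pass like this can sometimes pull back a few more sectors:
ddrescue -f -d -r3 /dev/sda /dev/sdb recovery.log    (re-run against the same mapfile; retry the remaining bad areas up to 3 times)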

-O forces a fresh open of the input after a read error. By default, when you hit a bad block, ddrescue drops its extraction speed from full (e.g. 500MB/s in my case) to safe (4KB/s…). -O forces it to go back to attempting full speed until it hits the next bad block, at which point it reopens the input again.

-K #M was the magic option that made recovery complete in a reasonable amount of time. -O helps significantly, but on its own it was not enough.

-K #M does skip data, but if that data is already unrecoverable, or will take an hour to read a single 512B sector, it’s not worth bothering with under a time crunch. The #M is the “distance” to jump past the bad sector, and it can be lowered or raised depending on how much data you feel “safe” risking; see the examples below.
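As a rough illustration (these values are examples, not recommendations), trading safety against speed is just a matter of changing the -K argument:
ddrescue -f -n -O -K 1M /dev/sda /dev/sdb recovery.log    (what I used: jump 1MB past each bad spot)
ddrescue -f -n -O -K 16M /dev/sda /dev/sdb recovery.log    (more aggressive: bigger jumps, faster finish, more data potentially skipped)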

Final result

In the end, the data was successfully mirrored/cloned to the new array. Those 4KB of damaged sectors (8 sectors, skipped 1MB at a time, for a total skipped area of 8MB) did not affect functionality and are presumed to have been deleted/stale data.

After DB integrity checks, everything checked out healthy, production resumed, and the client finally agreed to get a new server and a disaster recovery solution.

Good luck in your efforts; I hope this was helpful.
