It was bound to happen. After 4+ years of running multiple NAS units 24x7, I finally ended up in a situation that brought my data availability to a complete halt. Even though I perform a RAID rebuild as part of every NAS evaluation, I had never needed to do one in the course of regular usage. On two occasions (once with a Seagate Barracuda 1 TB drive in a Netgear NV+ v2 and another time with a Samsung Spinpoint 1 TB drive in a QNAP TS-659 Pro II), the NAS UI complained about increasing reallocated sector counts on the drive, and I promptly backed up the data and reinitialized the units with new drives.
Failure Symptoms
I woke up last Saturday morning to incessant beeping from the recently commissioned Synology DS414j. All four HDD lights were blinking furiously and the status light was glowing orange. The unit's web UI was inaccessible. Left with no other option, I powered down the unit with a long press of the front panel power button and restarted it. This time around, the web UI was accessible, but I was presented with the dreaded message that there were no hard drives in the unit.
Data Availability at Stake
In my original DS414j review, I had indicated its suitability as a backup NAS. After prolonged usage, it was re-purposed slightly. The Cloud Station and related packages were uninstalled as they simply refused to let the disks go to sleep. However, I created a shared folder for storing data and mapped it on a Windows 8.1 VM in the QNAP TS-451 NAS (that is currently under evaluation). By configuring that shared folder as the local path for QSync (QNAP's Dropbox-like package), I intended to get any data uploaded to the DS414j's shared folder backed up in real time to the QNAP TS-451's QSync folder (and vice-versa). The net result was that I was expecting data to be backed up irrespective of whether I uploaded it to the TS-451 or the DS414j. Almost all the data I was storing on the NAS units at that time was being generated by benchmark runs for various reviews in progress.
My first task after seeing the 'hard disk not present' message on the DS414j web page was to ensure that my data backup was up to date on the QNAP TS-451. I had copied over some results to the DS414j on Friday afternoon, but, to my consternation, I found that QSync had failed me. The updates that had occurred in the mapped Samba share hadn't reflected properly on to the QSync folder in the TS-451 (the last version seemed to be from Thursday night, which leads me to suspect that QSync wasn't doing real-time monitoring / updates, or, it was not recognizing updates made to a monitored folder from another machine). In any case, I had apparently lost a day's work (machine time, mainly) worth of data.
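A silent sync lag of this kind can be caught with a periodic comparison of the two folders. The following is a minimal sketch, not anything QSync itself provides: the function name `find_stale_files`, the two-second tolerance and the idea of comparing modification times are all assumptions made for illustration. It walks the source tree and reports every file that is missing from the backup tree or older there.

```python
import os
from pathlib import Path


def find_stale_files(source_root, backup_root, tolerance_s=2.0):
    """Return relative paths that are missing from, or older in, the backup tree.

    tolerance_s is a hypothetical allowance for timestamp rounding on
    network shares; tune it for the file systems involved.
    """
    source_root = Path(source_root)
    backup_root = Path(backup_root)
    stale = []
    for dirpath, _dirnames, filenames in os.walk(source_root):
        for name in filenames:
            src = Path(dirpath) / name
            rel = src.relative_to(source_root)
            dst = backup_root / rel
            if not dst.exists():
                # The file never reached the backup side.
                stale.append(rel.as_posix())
            elif src.stat().st_mtime - dst.stat().st_mtime > tolerance_s:
                # The backup copy predates the latest change to the source.
                stale.append(rel.as_posix())
    return sorted(stale)
```

Run against the mapped DS414j share and the QSync folder, a non-empty result would have flagged Friday's uncopied results well before the hardware failed.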
Given the failure symptoms (and the low probability of all four hard drives in the DS414j failing at the same time), I was cautiously optimistic about recovering the data from the drives. One option would have been to put the four drives in another DS414j (or another 4-bay Synology NAS unit) and hope that disk migration would work. However, with no access to such a unit, this option was quickly ruled out.
In many of our NAS reviews, I had seen readers ask questions about data recovery from the units using standard PCs. In the review of the LG N2A2 NAS, I had covered data recovery from an mdadm-based RAID-1 volume disk member using UFS Explorer Standard Recovery v4.9.2. Since then, I have tended to prefer open source software while keeping ease of use in mind.
Recovering RAID-1 Data using UFS Explorer Standard Recovery
Searching online for data recovery options for a failed Synology NAS didn't yield any particularly promising results for Windows users. From an open source perspective, Christophe Grenier's TestDisk appeared to be able to perform the task. However, with no full featured GUI and / or instructions for recovery in this particular case (4-disk RAID-5 volume), I fell back upon UFS Explorer for a quicker turn-around. My only worry was that I hadn't used standard RAID-5 while creating the volume, but Synology Hybrid RAID (SHR) with 1-disk fault tolerance. Though it was effectively RAID-5 with the 4x 2TB drives in the NAS, I wasn't completely sure whether the software would recognize the RAID volume.
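Why SHR with four equal disks behaves like plain RAID-5 can be illustrated with a simplified model. The sketch below is an assumption about how 1-disk-tolerant hybrid RAID is commonly described (disks sliced into layers by size, each layer built as RAID-5 or RAID-1), not Synology's actual implementation, and `shr_layers` is a hypothetical helper name.

```python
def shr_layers(disk_sizes_tb):
    """Simplified model of a 1-disk-tolerant hybrid RAID layout.

    Returns one (level, member_count, usable_tb) tuple per layer.
    This is an illustrative approximation, not Synology's code.
    """
    layers = []
    previous = 0
    sizes = sorted(disk_sizes_tb)
    for i, size in enumerate(sizes):
        height = size - previous          # capacity slice added by this size step
        if height <= 0:
            continue                      # same size as the previous disk
        members = len(sizes) - i          # disks large enough to join this layer
        if members >= 3:
            layers.append(("RAID-5", members, height * (members - 1)))
        elif members == 2:
            layers.append(("RAID-1", members, height))
        # A slice present on only one disk has no redundancy partner; it stays unused.
        previous = size
    return layers


print(shr_layers([2, 2, 2, 2]))
```

With four 2 TB disks the model produces a single four-member RAID-5 layer with 6 TB usable, which is why a tool that understands standard RAID-5 has a reasonable chance of recognizing the volume. Mixed disk sizes would produce more than one layer, and recovery would be correspondingly less straightforward.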
Synology does have a FAQ entry covering this type of unfortunate event for users willing to work with Ubuntu. This involves booting Ubuntu on a PC with the drives connected, installing mdadm and using that to recognize the RAID volume created by the Synology NAS.
Data Recovery from Synology NAS Drives using Ubuntu
The pros and cons of the two data recovery software alternatives are summarized below:
- Windows + UFS Explorer
  - Pro - Intuitive and easy to use / minimal effort needed for users running Windows on the relevant PC
  - Con - A licensed version of UFS Explorer costs around $200
- Ubuntu + mdadm
  - Pro - Free
  - Con - Complicated for users without knowledge of Linux / users not comfortable with the command line
  - Con - Synology's FAQ doesn't cover all possible scenarios
Evaluating the Hardware Options
UFS Explorer can take in disk images for RAID reconstruction. The hardware in my possession that came to mind immediately was our DAS testbed (the Asus Z97-PRO (Wi-Fi ac) in the Corsair Air 540 with two hot-swap bays configured) and the recently reviewed LaCie 2big Thunderbolt 2 / USB 3.0 12 TB DAS unit. My initial plan was to image the four drives one by one into the DAS and then load the images into UFS Explorer. I started the imaging of the first drive (using ODIN) and it indicated a run time of around 4.5 hours for the disk. After starting that process, I began to rummage through my parts closet and came upon the StarTech SATA duplicator / eSATA dock that we had reviewed back in 2011. Along with that, I also happened to get hold of an eSATA - SATA cable.
The Asus Z97-PRO (Wi-Fi ac) in our DAS testbed had two spare SATA slots (after using two for the hot-swap bays and one each for the boot SSD and the Blu-ray drive). Now, it would have been possible for me to bring out the two SATA ports and appropriate power cables from the other side of the Corsair Air 540 chassis to connect all four drives simultaneously, but I had decided against it because of the positioning of the SATA ports on the board (I would have considered it had the ports been positioned vertically, but all six on the board are horizontal relative to the board surface). However, with the StarTech dock, I just had to connect the eSATA - SATA cable to one of the ports. There was no need to bring out the SATA power cables from the other side either (the dock had an external power supply).
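The 4.5-hour estimate for one drive translates into a rough throughput and a total time for the image-everything approach. This is a back-of-the-envelope sketch that assumes a decimal 2 TB per drive, the same imaging time for each of the four drives, and strictly sequential imaging; `imaging_estimate` is a hypothetical helper.

```python
def imaging_estimate(drive_bytes, hours_per_drive, drive_count):
    """Return (throughput in MB/s, total hours) for imaging drives one by one."""
    seconds = hours_per_drive * 3600
    throughput_mb_s = drive_bytes / seconds / 1e6
    total_hours = hours_per_drive * drive_count
    return throughput_mb_s, total_hours


rate, total = imaging_estimate(2e12, 4.5, 4)
print(f"{rate:.0f} MB/s, {total:.0f} hours")  # prints: 123 MB/s, 18 hours
```

Under those assumptions, imaging all four drives would take most of a day before recovery could even begin, which makes connecting the drives directly an attractive alternative.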
Our DAS testbed runs Windows with a 400 GB Seagate boot SSD as the only SATA drive permanently connected to it. I wasn't about to install Ubuntu / dual boot this machine for this unexpected scenario, but a live CD (as suggested in Synology's FAQ) with a temporary mdadm installation was also not to my liking (in case I needed to reuse the setup / had to reboot in the process). Initially, I tried out a 'live CD with persistence' install on a USB drive. In the end, I decided to go with a portable installed system, which, unlike a persistent install, can be upgraded / updated without issues. I used a Corsair Voyager GT USB 3.0 128 GB thumb drive to create an 'Ubuntu-to-go' portable installation in which I installed mdadm and lvm2 manually.
Ubuntu + mdadm
After booting Ubuntu with all the four drives connected, I first used GParted to ensure that all the disks and partitions were being correctly recognized by the OS.
Synology's FAQ presents the ideal scenario where the listed commands work magically to provide the expected results. But, no two cases are really the same. When I tried to follow the FAQ directions, I ended up with a 'No arrays found in config file or automatically' message. No amount of forcing the array assembly helped.
After reading through the man pages, I decided to look at mdstat and found that an md127 device was actually being recognized from the Synology RAID operations. Unfortunately, all the drives had come up with a (S) spare tag. I experimented with some more commands after going through some Ubuntu forum threads.
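The 'everything is a spare' condition is easy to spot programmatically. The sketch below parses text in the format of `/proc/mdstat` and lists arrays whose members all carry the (S) tag. The sample text is made up for illustration (the partition names are assumptions, not output from the machine described here), and `arrays_with_only_spares` is a hypothetical helper.

```python
import re

# Illustrative sample only; device and partition names are assumptions.
SAMPLE_MDSTAT = """\
Personalities : [raid6] [raid5] [raid4]
md127 : inactive sdd5[3](S) sdc5[2](S) sdb5[1](S) sda5[0](S)

unused devices: <none>
"""


def arrays_with_only_spares(mdstat_text):
    """Return md device names whose listed members are all tagged (S)."""
    stuck = []
    for line in mdstat_text.splitlines():
        match = re.match(r"^(md\d+)\s*:\s*(\S+)\s+(.*)$", line)
        if not match:
            continue
        name, _state, rest = match.groups()
        # Each member looks like "sda5[0]" optionally followed by a flag such as "(S)".
        members = re.findall(r"(\w+)\[\d+\](\([A-Z]\))?", rest)
        if members and all(flag == "(S)" for _dev, flag in members):
            stuck.append(name)
    return stuck


print(arrays_with_only_spares(SAMPLE_MDSTAT))
```

On a live system, the same function could be fed the contents of `/proc/mdstat` read as a text file.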
The trick (in my case) seemed to lie in actually stopping the RAID device with '--stop' prior to executing the forced scan and assemble command suggested by Synology. Once this was done, the RAID volume automatically appeared as a Device in Dolphin (a file explorer program in Ubuntu).
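The order of operations described above can be sketched as a short command list. The `--stop`, `--assemble`, `--scan` and `--force` options are standard mdadm flags, but the exact invocation used here is not given in the text, so treat the sequence as an assumption. The device name `/dev/md127` follows the mdstat observation above, and the optional `vgchange -ay` step (to activate LVM volumes) is an assumption that may not be needed if the volume appears on its own. The script only prints the commands; it does not run them.

```python
import shlex


def recovery_commands(md_device="/dev/md127", activate_lvm=False):
    """Build the stop-then-reassemble command sequence as argument lists."""
    commands = [
        ["mdadm", "--stop", md_device],                # release the half-assembled array
        ["mdadm", "--assemble", "--scan", "--force"],  # forced scan and assemble
    ]
    if activate_lvm:
        # Assumption: only relevant when the data volume sits on LVM.
        commands.append(["vgchange", "-ay"])
    return commands


for cmd in recovery_commands():
    print("sudo " + shlex.join(cmd))
```

Printing rather than executing keeps the sketch safe to run anywhere; the commands themselves need root privileges and should only be issued with the drives connected read-carefully, ideally after checking the mdadm man page for the installed version.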
The files could then be viewed and copied over from the volume to another location. As shown above, the ~100 GB of data was safe and sound on the disks. Given the amount of time I had to spend searching online about mdadm, and the difficulties I encountered, I wouldn't be surprised if users short on time / with little knowledge of Linux decide to go with a Windows-based solution even if it costs money.
Windows + UFS Explorer
Prior to booting into Windows, I had all the four drives and the LaCie DAS connected to our DAS testbed. The four drives were recognized as having unknown partitions (thanks to most of them being in EXT4 format). However, that was not a problem for UFS Explorer. All the partitions in all the connected drives were recognized correctly and the program even presented the reassembled RAID-5 volume at the very end.
After this, it was a simple process of highlighting the appropriate folders in the right pane and saving them to one of the disks in the DAS.
Fortunately, I had only around 100 GB of data in the DS414j at the time of failure, and I got done with the recovery process less than 10 hours after waking up to the issue.
After ensuring the safety of the data, I tried to reinitialize the NAS using the same drives, but the process conked out towards the end with a message indicating that the process was incomplete and telnet service had been enabled for support personnel to look into. Not satisfied with this situation, I rebooted the unit with the drives removed and the reset button at the back pushed in. After the reboot, I inserted two fresh drives and began the initialization process.
DSM Initialization Process 3 Hours After Starting
DSM Initialization Process More than 8 Hours After Starting
The completion process was stuck at 42% with the second drive's LED blinking continuously for more than 48 hours before I pulled the plug. Synology Assistant had shown the status of the unit as 'Upgrading' the whole time.
Synology support asked me to try initializing with a single drive (completely new, fresh out of the box) in another bay. This changed the failure symptom, but didn't resolve the issue.
The unit was shipped back to Synology and they found that one of the circuit board elements had simply died (no burn-out or any other marks). So, it did turn out to be an internal hardware failure in the end. A small percentage of products shipped by any vendor invariably fails in the early stages of being in the consumer's hands. It was just unfortunate for Synology that it happened to a review unit sampled to the press.
My Saturday plans went haywire, thanks to the DS414j going belly-up. However, I did end up proving that, as long as the disks are functional, it is possible to easily recover data from a Synology RAID-5 volume by connecting the drives to a PC and using UFS Explorer. Users wanting more of a challenge can also use Ubuntu and mdadm for the same purpose. In my case, the data was in an SHR (Synology Hybrid RAID) volume with 1-disk redundancy, but the disks were all of the same size (making it RAID-5 effectively).
Lessons that I learned from my data recovery experience:
- Have access to a PC with multiple spare SATA slots, preferably hot-swap capable
- Back up data written to a NAS frequently (if possible, in real-time)
- Have access to a high capacity DAS (with more free space than the largest NAS volume that you may have to recover)
- Avoid encrypting shared folders and/or volumes, if possible
- Prefer straightforward RAID-x volumes compared to customized (note: customized need not necessarily mean proprietary) RAID implementations and/or automatic RAID level management (such as Synology's SHR / Seagate's SimplyRAID / Netgear X-RAID2)
- In critical environments, run two NAS units in high availability (HA) mode
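The free-space lesson is simple to check before starting a long copy. A minimal sketch using Python's standard `shutil.disk_usage`; the function name `has_room_for` and the 5% safety margin are assumptions chosen for illustration.

```python
import shutil


def has_room_for(target_path, needed_bytes, margin=0.05):
    """Return True if target_path has room for needed_bytes plus a safety margin."""
    free = shutil.disk_usage(target_path).free
    return free >= needed_bytes * (1 + margin)


# Example: is there room for roughly 100 GB in the current directory's file system?
print(has_room_for(".", 100 * 10**9))
```

Pointing `target_path` at the mounted DAS volume before a recovery run avoids discovering a full disk hours into the copy.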
Things I would like from the NAS vendors' side (Synology already ticks most of these):
- Don't use proprietary RAID / hardware RAID for consumer NAS units
- Instead of (or, in addition to) supplying backup software, provide licensed versions of data recovery software such as UFS Explorer (or, supply one developed internally for Windows / Mac / Linux)
- Provide official documentation for recovering data using PCs in case of NAS hardware failure (using either commercial software such as UFS Explorer or open source ones like TestDisk)
Synology alone is not to blame for this situation. If QNAP's QSync had worked properly, I could have simply tried to reinitialize the NAS instead of going through the data recovery process. In any case, I would like to stress that this anecdotal sample point in no way reflects the reliability of Synology's NAS units. I used to run a DS211+ 24x7 without issues for 3 years before retiring it. More recently, our Synology DS1812+ has been running 24x7 for the last year as a syslog server. The DS414j which failed on me had been in operation for less than two months. I put it down to the 'infant mortality' component in the reliability engineering 'bathtub curve'. Synology provides a 2-year warranty on the DS414j, and any end-users affected by such hardware issues are definitely protected. One just needs to make sure that the data on the NAS is backed up frequently.