Troubleshooting my offline Zpool

It’s a quiet Sunday, and I wasn’t planning on writing an article.

There I was, copying files and doing some maintenance, when I noticed my network drive was offline. I figured I must have done something dumb, so I logged into my server and checked. My 8 x 6TB IronWolf RAID-Z2 ZFS array was offline. So much for a quiet day.

Four of the eight disks were showing errors, and the ‘lsblk’ command could find only four of the eight disks:

Where have my drives gone?
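A check like this can also be scripted rather than eyeballed. The sketch below is a minimal, hypothetical example: it assumes the text output of `lsblk -d -n -o NAME,TYPE` has already been captured, and the device names in `EXPECTED` and `SAMPLE` are made up for illustration. Note that sdX names can change between reboots, so a real check would probably be better keyed on the names under /dev/disk/by-id.

```python
def list_disks(lsblk_output: str) -> list[str]:
    """Return whole-disk names from `lsblk -d -n -o NAME,TYPE` output."""
    disks = []
    for line in lsblk_output.splitlines():
        fields = line.split()
        # Each line is expected to look like "sda disk"; skip anything else
        # (for example "sr0 rom" or blank lines).
        if len(fields) == 2 and fields[1] == "disk":
            disks.append(fields[0])
    return disks


def find_missing(expected: set[str], lsblk_output: str) -> list[str]:
    """Return the expected disks that lsblk did not report, sorted by name."""
    return sorted(expected - set(list_disks(lsblk_output)))


# Hypothetical names for the eight pool disks (an assumption, not from the post).
EXPECTED = {"sda", "sdb", "sdc", "sdd", "sde", "sdf", "sdg", "sdh"}

# Illustrative captured output: only four of the eight pool disks are present.
SAMPLE = """\
sda disk
sdb disk
sdc disk
sdd disk
nvme0n1 disk
sr0 rom
"""

if __name__ == "__main__":
    print("Missing disks:", find_missing(EXPECTED, SAMPLE))
```

Run against the sample text, this reports the four absent disks by name, which makes it easier to see at a glance whether the same drives vanish each time.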

In fact, I was a little relieved: one drive error might be real, but four at once is probably a glitch. Hopefully software, but I had to troubleshoot to find out. Here’s what I did. First, a server reboot, which should clear any software issue. It almost worked, too: the drives reappeared, and the RAID array came back to life. But then it died a few minutes later during a scrub I had initiated. Again, FOUR disks gave errors. It’s probably not the software.

So I rebooted the server, logged into the IPMI interface and spammed the delete key a few times so I could interrupt the boot and enter the BIOS setup screen of my H12SSi-NT motherboard. I wanted to see what the motherboard could detect. The H12 motherboard has a pair of slim-SAS connectors, and I was using all of one of them:

Both 8-port SATA connectors showed up, but I still wondered if the port I was using was somehow at fault (it’s a new motherboard… and it wouldn’t make me smile if it was dead already). So I powered off, moved the cable to the other SAS port and rebooted.

At power-up, however, the zpool array was still dead with four drives not showing.

Believe it or not, I felt BETTER: the chances of both SAS ports being faulty are…low. And if both ports were working properly, then it’s probably NOT the motherboard. Remember that I said four drives were dead? Well, each group of four drives is powered by a separate power cable connected to the single power supply. Could this be a dodgy power connection?
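The reasoning here is to look for a component that every failed drive shares and no healthy drive uses. As a rough illustration, the Python sketch below does that over a wiring map. The drive names, port label and cable labels are all hypothetical; the only details taken from the story are that all eight drives hang off one SAS port and that each power cable feeds four drives.

```python
def shared_suspects(wiring: dict[str, dict[str, str]],
                    failed: set[str]) -> list[tuple[str, str]]:
    """Return (kind, component) pairs used by every failed drive and by no
    healthy drive. Every name in `failed` must be a key of `wiring`."""
    healthy = set(wiring) - failed
    suspects: list[tuple[str, str]] = []
    if not failed:
        return suspects
    # Collect every kind of component mentioned in the map.
    kinds = sorted({kind for parts in wiring.values() for kind in parts})
    for kind in kinds:
        failed_parts = {wiring[d][kind] for d in failed}
        healthy_parts = {wiring[d][kind] for d in healthy}
        # One shared component among the failures, untouched by healthy drives.
        if len(failed_parts) == 1 and not (failed_parts & healthy_parts):
            suspects.append((kind, next(iter(failed_parts))))
    return suspects


# Hypothetical wiring: one SAS port for all drives, two power cables of four.
WIRING = {
    f"disk{i}": {"sas_port": "A", "power_cable": "P1" if i <= 4 else "P2"}
    for i in range(1, 9)
}

if __name__ == "__main__":
    failed = {"disk5", "disk6", "disk7", "disk8"}
    print(shared_suspects(WIRING, failed))
```

With four failures that line up with one power cable, the SAS port drops out as a suspect because the healthy drives use it too, and only the power cable remains. A single failed drive returns no shared suspect, which points back at the drive itself.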

So I took the cover off and jiggled the SATA power leads a little on each drive and on each power connector to the power supply. All the leads were clicked in place, so I couldn’t easily see a problem. But I rebooted anyway, as it’s an easy check. Wonder of wonders, on power-up, all eight drives reappeared and the zpool imported without issue.

As I type, I am scrubbing the zpool…but I am also going to order a new SATA power cable as I can’t really expect a ‘cable-jiggle’ to be a good long-term solution.
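While a scrub runs, the per-device table printed by `zpool status` is the place to watch for trouble. The sketch below is one hypothetical way to pull out the rows worth a second look. It assumes the usual NAME / STATE / READ / WRITE / CKSUM column layout, and the pool name `tank`, the device names and the error counts in `SAMPLE` are invented for illustration, not taken from my server.

```python
def rows_with_problems(status_text: str) -> list[str]:
    """Return names of config rows (pool, vdev or disk) in `zpool status`
    output whose state is not ONLINE or whose error counters are not all 0."""
    problems = []
    in_table = False
    for line in status_text.splitlines():
        fields = line.split()
        if fields[:2] == ["NAME", "STATE"]:
            in_table = True  # the header row marks the start of the table
            continue
        if not in_table:
            continue
        if not fields:
            break  # a blank line ends the device table
        if len(fields) < 5:
            continue  # section labels such as "logs" or "spares"
        name, state, read, write, cksum = fields[:5]
        # Counters are compared as text, so abbreviated values such as
        # "1.2K" also count as non-zero.
        if state != "ONLINE" or (read, write, cksum) != ("0", "0", "0"):
            problems.append(name)
    return problems


# Illustrative output only; every name and number here is made up.
SAMPLE = """\
  pool: tank
 state: DEGRADED
config:

    NAME        STATE     READ WRITE CKSUM
    tank        DEGRADED     0     0     0
      raidz2-0  DEGRADED     0     0     0
        sda     ONLINE       0     0     0
        sdb     ONLINE       0     0     0
        sde     FAULTED      3    12     0
        sdf     ONLINE       0     0     7

errors: No known data errors
"""

if __name__ == "__main__":
    print(rows_with_problems(SAMPLE))
```

The pool and vdev rows show up in the result as well as the disks, since they inherit a degraded state from the devices beneath them.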

I also moved my SAS cable back to the original port, as the cabling was less strained there (I would have to re-route the cable to use the other port permanently):

So the GOOD news is, I think it’s an inexpensive problem: a power lead. The BETTER news is that by systematically checking the potential problems, I have a likely root cause and a short-term fix (‘jiggling power leads’). I also have a concrete plan for eliminating it (i.e. buy new (different?) power lead(s) for the drives).

The takeaway? Check one thing at a time. 🙂

Enjoy your Sunday!

By Andrew Wilson

Twitter @OGSelfHosting