One of the sysadmin days of all time

All I needed to do was replace a single hard drive.

I’d been receiving notices about uncorrectable sectors on a drive in one of my FreeBSD app servers, and I knew I ought to replace it sometime soon. With my weekend plans canceled, I figured it would be a good time to get that over with. I picked out a new hard drive to be the replacement, pulled up some documentation on replacing disks in a ZFS mirror, ate breakfast, and got to work.

First, I disabled automatic startup of my jails and VMs. I anticipated needing to do some adjusting after reboot, and I didn’t want a bunch of things competing for the disk during all that. Then I shut down the machine and replaced the hard drive. I should have known that actually screwing the new drive into position was an act of pure hubris.

Problem the first: the machine wasn’t booting from the second hard drive that was in the original mirror. I had somewhat anticipated this since my previous experience with mirrored boot disks had taught me that the boot code isn’t always properly installed on all drives in the mirror. No matter, I thought, and I plugged the old hard drive back in to try booting up and setting up the second drive from a running system.

For those reading in the future: this issue should be fixed in FreeBSD 14.4 and 15.0.

After booting up, I copied two jails over to a different host: one running AdGuard Home (without which I had no DNS unless I wanted to adjust my router settings) and one running ZNC so I could connect to IRC for some support. The kind folks on IRC first told me that the boot code should have been present and that maybe I was doing something else wrong, which seemed entirely plausible to me. So I shut down once more and disconnected the old hard drive, but I was unable to find anything wrong in my steps. It was back to plan A: setting up the second drive from the working system.

But fate had other plans for me on this day. Of all the times, the original boot drive seemed to now fail entirely, and I had an unbootable system. So instead I decided to boot into the FreeBSD installer and figure out how to set this up manually from the shell.

That turned out to be fairly simple: format the EFI partition as FAT32 and copy over a couple of files. Alas, no worky. Well, maybe I need to manually set an EFI boot variable. Back into the installer, set that up, and try again. Still no worky. Bear in mind that each time I want to return to the installer, I need to adjust the boot order in my EFI firmware settings, because it wants to legacy boot by default.
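For anyone attempting the same thing, here is a rough sketch of the format-and-copy step as a dry-run command list. It is not a record of the exact commands from that day. The partition name `/dev/ada1p1`, the mount point `/mnt`, and the fallback loader path `EFI/BOOT/BOOTX64.efi` are all assumptions for illustration, and the sketch only prints the commands rather than running them. The EFI boot variable step is not covered.

```python
import shlex


def esp_setup_commands(esp_dev="/dev/ada1p1", mount_point="/mnt"):
    """Build the commands that format an EFI system partition as FAT32
    and copy the FreeBSD loader to the firmware's fallback path.

    esp_dev and mount_point are hypothetical; check `gpart show` for the
    real partition before running anything like this.
    """
    boot_dir = f"{mount_point}/EFI/BOOT"
    return [
        # Create a FAT32 filesystem on the EFI partition (destroys its contents).
        ["newfs_msdos", "-F", "32", "-c", "1", esp_dev],
        # Mount it so the loader can be copied in.
        ["mount", "-t", "msdosfs", esp_dev, mount_point],
        ["mkdir", "-p", boot_dir],
        # BOOTX64.efi is the default path amd64 firmware looks for.
        ["cp", "/boot/loader.efi", f"{boot_dir}/BOOTX64.efi"],
        ["umount", mount_point],
    ]


# Dry run: print the commands for review instead of executing them.
for cmd in esp_setup_commands():
    print(shlex.join(cmd))
```

Printing first and pasting by hand is deliberate here: `newfs_msdos` pointed at the wrong partition is not something you get to undo.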

It occurs to me at this point that although the original installation was set up for EFI booting, I may have actually been BIOS booting the whole time. FreeBSD is capable of BIOS booting even with a GPT partition scheme, and a legacy boot partition was present in the table. With that in mind, I boot back into the installer and manually install gptzfsboot into that partition (it was probably already installed, but I was grasping at straws at this point). Still, I couldn’t boot back into my system.
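Installing gptzfsboot comes down to a single `gpart bootcode` call, and the part that is easy to get wrong is the partition index. Below is a minimal sketch that finds the `freebsd-boot` index in `gpart show` output and builds the command. The sample output and the disk name `ada1` are made up for illustration and do not come from this machine.

```python
import shlex

# Illustrative `gpart show` output for a hypothetical disk.
SAMPLE_GPART_SHOW = """\
=>       40  976773088  ada1  GPT  (466G)
         40     532480     1  efi  (260M)
     532520       1024     2  freebsd-boot  (512K)
     534528    4194304     3  freebsd-swap  (2.0G)
    4728832  972044288     4  freebsd-zfs  (464G)
"""


def find_boot_index(gpart_show_text):
    """Return the index of the first freebsd-boot partition, or None."""
    for line in gpart_show_text.splitlines():
        fields = line.split()
        # Partition rows look like: <start> <size> <index> <type> (<size>)
        if len(fields) >= 4 and fields[2].isdigit() and fields[3] == "freebsd-boot":
            return int(fields[2])
    return None


def bootcode_command(disk, boot_index):
    """Build the command that writes the protective MBR and gptzfsboot."""
    return [
        "gpart", "bootcode",
        "-b", "/boot/pmbr",        # protective MBR boot code
        "-p", "/boot/gptzfsboot",  # ZFS-aware loader for the freebsd-boot partition
        "-i", str(boot_index),
        disk,
    ]


index = find_boot_index(SAMPLE_GPART_SHOW)
if index is not None:
    # Dry run: print the command rather than executing it.
    print(shlex.join(bootcode_command("ada1", index)))
```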

At this point, I also notice that not only is my disk not appearing as a boot device, but it’s not even appearing properly in the list of attached SATA devices in the firmware menu. It doesn’t say “not connected” like the empty slots do, but it’s just a blank name, and upon inspection, it thinks the drive has a capacity of 0.0GB! So I find the latest EFI firmware and install that to no avail.

So now I’m thinking that maybe this drive just has a compatibility issue with this motherboard’s firmware and decide to copy the contents onto a different drive that I know for certain will work. I noticed that my disk copying appliance took a long time to initialize, which was suspicious, but the copying procedure did appear to start, so I took a little break while that was running.

After a couple of hours copying, I put that drive into my server (not screwing it into place just yet), and although it still did not appear as a selectable boot device, it did appear properly in the list of SATA devices. Strangely, though, after exiting the EFI menu, despite not being present in the boot priority list, it actually booted up from the new disk!

Seizing this opportunity, I sshed in from my PC and started setting up the first new drive that I had intended to install. I set up its partition table and copied the EFI files and boot code, but I didn’t yet want to add it to the pool and start the resilvering process with the source drive just hanging loose in the case.

I shut down the machine, screwed the working boot drive into position, and in a second act of hubris, I put the lid back on and seated the machine back in the rack. I connected the power, and sure enough, it booted right up as expected.

At long last, I connected once more and was able to add the new drive to the pool, wait 16 minutes for resilvering, and remove the reference to the old disk from the pool configuration. I reactivated my automatic startups and manually started my jails and VMs.
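The attach, wait, detach sequence can be sketched as follows. The pool name `zroot` and the device names are hypothetical stand-ins, not the ones on this server, and the sketch only prints the commands. It assumes the new disk is attached alongside the healthy mirror member first, and the dead disk's entry is detached only after resilvering finishes.

```python
import shlex


def mirror_swap_commands(pool, healthy_dev, new_dev, old_dev):
    """Build (description, command) pairs for swapping a mirror member.

    All four arguments are placeholders; take the real names from
    `zpool status` on the machine in question.
    """
    return [
        ("attach the new disk as a mirror of the healthy one",
         ["zpool", "attach", pool, healthy_dev, new_dev]),
        ("check resilver progress; repeat until it reports completion",
         ["zpool", "status", pool]),
        ("drop the old disk's entry from the pool",
         ["zpool", "detach", pool, old_dev]),
    ]


# Dry run with made-up names: print each step for review.
for description, cmd in mirror_swap_commands("zroot", "ada0p4", "ada1p4", "ada2p4"):
    print(f"# {description}")
    print(shlex.join(cmd))
```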

That is the story of how I lost both disks in my RAID 1 pair and still managed to get back online without having to reinstall and restore from a backup.

Epilogue

This morning, it occurred to me to check and see if this machine was even booting in UEFI mode in the first place. Nope. So I could have skipped the whole EFI partition business altogether. Oh well. At least this is in theory bootable on an EFI-only motherboard should I need to switch.
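On FreeBSD the check itself is one sysctl, `machdep.bootmethod`, which reports either `UEFI` or `BIOS`. Here is a small sketch of wrapping it. The function names are my own, and the reader function is injectable so the logic can be exercised on a machine that is not running FreeBSD.

```python
import subprocess


def _read_sysctl(name):
    """Read a sysctl value as text (only works on the FreeBSD host itself)."""
    result = subprocess.run(
        ["sysctl", "-n", name],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def boot_method(read_sysctl=_read_sysctl):
    """Return how the running system was booted, e.g. "UEFI" or "BIOS"."""
    return read_sysctl("machdep.bootmethod").strip()


def booted_via_uefi(read_sysctl=_read_sysctl):
    """True only if the firmware handed off through UEFI."""
    return boot_method(read_sysctl) == "UEFI"
```

Calling `booted_via_uefi()` on the server before touching the EFI partition would have answered the question up front.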

The second drive from the original mirror pair also appears to have some deeper issues of its own. After I got my server back online, I decided I would wipe it and put it away for later use on a different system, but apparently the weird drive info was not just a motherboard compatibility problem. The software I was using to wipe it complained that the SATA drive info didn’t make any sense, and I saw that the drive appeared to be responding with all zeroes to any drive info query. I think something has gone wrong either on its control board or in some NVRAM that the firmware uses to store the drive info. I’m considering this one dead, too, even though apparently data can still be copied from it.