2014-08-23

mdadm + lvm + dead harddisk = cluster-f-up

So another rant and this time it's not about ssl! I promise! really!

So there were two nice disks in my server. What do we do with disks in servers? raid of course! Data must be safe and all... But since this server is on the cheap side of things, it doesn't have a (true) raid controller. So what do we use anyway? mdadm of course. and lvm on top, just for good measure. Actually we used it to separate some stuff like user homes, certain data, just so we don't wake up one day with our server unable to boot because of some runaway 300GB log file...
Then one of the disks died. But hey, we have a raid, right? One support ticket and five minutes later the server has a new disk and isn't booting. Well, the bootloader was on the first disk, so fair enough, let's sync the boot partition (on a separate raid) first and let's do the rest later without downtime.
me: So how does one go about removing an absent disk from a software raid?
mdadm: NO.
me: What?
mdadm: NO. JUST NO
me: But the disk isn't coming back, I just want to...
mdadm: NO.
me: okay, mdadm if you don't want to remove it fine, just take another one on top, will you?
mdadm: Fine, but I will only take it as spare :P
me: enough already, I'm increasing your active disk number to 3.
mdadm: Wait, I'm confused, now I have 1 working disk, 2 disks with the same id as the working one, but they are missing, and 1 spare.
me: wtf?!
me: internet to the rescue! *google*
internet: erm, well, nope.
me: really?
internet: well, you could just re-create the array and then add the new disk. only solution at all.
me: seems kind of dangerous, but everyone's talking about no data loss
one reboot later...
me: raid: go!
raid: ok
me: lvm go!
...
me: lvm?
...
me: god dammit!
Some hours later I was looking at binary dumped raid headers in vim in hex mode, diffed side by side. Let that sink in for a moment. There were some insignificant differences between the working version without data from the new spare disk and the not working version with data from the old disk. Differences like creation date or uuid. Nothing that should stop a one-disk-present-one-disk-missing-newly-created raid from spewing up its contained lvm.
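For anyone who would rather not squint at two hex dumps in vim: the side-by-side comparison can be sketched in a few lines of Python. This is only a rough illustration, not what I actually ran. The function name `diff_ranges` is made up, and it assumes you already have both headers dumped into byte strings.

```python
def diff_ranges(a: bytes, b: bytes):
    """Return (start, end) byte ranges (end exclusive) where a and b differ."""
    ranges = []
    start = None
    common = min(len(a), len(b))
    for i in range(common):
        if a[i] != b[i]:
            if start is None:
                start = i          # a new differing run begins here
        elif start is not None:
            ranges.append((start, i))  # the run ended at the previous byte
            start = None
    if start is not None:
        ranges.append((start, common))
    if len(a) != len(b):
        # whatever one buffer has beyond the other counts as a difference
        ranges.append((common, max(len(a), len(b))))
    return ranges
```

Feed it the two dumped headers and you get a short list of byte ranges to look up, instead of scrolling through pages of identical hex.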
After that I dumped the first 500MB of the disk, vim, hex mode again, and looked for the lvm header itself. First good message that day: it was present. I could even pinpoint it to the byte on disk. However, pvck (an lvm fsck-like tool that checks for lvm headers) did not find it, no matter how much I told it where to look.
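The hunt for the lvm header can also be done without vim. Below is a minimal sketch in Python, assuming the dump is already loaded into memory as bytes. It relies on the LVM2 on-disk label layout as I understand it: a sector starting with the signature `LABELONE`, with the type string `LVM2 001` at byte 24 of that sector. The function name `find_lvm_labels` and the 512-byte sector size are my own assumptions, not anything taken from the lvm tools.

```python
from typing import List

SECTOR = 512                 # assumed sector size
LABEL_MAGIC = b"LABELONE"    # signature at the start of an LVM2 label sector
LABEL_TYPE = b"LVM2 001"     # label type field, at byte 24 of the label sector


def find_lvm_labels(dump: bytes, sector_size: int = SECTOR) -> List[int]:
    """Return byte offsets of sectors in dump that look like LVM2 PV labels."""
    hits = []
    # Only check sector boundaries; a label never starts in mid-sector.
    for offset in range(0, len(dump) - sector_size + 1, sector_size):
        sector = dump[offset:offset + sector_size]
        if sector[:8] == LABEL_MAGIC and sector[24:32] == LABEL_TYPE:
            hits.append(offset)
    return hits
```

The returned offsets are relative to the start of the dump. As far as I know, the lvm tools only look for the label in the first few sectors of the device they are given, so a hit far away from where the raid device starts would at least fit with pvck coming up empty. That part is a guess, though.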
Well that's as far as I got before saying "screw it, I'm using btrfs with integrated raid and subvolumes and I'll save the data later".
Then I reinstalled Gentoo, quite smoothly.
Except for ssl certificates of course, they still suck, but I promised not to talk about that again. This time.
