RedHat Kernel Crash Dumps
As part of Sun's new range of x64 (x86) servers, Sun now sells support for RedHat Enterprise Linux (RHEL) and SUSE Linux Enterprise Server (SLES). This is all well and good and certainly pushes Sun kit into "white box" territory, but we do have a bit of catching up to do on the knowledge side of things.
I've been identified as a good resource to train up as I've already got a fair amount of Linux knowledge. As part of this, I was lucky enough to go on a customised (for Sun) RHEL training course. Whilst I didn't learn much new, one of the things I did learn about completely amazed me - gathering crash dumps on RHEL. I tell you now, I wasn't amazed at how good it was; quite the contrary.
RHEL, prior to RHEL 4 update 1, only offered the ability to perform a "netdump". That's a crash dump that is transferred across the network to a dedicated "netdump server".
On the face of it this sounds like a good idea and allows for centralised admin; however, I see several major problems:
- It's not enabled by default (neither is it on FreeBSD, but that's off topic)
- Only a select few interfaces are supported
- If you need crash dumps for ALL your machines, you'll need at least 2 dedicated machines - the netdump server will need a netdump server too
- The crash dump is transferred to the remote machine on the way down
- The crash dump is transferred in clear text (a lot of it, but it's still clear)
- The crash dump is transferred using UDP, which makes no delivery guarantees
- There's no guarantee the dump will be complete if a hardware watchdog is involved, as the hardware may reset the machine before the transfer has finished.
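To put a rough number on that last point, here is a minimal Python sketch that compares how long a dump would take to cross the wire with how long a watchdog might wait before resetting the box. The function names, the 4 GiB dump size, the 100 Mbit/s link and the 60 second watchdog timeout are all assumptions of mine for illustration, not figures from RedHat or any particular server, and the sum ignores protocol overhead and lost packets entirely.

```python
def transfer_seconds(dump_bytes, link_bits_per_second):
    """Best-case time to push a dump across the network.

    Assumes the link runs flat out with no protocol overhead and no
    lost packets, so the real figure can only be worse.
    """
    return dump_bytes * 8 / link_bits_per_second


def completes_before_reset(dump_bytes, link_bits_per_second, watchdog_seconds):
    """True if the best-case transfer finishes before the watchdog fires."""
    return transfer_seconds(dump_bytes, link_bits_per_second) <= watchdog_seconds


if __name__ == "__main__":
    # Hypothetical values: 4 GiB of memory, 100 Mbit/s link, 60 s watchdog.
    dump = 4 * 2**30
    link = 100e6
    watchdog = 60

    print(f"best-case transfer: {transfer_seconds(dump, link):.1f} s")
    print(f"finishes before reset: {completes_before_reset(dump, link, watchdog)}")
```

With those assumed numbers the transfer needs just under six minutes in the best case, so the reset would win comfortably.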
I can't help but feel this approach is very much a "crash and pray" way of doing things. There are too many unreliable factors involved. If you enable crash dumps, you want to know why your machine crashed and for that you need a crash dump.
RedHat's rationale for choosing netdump over a disk-based dump is that, unlike traditional UNIX vendors, they don't have control over the varying physical devices involved. Fair enough, but this doesn't seem to be a problem for the BSD distros, Solaris x86 or even SLES (SLES 8 & 9 use LKCD).
I'm not a kernel programmer, but surely if you can actually boot the OS off the disk in order for it to panic in the first place, you have enough control over the device to dump memory to swap on the way down and save the contents of swap to the disk on the way up.
Dumping to disk does have its problems and is also susceptible to corruption. However, it's more secure, more reliable, doesn't require additional hardware and will capture full dumps for hardware watchdog events on most occasions.
If you've got to account for every event on a secure, business-critical machine, I can't help but think the risks associated with a dump to disk are significantly lower than those associated with a dump to a network device.
RHEL 4 update 1 now ships with diskdump to allow dump to disk.