The reliability of hardware can unfortunately be an unpredictable phenomenon, and it seems I am bitten once again. About a month ago, my Slackware box started ‘going nuts’ every few days, for lack of a better term; network services would no longer launch, common commands could not be found, and directory listings would produce garbage.
(Warning: boring techno-geekery and personal woe to follow.)
A reboot would usually clear things up again, albeit with some minor filesystem corruption and a few lost files, driving home the point that RAID is not a replacement for backups (fortunately my most important data is synced to a second system). Eventually though, I just had to shut the whole system down as there was no point in letting it gradually eat away at my files and spending hours going through the filesystem repair.
So, the next step of course was diagnosis, since I didn’t want my system sitting there unpowered and doing nothing, either. The first clue was that ever since the box had been set up I’d been getting BadCRC warning messages on the third drive. That usually means the cable might be a bit faulty or out-of-spec, but the warnings were rare and are pretty much harmless in isolation, since the driver just retries the transfer. It was possible that the cable had deteriorated to the point where the problem was no longer isolated though, so I replaced the cable to the third drive, used the closer connector on it, and switched interface ports on the IDE card as well.
After bringing the system back up, it ran perfectly for five days. And then ‘went nuts’ again, eating another large handful of files.
Looking back through the logs for more clues, there were more IDE error messages, but not of the same BadCRC type; these were DMA timeouts and resets instead, again on the third drive. The third drive was definitely looking suspicious, but was it the drive, or its controller? In order to keep each drive on its own channel, the third drive was attached to a PCI controller card rather than the motherboard controller channels, and it’s possible the card was failing.
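For anyone wanting to do the same kind of log digging, here is a rough Python sketch that tallies IDE error messages by device and type, so a shift from BadCRC warnings to DMA timeouts and resets stands out. The sample lines, the device names (`hde`, `ide2`) and the matching rules are assumptions for illustration only, not copied from my actual logs; the exact wording depends on the kernel and driver.

```python
import re
from collections import Counter

# Matches a drive (hda, hde, ...) or an interface (ide0, ide2, ...) prefix.
DEVICE = re.compile(r"\b(hd[a-z]|ide\d+):")

def classify(line):
    """Return a rough error category for a log line, or None."""
    lower = line.lower()
    if "badcrc" in lower:
        return "BadCRC"
    if "dma timeout" in lower:
        return "DMA timeout"
    if "reset" in lower:
        return "reset"
    return None

def tally_ide_errors(lines):
    """Count (device, category) pairs across an iterable of log lines."""
    counts = Counter()
    for line in lines:
        device = DEVICE.search(line)
        kind = classify(line)
        if device and kind:
            counts[(device.group(1), kind)] += 1
    return counts

# Made-up sample lines in the general style of kernel IDE messages.
SAMPLE = [
    "Jan  1 00:00:01 box kernel: hde: dma_intr: error=0x84 { DriveStatusError BadCRC }",
    "Jan  1 00:00:02 box kernel: hde: DMA timeout error",
    "Jan  1 00:00:03 box kernel: ide2: reset: success",
    "Jan  1 00:00:04 box kernel: hda: some unrelated message",
]

if __name__ == "__main__":
    for (device, kind), n in sorted(tally_ide_errors(SAMPLE).items()):
        print(device, kind, n)
```

To run it against a real log, pass an open file instead of `SAMPLE`, for example `tally_ide_errors(open("/var/log/messages"))`, adjusting the path to wherever your system keeps kernel messages. Note that resets tend to be reported against the interface rather than the drive, so it still takes a little thought to tie them back to a particular disk.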
Unfortunately it’s not always clear which part is at fault. The SMART diagnostics on the drive itself were clean, but they don’t catch all possible errors either. That was all I had to go on though, so on the assumption that the drive was fine and the controller was faulty, I moved the third drive to one of the motherboard channels, and it’s been running fine for ten days now, with no CRC errors either.
Hopefully that’s the end of it for now, as I really don’t want to have to start replacing drives or even the whole system. I don’t generally buy hardware specifically for this box; it inherits hand-me-downs from the gaming box, and I’m not quite ready to upgrade that one yet.
I miss having a second system… Since the server died, all I have left are three 386s, one 75MHz Prolinea, and an aged IBM that *would* be useful if it weren’t for the fact that it doesn’t recognize CD drives of any speed or kind (yeah… I don’t get that one, either; HDDs work, but not CD-ROMs).
My crates of spares and hand-me-downs lack the most often-needed parts for old systems, and the few “complete” systems I own fell below even my minimally usable standards about two years ago.
In other words, I sympathize. :-)