Most computer users know the situation: a system or service that has worked for months without problems suddenly stops working. I want to tell you a small war story that happened to us a few weeks ago.
We maintain our own mail server with IMAP access for our employees. That allows for relatively easy server-side spam protection, mailing list management, archiving and so on. We use the trusted combination of Postfix, Cyrus and Mailman for the task, and everything works very reliably. Then suddenly we got the error message
ssl_error_rx_record_too_long in our e-mail clients. Nothing had changed software-wise. Googling brought up all kinds of obscure reasons for this error, but no explanation of why something like that would happen out of thin air.
Fortunately, looking in Cyrus’ log files quickly showed the reason: the hard drive was full! Two days before the mail system failed, there had been a large upload to the server for sharing files with a colleague. This upload fit almost exactly into the free disk space, and a few mails later the disk was full. It was a real Murphy’s law situation, because a few kilobytes less free space would have made the file sharing fail with a sensible error message. Instead, the upload succeeded and the mail system failed later, without any obvious connection to a change on the server.
There are some lessons to be learned here:
- Aside from file managers, most applications assume memory and disk space are unlimited. If they do hit such a limit, they usually fail miserably with completely bogus errors.
- Monitor critical resources on important systems so that you receive warnings before important services fail. Tools like Nagios can help here.
- Try to be aware of the side effects of your actions. Separating services onto different machines may help to reduce unexpected side effects on seemingly unrelated things. We used the server to run many different unrelated services.
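
To make the monitoring lesson a bit more concrete, here is a minimal sketch of a disk space check in Python. It is not what we actually run, and it is no replacement for a real monitoring tool. The function names, the checked path "/" and the 10% threshold are all assumptions chosen for illustration. The non-zero exit code on a warning loosely follows the convention used by Nagios plugins.

```python
import shutil
import sys


def free_fraction(total_bytes, free_bytes):
    """Return the free share of a file system as a value between 0 and 1."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return free_bytes / total_bytes


def check_disk(path="/", min_free=0.10):
    """Check the file system containing `path`.

    Returns (ok, message). `min_free` is a hypothetical threshold:
    the minimum free share we are willing to tolerate.
    """
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    fraction = free_fraction(usage.total, usage.free)
    ok = fraction >= min_free
    status = "OK" if ok else "WARNING"
    message = "{}: {} has {:.1%} free ({} of {} bytes)".format(
        status, path, fraction, usage.free, usage.total
    )
    return ok, message


if __name__ == "__main__":
    # Assumed path and threshold; adjust to the partition holding the mail spool.
    ok, message = check_disk("/", min_free=0.10)
    print(message)
    sys.exit(0 if ok else 1)
```

Run regularly, for example from cron with the output mailed to an administrator, even a small check like this would have warned us days before the mail system failed.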