The ‘check_yum’ nagios plugin can starve your target host if a run is interrupted and its lock file is never cleared.

Mystery of hanging processes

At Kaaos Unlimited Oy, we run icinga 1.x to monitor our hosts. Last week I noticed that the ‘check for updates’ check kept failing or timing out on one of our hosts. The host also started to show signs of performance degradation, and its load was slowly climbing. It got so bad that even logging in via SSH became impossible.

Forcefully rebooting the server was successful, and for a while things seemed to run OK, except that icinga still timed out on ‘check for updates’. Fast forward a couple of hours and I finally had some time to investigate.

Investigation

Not really knowing where to start, I ran ps -aux, which immediately showed me several ‘check_yum’ processes running. I knew this shouldn’t be the case. BTW, I had always imagined that icinga/nagios plugins running on the target host would eventually time out on their own. You always learn…

I killed the dangling processes with sudo pkill -9 yum and then tried running the check manually with sudo -u nagios check_yum.

Nothing. No output, the command just hung there. But Ctrl+C worked without issues and I was back on the command line.

Back in the C/C++ days I had stumbled upon a nice little tool called strace. Being as lazy as I am, I decided to try that tool out before going through all the yum-related logs.

$ sudo -su nagios
$ strace check_yum
...
poll([{fd=5, events=POLLIN|POLLPRI}], 1, -1) = 1 ([{fd=5, revents=POLLIN}])
read(5, "Existing lock /var/tmp/yum-nagio"..., 4096) = 332
poll([{fd=5, events=POLLIN|POLLPRI}], 1, -1) = 1 ([{fd=5, revents=POLLIN}])
read(5, "Another app is currently holding"..., 4096) = 236

The last four lines kept repeating.

Well, that was surprisingly easy! A lock file (among other things) was in the directory /var/tmp/yum-nagios-2pv9qy/x86_64/7.
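A side note on reading the trace: the strings in the read() calls are cut off because strace prints only the first 32 bytes of each string by default. A small wrapper like the one below would show the full messages. The function name trace_io is made up for this post; the -f, -s and -e trace= options are standard strace options.

```shell
#!/bin/sh
# trace_io COMMAND [ARGS...]   (hypothetical helper name)
# Runs COMMAND under strace, limited to the two syscalls that showed up in
# the hang, with longer strings so the lock message is not truncated.
trace_io() {
    # -f        follow child processes (the plugin may spawn yum itself)
    # -s 512    print up to 512 bytes of each string instead of the default 32
    # -e trace= only show the listed syscalls
    strace -f -s 512 -e trace=read,poll "$@"
}
```

It would be used as trace_io check_yum, from the same nagios shell as in the trace above.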

Now, I speculate that at some stage a ‘check_yum’ process had been killed before it could remove the lock file, causing subsequent ‘check_yum’ runs to hang forever.
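One way to test that theory before deleting anything is to check whether the process named in the lock file is still alive. The sketch below assumes the lock file contains the PID of its holder on the first line; I did not verify the exact file name or format that yum uses in this directory, so treat both as assumptions. The function name lock_is_stale is made up.

```shell
#!/bin/sh
# lock_is_stale FILE   (hypothetical helper name)
# Exit 0 if FILE holds a PID that no longer belongs to a running process.
# Exit 1 if the holder is alive, or if FILE is missing or holds no PID.
# Assumption: the first line of FILE is the PID of the lock holder.
lock_is_stale() {
    pid=$(head -n 1 "$1" 2>/dev/null | tr -d '[:space:]')
    case $pid in
        ''|*[!0-9]*) return 1 ;;   # unreadable or not a number: do not guess
    esac
    # kill -0 sends no signal, it only checks that the process exists.
    # The /proc test covers a live process owned by another user (Linux only).
    if kill -0 "$pid" 2>/dev/null || [ -e "/proc/$pid" ]; then
        return 1
    fi
    return 0
}
```

If lock_is_stale returns success for the lock file, nothing is holding the lock any more and removing it should be safe.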

Solution

$ sudo rm -Rf /var/tmp/yum-nagios-2pv9qy
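
Removing the directory clears the stale lock, but it does not stop hung checks from piling up if this happens again. A possible safeguard, which I have not deployed, is to run the plugin under timeout from GNU coreutils and report a timeout as UNKNOWN. The function name run_with_timeout and the 60-second limit are made up; exit status 124 on timeout and plugin exit code 3 for UNKNOWN are standard.

```shell
#!/bin/sh
# run_with_timeout SECONDS COMMAND [ARGS...]   (hypothetical helper name)
# Runs COMMAND, but gives up after SECONDS and reports UNKNOWN (exit 3),
# so a check that blocks on a lock cannot hang around forever.
run_with_timeout() {
    limit=$1
    shift
    status=0
    timeout "$limit" "$@" || status=$?
    # GNU timeout exits with 124 when the time limit was reached.
    if [ "$status" -eq 124 ]; then
        echo "UNKNOWN: '$1' did not finish within ${limit}s"
        return 3
    fi
    # Otherwise pass the plugin's own exit code through unchanged.
    return "$status"
}
```

It would be called as run_with_timeout 60 check_yum. By default timeout sends SIGTERM, which at least gives the process a chance to clean up; I have not checked whether check_yum removes its lock when it receives that signal.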