Too many interruptions June 7, 2010
Posted by vneil in Linux, VMware.
I noticed a strange problem the other day: a few of our Linux virtual machines had High CPU alarms in vCenter. The servers themselves were completely idle; they had been installed, but no applications were running. The CPU stats in Linux showed 99% idle, yet the performance tab in vCenter for the virtual machine showed over 95% CPU busy. Then I looked at the interrupts per second counter from mpstat:
# mpstat 5 5
Linux 2.6.16.60-0.21-smp (xxxx)   05/31/10

CPU   %user   %nice    %sys  %steal   %idle     intr/s
all    0.00    0.00    0.60    0.00   99.40  154596.40
That is a lot of interrupts (intr/s) for an idle system. Compare this to a ‘normal’ busy system:
$ mpstat 5
Linux 2.6.16.60-0.21-smp (xxxx)   06/03/2010

CPU   %user   %nice    %sys  %steal   %idle     intr/s
all   81.24    0.00   18.36    0.00    0.00    9832.73
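For checking many machines at once, the same figure can be approximated without mpstat. The following is only a minimal sketch, assuming a Linux guest whose /proc/stat has the usual "intr" line with the total interrupt count since boot as its first number; the function names are made up for this illustration.

```python
import time


def total_interrupts(stat_text):
    """Return the total interrupt count found in the text of /proc/stat."""
    for line in stat_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "intr":
            # The first number after "intr" is the total since boot;
            # the remaining numbers are per-IRQ counters.
            return int(fields[1])
    raise ValueError("no 'intr' line found")


def interrupts_per_second(before, after, seconds):
    """Average interrupt rate between two samples of the total counter."""
    return (after - before) / seconds


def sample(interval=5.0, path="/proc/stat"):
    """Read the counter twice, `interval` seconds apart, and return intr/s."""
    with open(path) as f:
        first = total_interrupts(f.read())
    time.sleep(interval)
    with open(path) as f:
        second = total_interrupts(f.read())
    return interrupts_per_second(first, second, interval)
```

Calling sample() on a guest returns one number comparable to the intr/s column above, so a value in the hundred-thousands on an idle machine would flag it as affected.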
We have around 120 Linux virtual machines, and I was only seeing this on about 5 of them. Looking at the output from procinfo, it is easy to see that the interrupts are timer interrupts (irq 0):
# procinfo
Linux 2.6.16.60-0.21-smp (geeko@buildhost) (gcc 4.1.2 20070115) #1 SMP Tue May 6 12:41:02 UTC 2008 1CPU [xxxxx]

Memory:      Total        Used        Free      Shared     Buffers
Mem:       3867772      494576     3373196           0       11228
Swap:      4194296           0     4194296

Bootup: Mon May 31 14:24:38 2010    Load average: 0.00 0.16 0.18 2/152 5427

user  :       0:00:19.94   2.8%  page in :  1054996  disk 1:    26937r    7117w
nice  :       0:00:05.30   0.7%  page out:    28856  disk 2:      124r       0w
system:       0:00:20.78   2.9%  page act:    30744
IOwait:       0:01:28.12  12.2%  page dea:        0
hw irq:       0:00:01.38   0.2%  page flt:  1514598
sw irq:       0:00:01.32   0.2%  swap in :        0
idle  :       0:09:32.81  79.6%  swap out:        0
uptime:       0:11:59.66         context :   297632

irq  0: 101039309 timer                 irq  9:         0 acpi
irq  1:         9 i8042                 irq 12:       114 i8042
irq  3:         1                       irq 14:      5660 ide0
irq  4:         1                       irq169:     30998 ioc0
irq  6:         5                       irq177:      1113 eth0
irq  8:         0 rtc                   irq185:      2480 eth1
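As a rough cross-check, dividing the irq 0 count by the uptime gives the average timer interrupt rate since boot. This is a small sketch using only the two figures from the procinfo output above; the helper names are illustrative, and the parser assumes the plain h:mm:ss.ss uptime form shown here (no day component).

```python
def parse_uptime(text):
    """Convert an h:mm:ss.ss uptime string (as printed above) to seconds."""
    hours, minutes, seconds = text.split(":")
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)


def average_rate(count, uptime_seconds):
    """Average events per second over the whole uptime."""
    return count / uptime_seconds


uptime = parse_uptime("0:11:59.66")           # just under 12 minutes
timer_rate = average_rate(101039309, uptime)  # irq 0 (timer) count
print(f"timer: {timer_rate:,.0f} interrupts/s averaged since boot")
```

The result comes out at roughly 140,000 timer interrupts per second, the same order of magnitude as the intr/s value mpstat reported, which supports the reading that nearly all of the interrupts are timer interrupts.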
I picked one and just rebooted it from the command line in Linux. After the reboot it still showed the same symptoms: a very high intr/s. Shutting down all services and daemons didn’t help either. I then shut down and powered off the VM, and after powering it back on it was OK; the high interrupts per second had stopped.
We then experimented with a vMotion of one of the systems that had a high intr/s, and this also seemed to work. “Good!” we thought, a non-interruptive fix. But when we tried a vMotion on another server, it didn’t fix the problem. Then a second vMotion of that same server fixed it?!
There seems to be no pattern to fixing this: a cold reboot works, and multiple vMotions sometimes help. All these systems were SLES servers with SP2 running the SMP Linux kernel, even though most (but not all) had only one vCPU.
Definitely a strange one. I’ll keep an eye on this.
Could that be a timer interrupt issue? Maybe the timer frequency is too high for a VM? The default is often 1000 Hz for many distros 😦
Try booting with “divider=10” to lower it to 100 Hz, which is plenty for a VM 🙂
Cheers,
Didier
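The arithmetic behind that suggestion can be sketched in a few lines. This assumes the 1000 Hz default mentioned in the comment (the actual tick rate of the SLES kernel in the post is not stated) and compares it with the intr/s figure from the idle VM; the function name is illustrative.

```python
def effective_tick_rate(hz, divider):
    """Timer tick rate after applying a divider= boot parameter."""
    return hz / divider


assumed_hz = 1000                              # assumption taken from the comment
reduced = effective_tick_rate(assumed_hz, 10)  # 1000 / 10 = 100 ticks per second
observed = 154596.40                           # intr/s mpstat showed on the idle VM

print(f"tick rate with divider=10: {reduced:.0f} Hz")
print(f"observed rate is about {observed / assumed_hz:.0f}x the assumed 1000 Hz")
```

Even against a 1000 Hz tick, the observed rate is two orders of magnitude too high, so a lower tick rate reduces the load but does not by itself explain where the extra interrupts come from.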
Thanks Didier, I saw that parameter mentioned in KB1006427 with regard to RedHat systems and thought about trying it. The problem, of course, is that which servers it affects seems to be a bit random. At the moment we don’t have any problems 🙂
I will carry on investigating and maybe post a follow-up article if I find anything.
Cheers,
Neil.
divider=10 is the workaround, as was mentioned. I have seen this at several sites and it fixed the problem. The SR says it will be fixed in 4.1.
Hi Rod,
When you say 4.1, do you mean vSphere 4.1?
Cheers,
Neil.
It seems this is fixed in ESXi 4.0 U2. We haven’t seen any occurrences since upgrading.