
Too many interruptions June 7, 2010

Posted by vneil in Linux, VMware.

The other day I noticed a strange problem: a few of our Linux virtual machines had raised High CPU alarms in vCenter. The servers themselves were completely idle; they had been installed, but no applications were running on them. The CPU statistics inside Linux showed 99% idle, yet the performance tab in vCenter showed the virtual machine at over 95% CPU busy. Then I looked at the interrupts per second counter from mpstat:

# mpstat 5 5
Linux 2.6.16.60-0.21-smp (xxxx) 	05/31/10
CPU   %user   %nice    %sys   %steal   %idle    intr/s
all    0.00    0.00    0.60    0.00   99.40   154596.40

That is a lot of interrupts (intr/s) for an idle system. Compare it with a ‘normal’ busy system:

$ mpstat 5
Linux 2.6.16.60-0.21-smp (xxxx) 	06/03/2010
CPU   %user   %nice    %sys   %steal   %idle    intr/s
all   81.24    0.00   18.36    0.00    0.00     9832.73
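
To compare many machines without running mpstat by hand, the same figure can be approximated from /proc/stat, whose `intr` line starts with the total number of interrupts serviced since boot. The following is only a minimal sketch: the function names and the 5-second default interval are my own choices for illustration, not anything mpstat itself provides.

```python
import time


def total_interrupts(proc_stat_text):
    """Return the total interrupt count from the text of /proc/stat.

    The first number on the 'intr' line is the sum of all interrupts
    serviced since boot.
    """
    for line in proc_stat_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "intr":
            return int(fields[1])
    raise ValueError("no 'intr' line found")


def interrupts_per_second(before, after, interval_seconds):
    """Average interrupt rate between two samples of the total count."""
    return (after - before) / interval_seconds


def sample_rate(interval_seconds=5.0, path="/proc/stat"):
    """Read the counter twice, interval_seconds apart, and return the rate.

    Hypothetical helper: only works on a Linux system with /proc mounted.
    """
    with open(path) as f:
        before = total_interrupts(f.read())
    time.sleep(interval_seconds)
    with open(path) as f:
        after = total_interrupts(f.read())
    return interrupts_per_second(before, after, interval_seconds)
```

Calling `sample_rate()` on an affected VM should give a number in the same range as the intr/s column above, although the two tools will not agree exactly because they sample at different moments.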

We have around 120 Linux virtual machines, and I was seeing this on only about 5 of them. From the procinfo output it is easy to see that the interrupts are timer interrupts (irq 0):

# procinfo 
Linux 2.6.16.60-0.21-smp (geeko@buildhost) (gcc 4.1.2 20070115) #1 SMP Tue May 6 12:41:02 UTC 2008 1CPU [xxxxx]

Memory:      Total        Used        Free      Shared     Buffers      
Mem:       3867772      494576     3373196           0       11228
Swap:      4194296           0     4194296

Bootup: Mon May 31 14:24:38 2010    Load average: 0.00 0.16 0.18 2/152 5427

user  :       0:00:19.94   2.8%  page in :    1054996  disk 1:    26937r    7117w
nice  :       0:00:05.30   0.7%  page out:      28856  disk 2:      124r       0w
system:       0:00:20.78   2.9%  page act:      30744
IOwait:       0:01:28.12  12.2%  page dea:          0
hw irq:       0:00:01.38   0.2%  page flt:    1514598
sw irq:       0:00:01.32   0.2%  swap in :          0
idle  :       0:09:32.81  79.6%  swap out:          0
uptime:       0:11:59.66         context :     297632

irq  0: 101039309 timer                 irq  9:         0 acpi                 
irq  1:         9 i8042                 irq 12:       114 i8042                
irq  3:         1                       irq 14:      5660 ide0                 
irq  4:         1                       irq169:     30998 ioc0                 
irq  6:         5                       irq177:      1113 eth0                 
irq  8:         0 rtc                   irq185:      2480 eth1 
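
As a rough cross-check, dividing the irq 0 count by the uptime from the same procinfo output gives the average timer interrupt rate since boot. The sketch below does that arithmetic; it assumes the uptime is in the `hours:minutes:seconds` form shown above (no day component), and the function name is just for illustration.

```python
def uptime_seconds(uptime):
    """Convert an 'H:MM:SS.ss' uptime string into seconds."""
    hours, minutes, seconds = uptime.split(":")
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)


# Values copied from the procinfo output above.
timer_interrupts = 101039309   # irq 0 (timer) count since boot
uptime = "0:11:59.66"          # uptime reported by procinfo

rate = timer_interrupts / uptime_seconds(uptime)
print(f"{rate:.0f} timer interrupts per second on average since boot")
```

This comes out at roughly 140,000 per second, the same order of magnitude as the intr/s figure mpstat reported on the idle system, so the timer alone accounts for almost all of the interrupts.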

I picked one and rebooted it from the command line in Linux. After the reboot it still showed the same symptom: a very high intr/s. Shutting down all services and daemons didn’t help either. I then shut down and powered off the VM, and after powering it back on it was OK: the high interrupt rate had stopped.

We then experimented with a vMotion of one of the systems that had a high intr/s, and this also seemed to work. “Good!” we thought, “a non-interruptive fix.” But when we tried a vMotion on another server, it didn’t fix the problem. Then trying a vMotion again on that same server fixed it?!

There seems to be no pattern to what fixes this: a cold reboot (a full power off and power on) works, and multiple vMotions sometimes help. All of the affected systems were SLES servers with SP2 running the SMP Linux kernel, even though most (but not all) of them had only one vCPU.

Definitely a strange one. I’ll keep an eye on this.