vm.panic_on_oom not quite working

Postby gaima » Thu Mar 18, 2010 8:37 am

Hi guys,

Following on from the excellent work done by "PaX Team" and spender, I'm happily using a grsec/PaX-enabled kernel as a Xen PV guest.
However, I have one problem with it.

We have these sysctl values set on our servers, so the HA will kick in on any OOM or kernel panic situation.

Code:
kernel.panic = 1
kernel.panic_on_oops = 1
vm.panic_on_oom = 2


It's better for us if the HA slave takes over than to have the master alive but unusable.
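
A quick way to confirm that these settings actually took effect on a running box is to read them back from /proc/sys. The sketch below is only an illustration: the helper names (sysctl_path, check_sysctls) and the root parameter are made up here and do not come from any existing tool. For reference, kernel.panic = 1 asks for a reboot one second after a panic, kernel.panic_on_oops = 1 turns an oops into a panic, and vm.panic_on_oom = 2 forces a panic on every OOM, including ones confined to a cpuset or memory policy.

```python
from pathlib import Path

# Values from the post above; adjust to your own policy.
EXPECTED = {
    "kernel.panic": "1",
    "kernel.panic_on_oops": "1",
    "vm.panic_on_oom": "2",
}

def sysctl_path(name, root="/proc/sys"):
    """Map a dotted sysctl name to its file under /proc/sys."""
    return Path(root) / name.replace(".", "/")

def check_sysctls(expected=EXPECTED, root="/proc/sys"):
    """Return {name: current_value} for every sysctl that differs.

    A value of None means the file could not be read.
    """
    mismatches = {}
    for name, want in expected.items():
        try:
            have = sysctl_path(name, root).read_text().strip()
        except OSError:
            have = None
        if have != want:
            mismatches[name] = have
    return mismatches

if __name__ == "__main__":
    for name, have in check_sysctls().items():
        print(f"{name}: expected {EXPECTED[name]}, found {have}")
```

An empty result means all three values match. Note that this only checks the configuration; it says nothing about whether the reboot after the panic actually happens.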

Even though they're set to reboot, all our 2.6.21-xen machines just die if they run out of RAM or panic. Mildly annoying, but the service keeps running, and the monitoring quickly and blindingly obviously shows the machine is down.
I've got a 2.6.31-xen kernel floating around, and that does actually cause the VM to reboot.
My problem is that the 2.6.32-hardened (grsecurity-2.1.14-2.6.32.8-201002200811) kernel neither reboots nor dies!

Code:
[   87.604075] Kernel panic - not syncing: out of memory. Compulsory panic_on_oom is selected.
[   87.604078]
[   87.604180] Rebooting in 1 seconds..


And it just sits there; "xm list" on the dom0 shows it consuming CPU time at about 1.3s per second (dom0 is a dual quad-core).
The VM is completely unusable, but still responds to pings, which is precisely the worst way it could possibly fail.
"xm shutdown" doesn't do anything either; I have to destroy it.

I have CONFIG_GRKERNSEC_HIGH=y set, Xen logs nothing, and "xm dmesg" is unchanged.

Can anyone help me out here?

Thanks
Mike
gaima
 
Posts: 27
Joined: Fri Feb 12, 2010 12:17 pm

Re: vm.panic_on_oom not quite working

Postby PaX Team » Thu Mar 18, 2010 9:33 am

gaima wrote:And it just sits there, "xm list" on the dom0 shows it consuming CPU time at about 1.3s per second (dom0 is a dual quad-core).
The VM is completely unusable, but still responds to pings. Which is precisely the worst way it could possibly fail.
xm shutdown doesn't do anything either, I have to destroy it.
the guest kernel must be spinning in some code, so we should figure out where it is exactly. since i'm not familiar with 'remote' xen guest kernel debugging, can someone give us some advice as to how to obtain a register/backtrace dump from such a stuck kernel? the alternative and slow method would be to add printk's to the reboot path and see how far we get, but that's tedious, not sure you have the time and motivation for it ;).
PaX Team
 
Posts: 2310
Joined: Mon Mar 18, 2002 4:35 pm

Re: vm.panic_on_oom not quite working

Postby gaima » Thu Mar 18, 2010 11:36 am

PaX Team wrote:the guest kernel must be spinning in some code, so we should figure out where it is exactly. since i'm not familiar with 'remote' xen guest kernel debugging, can someone give us some advice as to how to obtain a register/backtrace dump from such a stuck kernel? the alternative and slow method would be to add printk's to the reboot path and see how far we get, but that's tedious, not sure you have the time and motivation for it ;).


With magic sysrq enabled you can send sysrq commands to a VM with xm (xm sysrq).
Unfortunately when it's spinning even sysrq commands are ignored/lost.
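
For anyone following along, asking a guest for a backtrace from dom0 looks roughly like the sketch below. It only wraps "xm sysrq <domain> <letter>"; the domain name guest1 and the helper names are placeholders invented for this example, and "l" is the SysRq key for "show backtrace of all active CPUs". Magic SysRq has to be enabled in the guest (kernel.sysrq) for this to do anything.

```python
import subprocess

# SysRq key for "Show backtrace of all active CPUs".
BACKTRACE_ALL_CPUS = "l"

def sysrq_command(domain, letter):
    """Build the dom0 command line that sends one SysRq key to a guest."""
    if len(letter) != 1:
        raise ValueError("SysRq key must be a single character")
    return ["xm", "sysrq", domain, letter]

def send_sysrq(domain, letter):
    """Run the command on dom0; raises CalledProcessError if xm fails."""
    subprocess.run(sysrq_command(domain, letter), check=True)

# Example (run on dom0; "guest1" is a placeholder domain name):
#   send_sysrq("guest1", BACKTRACE_ALL_CPUS)
```

The output lands in the guest's kernel log, so the guest console needs to be captured somewhere to see it.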

I did get this though when asking for a backtrace while the VM was actually running.

Code:
[   84.265510] SysRq : Show backtrace of all active CPUs
[   84.265552] BUG: unable to handle kernel paging request at ffffffffff5fc310
[   84.265563] IP: [<ffffffff8101df1b>]
[   84.265571] PGD 136c067 PUD 136e067 PMD 14db067 PTE 0
[   84.265583] Oops: 0002 [#1] SMP
[   84.265591] last sysfs file: /sys/devices/virtual/block/md0/dev
[   84.265597] CPU 2
[   84.265603] Modules linked in: ipv6 nfsd lockd auth_rpcgss sunrpc exportfs usbcore dm_mod
[   84.265627] Pid: 41, comm: xenwatch Not tainted 2.6.32-hardened-r4 #2
[   84.265634] RIP: e030:[<ffffffff8101df1b>]  [<ffffffff8101df1b>]
[   84.265643] RSP: e02b:ffff88007f45bdb0  EFLAGS: 00010016
[   84.265649] RAX: 0000000000000000 RBX: 0000000000000002 RCX: 0000000000004ed5
[   84.265655] RDX: 0000000000000000 RSI: 0000000000000002 RDI: ffffffff815f0070
[   84.265662] RBP: 0000000000000800 R08: 0000000000000008 R09: 0000000000000054
[   84.265669] R10: ffff88007f45bd00 R11: 000000007ffffff2 R12: 000000000f000000
[   84.265678] R13: 0000000000000003 R14: ffff88007f45024f R15: 0000000000000001
[   84.265691] FS:  0000688a73a006f0(0000) GS:ffff880001e7a000(0000) knlGS:0000000000000000
[   84.265700] CS:  e033 DS: 0000 ES: 0000 CR0: 000000008005003b
[   84.265706] CR2: ffffffffff5fc310 CR3: 000000007caab000 CR4: 0000000000000660
[   84.265713] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   84.265720] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[   84.265727] Process xenwatch (pid: 41, threadinfo ffff88007f45a000, task ffff88007f4506d0)
[   84.265734] Stack:
[   84.265740]  0000000000000000 ffffffff815f0070 0000000000000000 0000ffffffffffff
[   84.265751] <0> ffffffff816298d0 ffffffff8101b6da 0000000000000000 000000000000006c
[   84.265764] <0> 0000000000000000 ffffffff81230395 ffffffff81345e74 ffff88007f45be4f


The call trace is just lots of ?s.

Mike

Re: vm.panic_on_oom not quite working

Postby PaX Team » Thu Mar 18, 2010 12:34 pm

gaima wrote:With magic sysrq enabled you can send sysrq commands to a VM with xm (xm sysrq).
Unfortunately when it's spinning even sysrq commands are ignored/lost.
problem is that on the reboot path more and more parts of the kernel are shut down, so we can't really rely on normal debugging mechanisms ;). btw, i forgot to ask but does domU reboot properly when you initiate it normally? i.e., is the problem only with the forced panic/reboot case?
gaima wrote:I did get this though when asking for a backtrace while the VM was actually running.
can you get me the vmlinux corresponding to this oops? also, which exact grsec patch was applied here?
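
With a System.map (or nm output) for the same build, the raw IP from the oops can be mapped to a function by finding the last symbol at or below the address. The sketch below assumes the usual "address type name" line format; the three-line symbol table and its names are made up purely for illustration and do not come from gaima's kernel.

```python
import bisect

def parse_system_map(text):
    """Parse 'address type name' lines into a list sorted by address."""
    symbols = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) >= 3:
            symbols.append((int(parts[0], 16), parts[2]))
    symbols.sort()
    return symbols

def resolve(symbols, address):
    """Return (name, offset) of the symbol containing address, or None."""
    addrs = [addr for addr, _ in symbols]
    i = bisect.bisect_right(addrs, address) - 1
    if i < 0:
        return None
    base, name = symbols[i]
    return name, address - base

# Hypothetical symbol table, NOT taken from the real vmlinux.
EXAMPLE_MAP = """\
ffffffff81000000 T example_start
ffffffff8101de00 t example_func_a
ffffffff8101e000 t example_func_b
"""

symbols = parse_system_map(EXAMPLE_MAP)
# IP value copied from the oops; the result is only meaningful
# against the real symbol table of the kernel that oopsed.
print(resolve(symbols, 0xFFFFFFFF8101DF1B))
```

This is a sketch of the lookup only; it does not decode the instruction at that address, which still needs the actual vmlinux.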

Re: vm.panic_on_oom not quite working

Postby gaima » Thu Mar 18, 2010 12:57 pm

PaX Team wrote:
gaima wrote:With magic sysrq enabled you can send sysrq commands to a VM with xm (xm sysrq).
Unfortunately when it's spinning even sysrq commands are ignored/lost.
problem is that on the reboot path more and more parts of the kernel are shut down, so we can't really rely on normal debugging mechanisms ;). btw, i forgot to ask but does domU reboot properly when you initiate it normally? i.e., is the problem only with the forced panic/reboot case?


Regular rebooting works perfectly...

PaX Team wrote:can you get me the vmlinux corresponding to this oops? also, which exact grsec patch was applied here?


vmlinux supplied by PM.
The patch set is http://distfiles.gentoo.org/distfiles/h ... as.tar.bz2, which includes grsecurity-2.1.14-2.6.32.8-201002200811.


Thanks
Mike

Re: vm.panic_on_oom not quite working

Postby PaX Team » Thu Mar 18, 2010 4:18 pm

gaima wrote:vmlinux supplied by PM.
The patch set is http://distfiles.gentoo.org/distfiles/h ... as.tar.bz2, which includes grsecurity-2.1.14-2.6.32.8-201002200811.
can you try a newer grsec or PaX patch? not that i expect any changes, it's just always better to test the latest and (perhaps not so) greatest ;). the oops itself was in flat_send_IPI_mask which tried to write to an APIC register that apparently wasn't even mapped. i have no idea how xen handles the APIC, so can't tell how the kernel ended up (and failed at) accessing the native APIC... can you test vanilla 32.9 with your config to see if it works at least?
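
To make that reading of the oops concrete: the "Oops: 0002" line carries the x86 page-fault error code, and CR2 is the faulting address. Below is a rough decoder, assuming the standard x86 bit layout (bit 0 = page present, bit 1 = write, bit 2 = user mode) and 4 KiB pages; the function names are invented for this sketch.

```python
def decode_fault(error_code):
    """Decode the low bits of an x86 page-fault error code."""
    return {
        "page_present": bool(error_code & 0x1),  # False => page was not mapped
        "write": bool(error_code & 0x2),         # True  => write access
        "user_mode": bool(error_code & 0x4),     # False => fault in kernel mode
    }

def page_offset(address, page_size=4096):
    """Offset of an address within its page (page_size must be a power of two)."""
    return address & (page_size - 1)

# Values copied from the oops earlier in the thread.
error_code = int("0002", 16)
cr2 = 0xFFFFFFFFFF5FC310

print(decode_fault(error_code))
# {'page_present': False, 'write': True, 'user_mode': False}
print(hex(page_offset(cr2)))
# 0x310
```

That is a kernel-mode write to an unmapped page, which agrees with the "PTE 0" in the dump. The 0x310 offset is where Linux's apicdef.h places APIC_ICR2, one of the registers written when sending an IPI; take that as an observation consistent with the analysis above, not as a verified diagnosis.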

Re: vm.panic_on_oom not quite working

Postby gaima » Fri Mar 19, 2010 6:47 am

PaX Team wrote:can you try a newer grsec or PaX patch? not that i expect any changes, it's just always better to test the latest and (perhaps not so) greatest ;). the oops itself was in flat_send_IPI_mask which tried to write to an APIC register that apparently wasn't even mapped. i have no idea how xen handles the APIC, so can't tell how the kernel ended up (and failed at) accessing the native APIC... can you test vanilla 32.9 with your config to see if it works at least?


Vanilla 2.6.32.10 and 2.6.33.1 are afflicted in the same way.
grsec and PaX are not to blame. It's not even a Xen thing, as I had the same problem on a real machine (although that was with the same kernel I came here with).
Sorry for the noise.


Mike

