
Random hard lockups under cpu load

Posted: Tue Nov 01, 2011 11:51 am
by ramereth
I've been having an ongoing problem since around 2.6.32 where some of our machines that are under high CPU load get into a hard-locked state. We get no output anywhere, and it's been extremely difficult to get any sort of output from the kernel. So far we've figured out that it seems to happen only on our grsec kernels; we're currently running 2.6.39-hardened-r8 on Gentoo Hardened. Are there any known issues where grsec will hard lock a machine under high CPU load? What's the best method for getting debug output on a hardened kernel? I'm at a loss as to what to do next.

Thanks-

Re: Random hard lockups under cpu load

Posted: Thu Nov 03, 2011 6:57 am
by PaX Team
ramereth wrote:I've been having an ongoing problem since around 2.6.32 where some of our machines that are under high CPU load get into a hard-locked state. We get no output anywhere, and it's been extremely difficult to get any sort of output from the kernel. So far we've figured out that it seems to happen only on our grsec kernels; we're currently running 2.6.39-hardened-r8 on Gentoo Hardened. Are there any known issues where grsec will hard lock a machine under high CPU load? What's the best method for getting debug output on a hardened kernel? I'm at a loss as to what to do next.
if these lockups are more or less reproducible, can you try one of the currently supported grsec patches (2.6.32 or 3.0, soon 3.1) instead of .39? second, what is your .config (in particular, is it 32- or 64-bit x86? which PaX features are on, etc.)? as for debugging, you should probably enable some of the lock debugging options (if your workload can tolerate the associated performance impact) and the NMI watchdog. logging through netconsole may also be able to catch the resulting messages, but if you can attach a monitor and take a photo of the screen, that'll do too.
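
something along these lines, for example (this is just a sketch: the exact option names depend on the kernel version, and the addresses, interface and MAC below are placeholders, not values from your setup):

Code:
# .config fragment: lock debugging and lockup/hung-task detection
CONFIG_PROVE_LOCKING=y
CONFIG_DEBUG_SPINLOCK=y
CONFIG_DEBUG_MUTEXES=y
CONFIG_DEBUG_LOCK_ALLOC=y
CONFIG_LOCKUP_DETECTOR=y
CONFIG_DETECT_HUNG_TASK=y
CONFIG_NETCONSOLE=m

# enable the NMI watchdog at runtime (or boot with nmi_watchdog=1)
sysctl -w kernel.nmi_watchdog=1

# send kernel messages to another box over UDP
# syntax: netconsole=[src-port]@[src-ip]/[dev],[tgt-port]@<tgt-ip>/[tgt-mac]
modprobe netconsole netconsole=6665@10.0.0.1/eth0,6666@10.0.0.2/00:11:22:33:44:55
# on the receiving host, listen on that UDP port with netcat or similar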

Re: Random hard lockups under cpu load

Posted: Fri Nov 04, 2011 12:34 pm
by ramereth
PaX Team wrote:if these lockups are more or less reproducible, can you try one of the currently supported grsec patches (2.6.32 or 3.0, soon 3.1) instead of .39? second, what is your .config (in particular, is it 32- or 64-bit x86? which PaX features are on, etc.)? as for debugging, you should probably enable some of the lock debugging options (if your workload can tolerate the associated performance impact) and the NMI watchdog. logging through netconsole may also be able to catch the resulting messages, but if you can attach a monitor and take a photo of the screen, that'll do too.


They are fairly reproducible when I do several concurrent stage4 builds (i.e. using chroots extensively) on 64-bit. I've been able to narrow it down: the issue only trips when I have some of the grsec chroot features enabled. Here's the list of features I have enabled:

Code:
kernel.grsecurity.chroot_deny_shmat = 1
kernel.grsecurity.chroot_deny_unix = 1
kernel.grsecurity.chroot_deny_mount = 1
kernel.grsecurity.chroot_deny_fchdir = 1
kernel.grsecurity.chroot_deny_chroot = 1
kernel.grsecurity.chroot_deny_pivot = 1
kernel.grsecurity.chroot_enforce_chdir = 1
kernel.grsecurity.chroot_deny_mknod = 1
kernel.grsecurity.chroot_restrict_nice = 1
kernel.grsecurity.chroot_deny_sysctl = 1
kernel.grsecurity.chroot_findtask = 1


I have had no luck getting output from netconsole when the machine locks up. I'm going to try to narrow down which chroot feature is actually causing it; I'll also try 3.0.x and get you the kernel config soon. If chroot is the culprit, that would align with the hosts that appear to have this problem: our FTP server uses chroot in its rsyncd config, so I suspect some syncs may be triggering it. I haven't seen this happen on hosts other than the ones that use chroots and occasionally run under high load.
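
For the narrowing down, the rough plan is to flip the chroot sysctls off and then re-enable them one at a time between rounds of builds, something like the sketch below (this assumes kernel.grsecurity.grsec_lock hasn't been set yet, so the values are still writable at runtime):

Code:
# disable all of the chroot restrictions listed above
for f in deny_shmat deny_unix deny_mount deny_fchdir deny_chroot \
         deny_pivot enforce_chdir deny_mknod restrict_nice deny_sysctl findtask; do
    sysctl -w kernel.grsecurity.chroot_$f=0
done
# then re-enable one candidate per round of concurrent stage4 builds, e.g.:
sysctl -w kernel.grsecurity.chroot_deny_fchdir=1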

Re: Random hard lockups under cpu load

Posted: Sun Dec 04, 2011 7:18 pm
by spender
Hi sir,

I believe this problem to be fixed in the latest 3.1.4 patch. It had to do with a lock being acquired by the chroot fchdir code while the kernel was performing an RCU-walk. I believe we had a workaround for you that you've been using, but I just wanted to let you know that the feature is safe to use again.

Thanks,
-Brad