Over the last year and a half, a busy ssh server of ours (~100
concurrent users) running grsecurity has deadlocked every few months,
on kernel versions ranging from 2.6.16 up to 2.6.23. Recently we
recompiled the kernel with spinlock debugging, set up a serial console
to the machine, and found that 3 of the processors were in spinlock
lockup while the fourth was not responding. From reading the code, we
believe the lock all three processors were spinning on was the
sighand->siglock of some process's signal handler struct. The complete
backtraces we have from these CPUs are below.
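For readers less familiar with the spinlock debugging output: to our understanding, with CONFIG_DEBUG_SPINLOCK enabled the 2.6 kernel's spin_lock retries a trylock for a bounded number of iterations and, if the lock never becomes free, prints "BUG: spinlock lockup on CPU#N, comm/pid, lock-address" plus a stack dump, then keeps spinning. The sketch below is a userspace approximation of that idea using C11 atomics; demo_spinlock_t, demo_trylock, demo_unlock and demo_spin_lock_debug are hypothetical names, not kernel APIs, and the sketch returns after reporting (instead of spinning forever) so it can be tested.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical userspace stand-in for a kernel spinlock. */
typedef struct {
    atomic_flag held;
} demo_spinlock_t;

#define DEMO_SPINLOCK_INIT { ATOMIC_FLAG_INIT }

/* One acquisition attempt: test_and_set returns the previous value,
 * so false means the flag was clear and we now hold the lock. */
static bool demo_trylock(demo_spinlock_t *l)
{
    assert(l != NULL);
    return !atomic_flag_test_and_set_explicit(&l->held, memory_order_acquire);
}

static void demo_unlock(demo_spinlock_t *l)
{
    assert(l != NULL);
    atomic_flag_clear_explicit(&l->held, memory_order_release);
}

/* Debug-style acquire: retry the trylock up to max_loops times. If the
 * lock never frees up, print a lockup report naming the *waiting* CPU
 * and task, then return false. (The kernel keeps spinning instead.) */
static bool demo_spin_lock_debug(demo_spinlock_t *l, unsigned long max_loops,
                                 int cpu, const char *comm, int pid)
{
    for (unsigned long i = 0; i < max_loops; i++) {
        if (demo_trylock(l))
            return true;
    }
    fprintf(stderr, "BUG: spinlock lockup on CPU#%d, %s/%d, %p\n",
            cpu, comm, pid, (void *)l);
    return false;
}
```

Note that each report identifies the CPU and task that are waiting for the lock, so the CPU actually holding it does not appear in these messages.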
A brief search of the grsecurity patch suggests that grsecurity has
some large blocks of signal-logging code that run while holding the
siglock (in particular, force_sig_info calls gr_log_signal, which
generates and potentially prints log messages, with the siglock held).
Our current best guess is that a race condition or a rarely exercised
path in this logging code is causing the deadlocks.
These deadlocks are typically immediately preceded by waves of
grsecurity segfault notifications in the system logs. (We are aware
that the segfaults themselves are not caused by grsecurity.)
We are not entirely certain that grsecurity is at fault, because the
machine also runs an openafs kernel module. However, AFS has only a
few short (~6-line) blocks of code that hold the siglock, and they all
appear safe, so grsecurity seems the more likely culprit.
Let us know if any other information would be useful.
[I sent this message to grsecurity@grsecurity.net about a month ago but received no reply, and the archives at http://grsecurity.net/pipermail/grsecurity/ trail off in October, so we're not sure whether that list is still active.]
[Fri Nov 23 16:09:17 2007]BUG: spinlock lockup on CPU#1, fileziOSyb/28366, ffff81019fbdd108
[Fri Nov 23 16:09:17 2007]
[Fri Nov 23 16:09:17 2007]Call Trace:
[Fri Nov 23 16:09:17 2007] [<ffffffff803458c7>] _raw_spin_lock+0x117/0x150
[Fri Nov 23 16:09:17 2007] [<ffffffff8052f610>] _spin_lock+0x50/0x70
[Fri Nov 23 16:09:17 2007] [<ffffffff802314a1>] release_task+0x91/0x340
[Fri Nov 23 16:09:17 2007] [<ffffffff80232d2e>] do_exit+0x84e/0x990
[Fri Nov 23 16:09:17 2007] [<ffffffff8052daa9>] mutex_unlock+0x9/0x10
[Fri Nov 23 16:09:17 2007] [<ffffffff80232f33>] sys_exit+0x13/0x20
[Fri Nov 23 16:09:17 2007] [<ffffffff8021d472>] ia32_sysret+0x0/0xa
[Fri Nov 23 16:09:17 2007]
[Fri Nov 23 16:09:17 2007]BUG: spinlock lockup on CPU#0, fileziOSyb/28359, ffff81019fbdd108
[Fri Nov 23 16:09:17 2007]
[Fri Nov 23 16:09:17 2007]Call Trace:
[Fri Nov 23 16:09:17 2007] [<ffffffff803458c7>] _raw_spin_lock+0x117/0x150
[Fri Nov 23 16:09:17 2007] [<ffffffff8052f9f1>] _spin_lock_irq+0x51/0x70
[Fri Nov 23 16:09:17 2007] [<ffffffff802392a8>] sigprocmask+0x38/0xf0
[Fri Nov 23 16:09:17 2007] [<ffffffff8023c31e>] sys_rt_sigprocmask+0x7e/0x100
[Fri Nov 23 16:09:17 2007] [<ffffffff8021e107>] sys32_rt_sigprocmask+0x77/0x110
[Fri Nov 23 16:09:17 2007] [<ffffffff8021deb6>] sys32_mmap2+0x76/0xf0
[Fri Nov 23 16:09:17 2007] [<ffffffff8021d472>] ia32_sysret+0x0/0xa
[Fri Nov 23 16:09:17 2007]
[Fri Nov 23 16:10:53 2007]BUG: spinlock lockup on CPU#3, fileziOSyb/2959, ffff81019fbdd108
[Fri Nov 23 16:10:53 2007]
[Fri Nov 23 16:10:53 2007]Call Trace:
[Fri Nov 23 16:10:53 2007] [<ffffffff803458c7>] _raw_spin_lock+0x117/0x150
[Fri Nov 23 16:10:53 2007] [<ffffffff8052f9f1>] _spin_lock_irq+0x51/0x70
[Fri Nov 23 16:10:53 2007] [<ffffffff802392a8>] sigprocmask+0x38/0xf0
[Fri Nov 23 16:10:53 2007] [<ffffffff8023c31e>] sys_rt_sigprocmask+0x7e/0x100
[Fri Nov 23 16:10:53 2007] [<ffffffff80251a91>] compat_sys_futex+0x71/0x110
[Fri Nov 23 16:10:53 2007] [<ffffffff8021e107>] sys32_rt_sigprocmask+0x77/0x110
[Fri Nov 23 16:10:53 2007] [<ffffffff8021d472>] ia32_sysret+0x0/0xa
[Fri Nov 23 16:10:53 2007]
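To double-check that all three reports refer to the same lock, here is a small sketch that pulls the CPU number, task name, pid and lock address out of each "BUG: spinlock lockup" line. struct lockup_report and parse_lockup_line() are our own hypothetical names, and the format string assumes exactly the line layout shown in the console output above.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Fields of one "BUG: spinlock lockup on CPU#N, comm/pid, addr" line. */
struct lockup_report {
    int cpu;
    char comm[32];
    int pid;
    unsigned long long lock_addr;
};

/* Parse a console line such as
 * "[Fri Nov 23 16:09:17 2007]BUG: spinlock lockup on CPU#1, fileziOSyb/28366, ffff81019fbdd108".
 * Returns 1 on success, 0 if the line is not a lockup report. */
static int parse_lockup_line(const char *line, struct lockup_report *r)
{
    assert(line != NULL && r != NULL);
    /* Skip the timestamp prefix by locating the message itself. */
    const char *p = strstr(line, "BUG: spinlock lockup on CPU#");
    if (p == NULL)
        return 0;
    /* %31[^/] reads the task name up to the '/' that precedes the pid;
     * %llx reads the lock address as hex without a 0x prefix. */
    return sscanf(p, "BUG: spinlock lockup on CPU#%d, %31[^/]/%d, %llx",
                  &r->cpu, r->comm, &r->pid, &r->lock_addr) == 4;
}
```

Run over the three BUG lines above, it reports lock address ffff81019fbdd108 for CPUs 1, 0 and 3.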