lookup_mnt deadlock on 2.6.32.32

Discuss usability issues, general maintenance, and general support issues for a grsecurity-enabled system.

lookup_mnt deadlock on 2.6.32.32

Postby wraith » Sat Apr 16, 2011 8:48 am

Hi,

The following deadlock was observed on all eight cores on a busy shell server running v2.6.32.32 with grsecurity patches and the BFS scheduler.

It was hinted this might be grsec related. Any ideas?

Code: Select all
BUG: soft lockup - CPU#0 stuck for 61s! [irssi:32373]
CPU 0:
Pid: 32373, comm: irssi Tainted: G      D    2.6.32.32-grsec-bfsha #1 ProLiant DL380 G5
RIP: 0010:[<ffffffff816d487a>]  [<ffffffff816d487a>] _spin_lock+0xa/0x20
RSP: 0018:ffff8801b5611c70  EFLAGS: 00000293
RAX: 0000000000007974 RBX: 0000000000000000 RCX: 0000000000000000
RDX: 0000000000000001 RSI: ffff88041a92fa23 RDI: ffffffff8197d440
RBP: ffffffff81002f9e R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 00000000ffffff01
R13: ffffffff810d7f50 R14: ffff8801b5611b88 R15: ffff880100000000
FS:  00007f7285d176e0(0000) GS:ffff880028200000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00000000f7794000 CR3: 00000001b76ed000 CR4: 00000000000006f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Call Trace:
 [<ffffffff810df760>] ? lookup_mnt+0x10/0x50
 [<ffffffff810cdbe0>] ? __follow_mount+0x80/0xa0
 [<ffffffff810cde23>] ? do_lookup+0x143/0x270
 [<ffffffff810cffde>] ? __link_path_walk+0x1ce/0x1230
 [<ffffffff8102f627>] ? __wake_up_common+0x47/0x80
 [<ffffffff810d131d>] ? path_walk+0x7d/0x100
 [<ffffffff810d14eb>] ? do_path_lookup+0x5b/0x60
 [<ffffffff810d2467>] ? user_path_at+0x97/0xd0
 [<ffffffff810c7dda>] ? cp_new_stat+0x16a/0x180
 [<ffffffff810c7be7>] ? vfs_fstatat+0x37/0x80
 [<ffffffff81058569>] ? ktime_get_ts+0x69/0xd0
 [<ffffffff810c7eaf>] ? sys_newstat+0x1f/0x50
 [<ffffffff810d6b33>] ? sys_poll+0x73/0x110
 [<ffffffff81003285>] ? device_not_available+0x15/0x20
 [<ffffffff8100259b>] ? system_call_fastpath+0x16/0x1b
BUG: soft lockup - CPU#1 stuck for 61s! [ninja:8891]
CPU 1:
Pid: 8891, comm: ninja Tainted: G      D    2.6.32.32-grsec-bfsha #1 ProLiant DL380 G5
RIP: 0010:[<ffffffff816d4880>]  [<ffffffff816d4880>] _spin_lock+0x10/0x20
RSP: 0018:ffff880413a77c80  EFLAGS: 00000297
RAX: 0000000000007574 RBX: 0000000000000000 RCX: 0000000000000000
RDX: 0000000000000001 RSI: ffff88041dc21c64 RDI: ffffffff8197d440
RBP: ffffffff81002f9e R08: 0000000000000000 R09: ffff880075faad80
R10: 0000000000404633 R11: 0000000000000246 R12: ffffffff810e2d67
R13: ffff8800409012c0 R14: ffff880413a77d68 R15: ffff880413a77d78
FS:  00007fa7fed456e0(0000) GS:ffff880028240000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00007fa7fed5a000 CR3: 000000032dd23000 CR4: 00000000000006f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Call Trace:
 [<ffffffff810df760>] ? lookup_mnt+0x10/0x50
 [<ffffffff810cdbe0>] ? __follow_mount+0x80/0xa0
 [<ffffffff810cde23>] ? do_lookup+0x143/0x270
 [<ffffffff8109cab1>] ? handle_mm_fault+0x611/0x910
 [<ffffffff810cffde>] ? __link_path_walk+0x1ce/0x1230
 [<ffffffff810b796d>] ? get_partial_node+0x1d/0x90
 [<ffffffff810ba86c>] ? __slab_alloc+0x9c/0x490
 [<ffffffff810d131d>] ? path_walk+0x7d/0x100
 [<ffffffff810d14eb>] ? do_path_lookup+0x5b/0x60
 [<ffffffff810d27ab>] ? do_filp_open+0x10b/0xc80
 [<ffffffff810de2b9>] ? expand_files+0x49/0x260
 [<ffffffff810de51a>] ? alloc_fd+0x4a/0x140
 [<ffffffff810c159d>] ? do_sys_open+0x9d/0x160
 [<ffffffff8100259b>] ? system_call_fastpath+0x16/0x1b
BUG: soft lockup - CPU#2 stuck for 61s! [udevd:1801]
CPU 2:
Pid: 1801, comm: udevd Tainted: G      D    2.6.32.32-grsec-bfsha #1 ProLiant DL380 G5
RIP: 0010:[<ffffffff816d487a>]  [<ffffffff816d487a>] _spin_lock+0xa/0x20
RSP: 0018:ffff88041bea9cf0  EFLAGS: 00000293
RAX: 0000000000007674 RBX: 0000000000000000 RCX: 0000000000000000
RDX: 0000000000000001 RSI: ffff88041dc20ca3 RDI: ffffffff8197d440
RBP: ffffffff81002f9e R08: 0000000000000000 R09: 00007fffc5cea155
R10: 0000000000000072 R11: 0000000000000297 R12: ffff88041e2f9488
R13: ffff88041e2f9478 R14: ffffffff81058569 R15: 000000021e2f9468
FS:  00007f9fe61aa770(0000) GS:ffff880028280000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 000000000040d44c CR3: 000000041c7e2000 CR4: 00000000000006f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Call Trace:
 [<ffffffff810df760>] ? lookup_mnt+0x10/0x50
 [<ffffffff810cdbe0>] ? __follow_mount+0x80/0xa0
 [<ffffffff810cde23>] ? do_lookup+0x143/0x270
 [<ffffffff8108383e>] ? __lock_page+0x5e/0x70
 [<ffffffff810cffde>] ? __link_path_walk+0x1ce/0x1230
 [<ffffffff810d131d>] ? path_walk+0x7d/0x100
 [<ffffffff810d14eb>] ? do_path_lookup+0x5b/0x60
 [<ffffffff810d15c3>] ? user_path_parent+0x73/0xa0
 [<ffffffff810d1937>] ? do_unlinkat+0x47/0x290
 [<ffffffff81022ef2>] ? do_page_fault+0x162/0x390
 [<ffffffff8100259b>] ? system_call_fastpath+0x16/0x1b
BUG: soft lockup - CPU#3 stuck for 61s! [fail2ban-server:10251]
CPU 3:
Pid: 10251, comm: fail2ban-server Tainted: G      D    2.6.32.32-grsec-bfsha #1 ProLiant DL380 G5
RIP: 0010:[<ffffffff816d487e>]  [<ffffffff816d487e>] _spin_lock+0xe/0x20
RSP: 0018:ffff8802113c9c70  EFLAGS: 00000293
RAX: 0000000000007c74 RBX: 0000000000000000 RCX: 0000000000000000
RDX: 0000000000000001 RSI: ffff88041a92fa23 RDI: ffffffff8197d440
RBP: ffffffff81002f9e R08: 0000000000000000 R09: 0000000000000000
R10: 00007f7854db2a50 R11: 0000000000000246 R12: ffff8802113c9be0
R13: ffffffff810500c0 R14: ffff8800039abe80 R15: 0000000000000000
FS:  0000000042495950(0063) GS:ffff8800282c0000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000018 CR3: 0000000412e7a000 CR4: 00000000000006f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Call Trace:
 [<ffffffff810df760>] ? lookup_mnt+0x10/0x50
 [<ffffffff810cdbe0>] ? __follow_mount+0x80/0xa0
 [<ffffffff810cde23>] ? do_lookup+0x143/0x270
 [<ffffffff810db739>] ? touch_atime+0x79/0x180
 [<ffffffff810cffde>] ? __link_path_walk+0x1ce/0x1230
 [<ffffffff810d131d>] ? path_walk+0x7d/0x100
 [<ffffffff810d14eb>] ? do_path_lookup+0x5b/0x60
 [<ffffffff810d2467>] ? user_path_at+0x97/0xd0
 [<ffffffff810ab3ad>] ? free_pages_and_swap_cache+0x9d/0xc0
 [<ffffffff81058569>] ? ktime_get_ts+0x69/0xd0
 [<ffffffff810c7be7>] ? vfs_fstatat+0x37/0x80
 [<ffffffff810c7eaf>] ? sys_newstat+0x1f/0x50
 [<ffffffff810d7cf3>] ? sys_select+0x63/0x190
 [<ffffffff8100259b>] ? system_call_fastpath+0x16/0x1b
BUG: soft lockup - CPU#4 stuck for 61s! [irssi:15169]
CPU 4:
Pid: 15169, comm: irssi Tainted: G      D    2.6.32.32-grsec-bfsha #1 ProLiant DL380 G5
RIP: 0010:[<ffffffff816d4880>]  [<ffffffff816d4880>] _spin_lock+0x10/0x20
RSP: 0018:ffff880250c79c70  EFLAGS: 00000297
RAX: 0000000000007874 RBX: 0000000000000000 RCX: 0000000000000000
RDX: 0000000000000001 RSI: ffff88041a92fa23 RDI: ffffffff8197d440
RBP: ffffffff81002f9e R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 00000000ffffff01
R13: ffffffff810d7f50 R14: ffff880250c79b88 R15: ffff880200000000
FS:  00007f683d4e46e0(0000) GS:ffff880028300000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000404112 CR3: 0000000322068000 CR4: 00000000000006f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Call Trace:
 [<ffffffff810df760>] ? lookup_mnt+0x10/0x50
 [<ffffffff810cdbe0>] ? __follow_mount+0x80/0xa0
 [<ffffffff810cde23>] ? do_lookup+0x143/0x270
 [<ffffffff810cffde>] ? __link_path_walk+0x1ce/0x1230
 [<ffffffff8102f627>] ? __wake_up_common+0x47/0x80
 [<ffffffff810d131d>] ? path_walk+0x7d/0x100
 [<ffffffff810d14eb>] ? do_path_lookup+0x5b/0x60
 [<ffffffff810d2467>] ? user_path_at+0x97/0xd0
 [<ffffffff81376f2b>] ? tty_write+0x37b/0x3a0
 [<ffffffff810c7be7>] ? vfs_fstatat+0x37/0x80
 [<ffffffff81058569>] ? ktime_get_ts+0x69/0xd0
 [<ffffffff810c7eaf>] ? sys_newstat+0x1f/0x50
 [<ffffffff810d6b33>] ? sys_poll+0x73/0x110
 [<ffffffff81003285>] ? device_not_available+0x15/0x20
 [<ffffffff8100259b>] ? system_call_fastpath+0x16/0x1b
BUG: soft lockup - CPU#5 stuck for 61s! [rpc.idmapd:5195]
CPU 5:
Pid: 5195, comm: rpc.idmapd Tainted: G      D    2.6.32.32-grsec-bfsha #1 ProLiant DL380 G5
RIP: 0010:[<ffffffff816d4880>]  [<ffffffff816d4880>] _spin_lock+0x10/0x20
RSP: 0018:ffff880413bcbc80  EFLAGS: 00000293
RAX: 0000000000007774 RBX: 0000000000000000 RCX: 0000000000000000
RDX: 0000000000000001 RSI: ffff88041a92fa23 RDI: ffffffff8197d440
RBP: ffffffff81002f9e R08: 0000000000000000 R09: ffff88029b61a600
R10: 000000004da8327f R11: 0000000000000246 R12: ffff880413bcbfd8
R13: ffff880413bcbfd8 R14: ffff88041e20ea40 R15: 0000000000006d80
FS:  00007f7c2b28b6e0(0000) GS:ffff880028340000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00007f0188e0c000 CR3: 000000041b058000 CR4: 00000000000006f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Call Trace:
 [<ffffffff810df760>] ? lookup_mnt+0x10/0x50
 [<ffffffff810cdbe0>] ? __follow_mount+0x80/0xa0
 [<ffffffff810cde23>] ? do_lookup+0x143/0x270
 [<ffffffff810cffde>] ? __link_path_walk+0x1ce/0x1230
 [<ffffffff810500f0>] ? wake_bit_function+0x0/0x30
 [<ffffffff810d131d>] ? path_walk+0x7d/0x100
 [<ffffffff810d14eb>] ? do_path_lookup+0x5b/0x60
 [<ffffffff810d27ab>] ? do_filp_open+0x10b/0xc80
 [<ffffffff810de2b9>] ? expand_files+0x49/0x260
 [<ffffffff810de51a>] ? alloc_fd+0x4a/0x140
 [<ffffffff810c159d>] ? do_sys_open+0x9d/0x160
 [<ffffffff8100259b>] ? system_call_fastpath+0x16/0x1b
BUG: soft lockup - CPU#6 stuck for 61s! [irssi:21556]
CPU 6:
Pid: 21556, comm: irssi Tainted: G      D    2.6.32.32-grsec-bfsha #1 ProLiant DL380 G5
RIP: 0010:[<ffffffff816d4880>]  [<ffffffff816d4880>] _spin_lock+0x10/0x20
RSP: 0018:ffff880102d8bc70  EFLAGS: 00000297
RAX: 0000000000007a74 RBX: 0000000000000000 RCX: 0000000000000000
RDX: 0000000000000001 RSI: ffff88041a92fa23 RDI: ffffffff8197d440
RBP: ffffffff81002f9e R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 00000000ffffff01
R13: ffffffff810d7f50 R14: ffff880102d8bb88 R15: ffff880000000000
FS:  00007f549bcfd6e0(0000) GS:ffff880028380000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f4e90b09000 CR3: 0000000102c7d000 CR4: 00000000000006f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Call Trace:
 [<ffffffff810df760>] ? lookup_mnt+0x10/0x50
 [<ffffffff810cdbe0>] ? __follow_mount+0x80/0xa0
 [<ffffffff810cde23>] ? do_lookup+0x143/0x270
 [<ffffffff810cffde>] ? __link_path_walk+0x1ce/0x1230
 [<ffffffff810d7f50>] ? pollwake+0x0/0x60
 [<ffffffff810d131d>] ? path_walk+0x7d/0x100
 [<ffffffff810d14eb>] ? do_path_lookup+0x5b/0x60
 [<ffffffff810d2467>] ? user_path_at+0x97/0xd0
 [<ffffffff810500c0>] ? autoremove_wake_function+0x0/0x30
 [<ffffffff810c7dda>] ? cp_new_stat+0x16a/0x180
 [<ffffffff810c7be7>] ? vfs_fstatat+0x37/0x80
 [<ffffffff81058569>] ? ktime_get_ts+0x69/0xd0
 [<ffffffff810c7eaf>] ? sys_newstat+0x1f/0x50
 [<ffffffff810d6b33>] ? sys_poll+0x73/0x110
 [<ffffffff81003285>] ? device_not_available+0x15/0x20
 [<ffffffff8100259b>] ? system_call_fastpath+0x16/0x1b
BUG: soft lockup - CPU#7 stuck for 61s! [links:15071]
CPU 7:
Pid: 15071, comm: links Tainted: G      D    2.6.32.32-grsec-bfsha #1 ProLiant DL380 G5
RIP: 0010:[<ffffffff816d4880>]  [<ffffffff816d4880>] _spin_lock+0x10/0x20
RSP: 0018:ffff88003e05bc80  EFLAGS: 00000297
RAX: 0000000000007b74 RBX: 0000000000000000 RCX: 0000000000000000
RDX: 0000000000000001 RSI: ffff88041a9b6ca4 RDI: ffffffff8197d440
RBP: ffffffff81002f9e R08: 0000000000000000 R09: 0000000000000000
R10: 00000000004de515 R11: 0000000000000246 R12: 0000000000000000
R13: ffff88029aa660a8 R14: ffffffff8131f2a4 R15: 0000000000000010
FS:  00007f6edf0346e0(0000) GS:ffff8800283c0000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000a96000 CR3: 000000013d6a8000 CR4: 00000000000006f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Call Trace:
 [<ffffffff810df760>] ? lookup_mnt+0x10/0x50
 [<ffffffff810cdbe0>] ? __follow_mount+0x80/0xa0
 [<ffffffff810cde23>] ? do_lookup+0x143/0x270
 [<ffffffff810cffde>] ? __link_path_walk+0x1ce/0x1230
 [<ffffffff810d131d>] ? path_walk+0x7d/0x100
 [<ffffffff810d14eb>] ? do_path_lookup+0x5b/0x60
 [<ffffffff810d27ab>] ? do_filp_open+0x10b/0xc80
 [<ffffffff816d28d1>] ? thread_return+0x61/0x430
 [<ffffffff810d909e>] ? dput+0xae/0x160
 [<ffffffff810de2b9>] ? expand_files+0x49/0x260
 [<ffffffff810de51a>] ? alloc_fd+0x4a/0x140
 [<ffffffff810c159d>] ? do_sys_open+0x9d/0x160
 [<ffffffff8100259b>] ? system_call_fastpath+0x16/0x1b


Perhaps the RAX registers give some hint as to which core got stuck first:
Code: Select all
cpu0:RAX: 0000000000007974
cpu1:RAX: 0000000000007574
cpu2:RAX: 0000000000007674
cpu3:RAX: 0000000000007c74
cpu4:RAX: 0000000000007874
cpu5:RAX: 0000000000007774
cpu6:RAX: 0000000000007a74
cpu7:RAX: 0000000000007b74


Any help would be much appreciated.

Edit: Added stack traces from other cores
wraith
 
Posts: 2
Joined: Sat Apr 16, 2011 6:58 am

Re: lookup_mnt deadlock on 2.6.32.32

Postby PaX Team » Mon Apr 18, 2011 11:21 am

wraith wrote:The following deadlock was observed on all eight cores on a busy shell server running v2.6.32.32 with grsecurity patches and the BFS scheduler.
the obvious first question is if you can reproduce it without BFS as well ;). second, can you enable the lock debugging related options and still reproduce the deadlock? that'd give us all the info to track this down hopefully.
PaX Team
 
Posts: 2310
Joined: Mon Mar 18, 2002 4:35 pm

Re: lookup_mnt deadlock on 2.6.32.32

Postby wraith » Mon Apr 18, 2011 2:59 pm

PaX Team wrote:the obvious first question is if you can reproduce it without BFS as well ;). second, can you enable the lock debugging related options and still reproduce the deadlock? that'd give us all the info to track this down hopefully.


Unfortunately this bug seems to be as hard to reproduce as it is annoying. This occured once last week and in February before that. At that time it seemed it might be related to NFS but perhaps not, comparing the stack traces now.

As for BFS, it should have very little to do with the vfs layer though of course it might bring out some lurking race conditions or incorrect locking semantics by altering the scheduling patterns.

Any ideas on how to try to stress test it on a testing machine?
wraith
 
Posts: 2
Joined: Sat Apr 16, 2011 6:58 am

Re: lookup_mnt deadlock on 2.6.32.32

Postby PaX Team » Tue Apr 19, 2011 6:54 am

wraith wrote:Unfortunately this bug seems to be as hard to reproduce as it is annoying. This occured once last week and in February before that. At that time it seemed it might be related to NFS but perhaps not, comparing the stack traces now.

As for BFS, it should have very little to do with the vfs layer though of course it might bring out some lurking race conditions or incorrect locking semantics by altering the scheduling patterns.

Any ideas on how to try to stress test it on a testing machine?
well, since grsec doesn't show up in the backtraces, it's hard to tell who's to blame here. and i assume you don't have a vanilla kernel around to reproduce this issue. so it's a tough call, my best advice is still lockdep but that will cost you some performance and may make it even harder to hit the deadlock... as for the workload, seemingly anything that exercises the VFS path walk code is likely needed to reach the deadlock, so you can try fs benchmarks or any other app that does many path lookups.
PaX Team
 
Posts: 2310
Joined: Mon Mar 18, 2002 4:35 pm


Return to grsecurity support

cron