Kernel oops: rcu_sched detected stalls on CPUs/tasks
Posted: Sun Jul 27, 2014 11:40 am
Since a while we have a problem on a system running a grsecurity kernel. It occurs irregularly and we are not able to reproduce the incident. Kernel is 3.15.5 with grsec patch from 201407100036, however we've been seeing this for a while.
We regularly get either a complete system hang or the system becomes almost unusable with segfaults etc. We see messages like the one below in dmesg.
The system is a virtual kvm machine running on an Intel Xeon system. I can give further info if needed. I found this older forum entry which seems to have somewhat simliar stack traces:
viewtopic.php?t=3060&p=12154
Any ideas what I could do to debug further? As said, it happens irregularly which makes debugging harder.
Jul 26 05:02:01 milch kernel: INFO: rcu_sched detected stalls on CPUs/tasks: { 1} (detected by 0, t=6002 jiffies, g=6339563, c=6339562, q=6087)
Jul 26 05:02:01 milch kernel: sending NMI to all CPUs:
Jul 26 05:02:01 milch kernel: NMI backtrace for cpu 0
Jul 26 05:02:01 milch kernel: CPU: 0 PID: 20938 Comm: wc Not tainted 3.15.5-grsec #1
Jul 26 05:02:01 milch kernel: Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
Jul 26 05:02:01 milch kernel: task: ffff880039188000 ti: ffff880039188510 task.ti: ffff880039188510
Jul 26 05:02:01 milch kernel: RIP: 0010:[<ffffffff810268ef>] [<ffffffff810268ef>] ffffffff810268ef
Jul 26 05:02:01 milch kernel: RSP: 0000:ffff88007fc03e28 EFLAGS: 00000046
Jul 26 05:02:01 milch kernel: RAX: 0000000000000c00 RBX: 0000000000000002 RCX: 0000000000000007
Jul 26 05:02:01 milch kernel: RDX: ffffffff81501640 RSI: 0000000000000002 RDI: 0000000000000086
Jul 26 05:02:01 milch kernel: RBP: 0000000000000c00 R08: ffffffff816a58c4 R09: 0000000000000000
Jul 26 05:02:01 milch kernel: R10: 0000000000000000 R11: ffffffff81026700 R12: 0000000000000003
Jul 26 05:02:01 milch kernel: R13: ffffffff81546f40 R14: 0000000000000000 R15: ffffffff81547000
Jul 26 05:02:01 milch kernel: FS: 000003fe22b6c700(0000) GS:ffff88007fc00000(0000) knlGS:0000000000000000
Jul 26 05:02:01 milch kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jul 26 05:02:01 milch kernel: CR2: 000003fe226cf000 CR3: 00000000013d1000 CR4: 00000000000407f0
Jul 26 05:02:01 milch kernel: Stack:
Jul 26 05:02:01 milch kernel: ffff88007fc0c968 0000000000000086 0000000000002710 ffff88007fc0c470
Jul 26 05:02:01 milch kernel: 0000000000000001 ffffffff81024390 ffffffff81546f40 ffffffff810923c8
Jul 26 05:02:01 milch kernel: 00000000ffffffff 00000000000017c7 0000000000000083 ffffffff8157d740
Jul 26 05:02:01 milch kernel: Call Trace:
Jul 26 05:02:01 milch kernel: <IRQ>
Jul 26 05:02:01 milch kernel: [<ffffffff81024390>] ? arch_trigger_all_cpu_backtrace+0x42/0x6a
Jul 26 05:02:01 milch kernel: [<ffffffff810923c8>] ? rcu_check_callbacks+0x40b/0x4e5
Jul 26 05:02:01 milch kernel: [<ffffffff81065870>] ? update_process_times+0x31/0x56
Jul 26 05:02:01 milch kernel: [<ffffffff81099800>] ? tick_sched_timer+0x30/0x4c
Jul 26 05:02:01 milch kernel: [<ffffffff81077404>] ? __run_hrtimer.isra.27+0x50/0xa9
Jul 26 05:02:01 milch kernel: [<ffffffff81026717>] ? default_inquire_remote_apic+0xf/0xf
Jul 26 05:02:01 milch kernel: [<ffffffff81077a9b>] ? hrtimer_interrupt+0xd9/0x1d4
Jul 26 05:02:01 milch kernel: [<ffffffff810231f3>] ? smp_apic_timer_interrupt+0x36/0x46
Jul 26 05:02:01 milch kernel: [<ffffffff813c6c26>] ? apic_timer_interrupt+0x86/0x90
Jul 26 05:02:01 milch kernel: <EOI>
Jul 26 05:02:01 milch kernel: Code: 5f ff f6 c4 10 75 f2 44 89 e0 c1 e0 18 89 04 25 10 a3 5f ff 89 d8 09 e8 81 cd 00 04 00 00 83 fb 02 0f 44 c5 89 04 25 00 a3 5f ff <57> 9d 66 66 90 66 90 48 83 c4 10 5b 5d 41 5c c3 41 54 55 53 89
Jul 26 05:02:01 milch kernel: NMI backtrace for cpu 1
Jul 26 05:02:01 milch kernel: CPU: 1 PID: 20893 Comm: pidof Not tainted 3.15.5-grsec #1
Jul 26 05:02:01 milch kernel: Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
Jul 26 05:02:01 milch kernel: task: ffff88007d388fa0 ti: ffff88007d3894b0 task.ti: ffff88007d3894b0
Jul 26 05:02:01 milch kernel: RIP: 0010:[<ffffffff813c4f04>] [<ffffffff813c4f04>] ffffffff813c4f04
Jul 26 05:02:01 milch kernel: RSP: 0018:ffff880003a8bc90 EFLAGS: 00000097
Jul 26 05:02:01 milch kernel: RAX: 0000000000002502 RBX: ffff8800390439c0 RCX: 0000000000000010
Jul 26 05:02:01 milch kernel: RDX: 0000000000000025 RSI: ffff880003a8bd70 RDI: ffff8800390441c8
Jul 26 05:02:01 milch kernel: RBP: ffff880003a8bd70 R08: 0000000000000001 R09: 0000000000000001
Jul 26 05:02:01 milch kernel: R10: 00000000fffffffc R11: ffff88007b06c650 R12: ffff88007b06c650
Jul 26 05:02:01 milch kernel: R13: ffff88007d32b200 R14: ffffffff81545080 R15: 0000000000000000
Jul 26 05:02:01 milch kernel: FS: 000003a3a5bda700(0000) GS:ffff88007fd00000(0000) knlGS:0000000000000000
Jul 26 05:02:01 milch kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jul 26 05:02:01 milch kernel: CR2: 000003a3a5df8038 CR3: 00000000013d3000 CR4: 00000000000407f0
Jul 26 05:02:01 milch kernel: Stack:
Jul 26 05:02:01 milch kernel: ffffffff81067580 ffff88002bffa550 ffff88007b06c650 0000000000000000
Jul 26 05:02:01 milch kernel: ffffffff8112e320 ffff88007d0db8e0 ffffffff81545080 000000017c824c80
Jul 26 05:02:01 milch kernel: ffffffff810ebbc9 ffff88007d10002c ffff880003a8be20 ffff88007d10002c
Jul 26 05:02:01 milch kernel: Call Trace:
Jul 26 05:02:01 milch kernel: [<ffffffff81067580>] ? __lock_task_sighand+0x45/0x6d
Jul 26 05:02:01 milch kernel: [<ffffffff8112e320>] ? do_task_stat+0x100/0x8cd
Jul 26 05:02:01 milch kernel: [<ffffffff810ebbc9>] ? do_last.isra.43+0xa99/0xbad
Jul 26 05:02:01 milch kernel: [<ffffffff810e96c0>] ? link_path_walk+0x4d/0x79b
Jul 26 05:02:01 milch kernel: [<ffffffff8112a457>] ? proc_single_show+0x43/0x77
Jul 26 05:02:01 milch kernel: [<ffffffff810fcf66>] ? seq_read+0x292/0x571
Jul 26 05:02:01 milch kernel: [<ffffffff810e08c8>] ? vfs_read+0x123/0x1ec
Jul 26 05:02:01 milch kernel: [<ffffffff810e0d70>] ? SyS_read+0x41/0x84
Jul 26 05:02:01 milch kernel: [<ffffffff813c60ca>] ? system_call_fastpath+0x16/0x1b
Jul 26 05:02:01 milch kernel: Code: 8b 7c 24 08 be 00 02 00 00 e8 f2 ab c9 ff 31 c0 5a c3 b8 00 01 00 00 f0 66 0f c1 07 0f b6 d4 38 c2 74 0a 8a 07 38 d0 74 04 f3 90 <eb> f6 c3 9c 58 66 66 90 66 90 48 89 c6 fa 66 66 90 66 66 90 ba
Jul 26 05:02:01 milch kernel: INFO: NMI handler (ffffffff810102af) took too long to run: 24.023 msecs
Jul 26 05:06:04 milch kernel: INFO: rcu_sched detected stalls on CPUs/tasks: { 1} (detected by 0, t=6002 jiffies, g=6340253, c=6340252, q=7283)
We regularly get either a complete system hang or the system becomes almost unusable with segfaults etc. We see messages like the one below in dmesg.
The system is a virtual kvm machine running on an Intel Xeon system. I can give further info if needed. I found this older forum entry which seems to have somewhat simliar stack traces:
viewtopic.php?t=3060&p=12154
Any ideas what I could do to debug further? As said, it happens irregularly which makes debugging harder.
Jul 26 05:02:01 milch kernel: INFO: rcu_sched detected stalls on CPUs/tasks: { 1} (detected by 0, t=6002 jiffies, g=6339563, c=6339562, q=6087)
Jul 26 05:02:01 milch kernel: sending NMI to all CPUs:
Jul 26 05:02:01 milch kernel: NMI backtrace for cpu 0
Jul 26 05:02:01 milch kernel: CPU: 0 PID: 20938 Comm: wc Not tainted 3.15.5-grsec #1
Jul 26 05:02:01 milch kernel: Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
Jul 26 05:02:01 milch kernel: task: ffff880039188000 ti: ffff880039188510 task.ti: ffff880039188510
Jul 26 05:02:01 milch kernel: RIP: 0010:[<ffffffff810268ef>] [<ffffffff810268ef>] ffffffff810268ef
Jul 26 05:02:01 milch kernel: RSP: 0000:ffff88007fc03e28 EFLAGS: 00000046
Jul 26 05:02:01 milch kernel: RAX: 0000000000000c00 RBX: 0000000000000002 RCX: 0000000000000007
Jul 26 05:02:01 milch kernel: RDX: ffffffff81501640 RSI: 0000000000000002 RDI: 0000000000000086
Jul 26 05:02:01 milch kernel: RBP: 0000000000000c00 R08: ffffffff816a58c4 R09: 0000000000000000
Jul 26 05:02:01 milch kernel: R10: 0000000000000000 R11: ffffffff81026700 R12: 0000000000000003
Jul 26 05:02:01 milch kernel: R13: ffffffff81546f40 R14: 0000000000000000 R15: ffffffff81547000
Jul 26 05:02:01 milch kernel: FS: 000003fe22b6c700(0000) GS:ffff88007fc00000(0000) knlGS:0000000000000000
Jul 26 05:02:01 milch kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jul 26 05:02:01 milch kernel: CR2: 000003fe226cf000 CR3: 00000000013d1000 CR4: 00000000000407f0
Jul 26 05:02:01 milch kernel: Stack:
Jul 26 05:02:01 milch kernel: ffff88007fc0c968 0000000000000086 0000000000002710 ffff88007fc0c470
Jul 26 05:02:01 milch kernel: 0000000000000001 ffffffff81024390 ffffffff81546f40 ffffffff810923c8
Jul 26 05:02:01 milch kernel: 00000000ffffffff 00000000000017c7 0000000000000083 ffffffff8157d740
Jul 26 05:02:01 milch kernel: Call Trace:
Jul 26 05:02:01 milch kernel: <IRQ>
Jul 26 05:02:01 milch kernel: [<ffffffff81024390>] ? arch_trigger_all_cpu_backtrace+0x42/0x6a
Jul 26 05:02:01 milch kernel: [<ffffffff810923c8>] ? rcu_check_callbacks+0x40b/0x4e5
Jul 26 05:02:01 milch kernel: [<ffffffff81065870>] ? update_process_times+0x31/0x56
Jul 26 05:02:01 milch kernel: [<ffffffff81099800>] ? tick_sched_timer+0x30/0x4c
Jul 26 05:02:01 milch kernel: [<ffffffff81077404>] ? __run_hrtimer.isra.27+0x50/0xa9
Jul 26 05:02:01 milch kernel: [<ffffffff81026717>] ? default_inquire_remote_apic+0xf/0xf
Jul 26 05:02:01 milch kernel: [<ffffffff81077a9b>] ? hrtimer_interrupt+0xd9/0x1d4
Jul 26 05:02:01 milch kernel: [<ffffffff810231f3>] ? smp_apic_timer_interrupt+0x36/0x46
Jul 26 05:02:01 milch kernel: [<ffffffff813c6c26>] ? apic_timer_interrupt+0x86/0x90
Jul 26 05:02:01 milch kernel: <EOI>
Jul 26 05:02:01 milch kernel: Code: 5f ff f6 c4 10 75 f2 44 89 e0 c1 e0 18 89 04 25 10 a3 5f ff 89 d8 09 e8 81 cd 00 04 00 00 83 fb 02 0f 44 c5 89 04 25 00 a3 5f ff <57> 9d 66 66 90 66 90 48 83 c4 10 5b 5d 41 5c c3 41 54 55 53 89
Jul 26 05:02:01 milch kernel: NMI backtrace for cpu 1
Jul 26 05:02:01 milch kernel: CPU: 1 PID: 20893 Comm: pidof Not tainted 3.15.5-grsec #1
Jul 26 05:02:01 milch kernel: Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
Jul 26 05:02:01 milch kernel: task: ffff88007d388fa0 ti: ffff88007d3894b0 task.ti: ffff88007d3894b0
Jul 26 05:02:01 milch kernel: RIP: 0010:[<ffffffff813c4f04>] [<ffffffff813c4f04>] ffffffff813c4f04
Jul 26 05:02:01 milch kernel: RSP: 0018:ffff880003a8bc90 EFLAGS: 00000097
Jul 26 05:02:01 milch kernel: RAX: 0000000000002502 RBX: ffff8800390439c0 RCX: 0000000000000010
Jul 26 05:02:01 milch kernel: RDX: 0000000000000025 RSI: ffff880003a8bd70 RDI: ffff8800390441c8
Jul 26 05:02:01 milch kernel: RBP: ffff880003a8bd70 R08: 0000000000000001 R09: 0000000000000001
Jul 26 05:02:01 milch kernel: R10: 00000000fffffffc R11: ffff88007b06c650 R12: ffff88007b06c650
Jul 26 05:02:01 milch kernel: R13: ffff88007d32b200 R14: ffffffff81545080 R15: 0000000000000000
Jul 26 05:02:01 milch kernel: FS: 000003a3a5bda700(0000) GS:ffff88007fd00000(0000) knlGS:0000000000000000
Jul 26 05:02:01 milch kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jul 26 05:02:01 milch kernel: CR2: 000003a3a5df8038 CR3: 00000000013d3000 CR4: 00000000000407f0
Jul 26 05:02:01 milch kernel: Stack:
Jul 26 05:02:01 milch kernel: ffffffff81067580 ffff88002bffa550 ffff88007b06c650 0000000000000000
Jul 26 05:02:01 milch kernel: ffffffff8112e320 ffff88007d0db8e0 ffffffff81545080 000000017c824c80
Jul 26 05:02:01 milch kernel: ffffffff810ebbc9 ffff88007d10002c ffff880003a8be20 ffff88007d10002c
Jul 26 05:02:01 milch kernel: Call Trace:
Jul 26 05:02:01 milch kernel: [<ffffffff81067580>] ? __lock_task_sighand+0x45/0x6d
Jul 26 05:02:01 milch kernel: [<ffffffff8112e320>] ? do_task_stat+0x100/0x8cd
Jul 26 05:02:01 milch kernel: [<ffffffff810ebbc9>] ? do_last.isra.43+0xa99/0xbad
Jul 26 05:02:01 milch kernel: [<ffffffff810e96c0>] ? link_path_walk+0x4d/0x79b
Jul 26 05:02:01 milch kernel: [<ffffffff8112a457>] ? proc_single_show+0x43/0x77
Jul 26 05:02:01 milch kernel: [<ffffffff810fcf66>] ? seq_read+0x292/0x571
Jul 26 05:02:01 milch kernel: [<ffffffff810e08c8>] ? vfs_read+0x123/0x1ec
Jul 26 05:02:01 milch kernel: [<ffffffff810e0d70>] ? SyS_read+0x41/0x84
Jul 26 05:02:01 milch kernel: [<ffffffff813c60ca>] ? system_call_fastpath+0x16/0x1b
Jul 26 05:02:01 milch kernel: Code: 8b 7c 24 08 be 00 02 00 00 e8 f2 ab c9 ff 31 c0 5a c3 b8 00 01 00 00 f0 66 0f c1 07 0f b6 d4 38 c2 74 0a 8a 07 38 d0 74 04 f3 90 <eb> f6 c3 9c 58 66 66 90 66 90 48 89 c6 fa 66 66 90 66 66 90 ba
Jul 26 05:02:01 milch kernel: INFO: NMI handler (ffffffff810102af) took too long to run: 24.023 msecs
Jul 26 05:06:04 milch kernel: INFO: rcu_sched detected stalls on CPUs/tasks: { 1} (detected by 0, t=6002 jiffies, g=6340253, c=6340252, q=7283)