
2.6.35.7 and grsecurity-2.2.0-2.6.35.7-201010121028.patch

Posted: Mon Oct 18, 2010 2:05 pm
by arekm
Hi,

Has anyone seen a problem like this? When the grsec patch is applied, various processes
end up stuck on some systems (see below). Usually ktime_get_ts is at the top of the trace.
The vanilla kernel seems to be working fine.

[13320.258868] INFO: task rpcbind:14163 blocked for more than 120 seconds.
[13320.258874] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[13320.258888] rpcbind D 00000000ffffffff 0 14163 2127 0x00000000
[13320.258894] ffff880041e87cc8 0000000000000086 ffff880041e87d28 ffff880041e87c78
[13320.258899] ffff880001886980 0000000000011140 0000000000011140 0000000000011140
[13320.258903] ffff880041e87fd8 ffff880041e87fd8 ffff880078fbd580 0000000000011140
[13320.258907] Call Trace:
[13320.258920] [<ffffffff810833e9>] ? ktime_get_ts+0xa9/0xe0
[13320.258928] [<ffffffff810ee9c0>] ? sync_page+0x0/0x50
[13320.258935] [<ffffffff8143240e>] io_schedule+0x6e/0xb0
[13320.258938] [<ffffffff810ee9f8>] sync_page+0x38/0x50
[13320.258943] [<ffffffff81432b92>] __wait_on_bit_lock+0x52/0xb0
[13320.258946] [<ffffffff810ee9a2>] __lock_page+0x62/0x70
[13320.258952] [<ffffffff81069ca0>] ? wake_bit_function+0x0/0x40
[13320.258957] [<ffffffff8110c302>] handle_mm_fault+0xb62/0xba0
[13320.258963] [<ffffffff81437f25>] do_page_fault+0x145/0x440
[13320.258968] [<ffffffff8103cf8d>] ? finish_task_switch+0x3d/0xc0
[13320.258972] [<ffffffff81431e83>] ? schedule+0x2c3/0x7e0
[13320.258975] [<ffffffff81434cd1>] ? restore_args+0x0/0x30
[13320.258979] [<ffffffff81434ec4>] page_fault+0x24/0x30
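
For comparing reports like the one above across several machines, here is a minimal sketch (not from the original post) that pulls the hung-task header and the call-trace frames out of saved dmesg text. The function name `parse_hung_tasks` and the regular expressions are my own assumptions, written against the log format shown in this thread only. Frames printed with a leading `?` are addresses found on the stack that are not reliably part of the call chain, so the sketch flags them separately.

```python
import re

# Header printed by the hung-task detector, e.g.
# "[13320.258868] INFO: task rpcbind:14163 blocked for more than 120 seconds."
HUNG_RE = re.compile(
    r"^\[\s*(?P<ts>\d+\.\d+)\]\s+INFO: task (?P<comm>.+):(?P<pid>\d+) "
    r"blocked for more than (?P<secs>\d+) seconds\."
)

# Call-trace frame, e.g. "[<ffffffff8143240e>] io_schedule+0x6e/0xb0".
# A "? " before the symbol marks an unreliable frame.
FRAME_RE = re.compile(
    r"\[<[0-9a-f]+>\]\s+(?P<unreliable>\? )?(?P<sym>[\w.]+)\+0x"
)


def parse_hung_tasks(log_text):
    """Return one dict per hung-task report found in log_text.

    Each dict holds the timestamp, task name, pid, timeout and a list of
    (symbol, reliable) tuples in the order they appear in the trace.
    """
    reports = []
    current = None
    for line in log_text.splitlines():
        header = HUNG_RE.match(line.strip())
        if header:
            current = {
                "ts": float(header.group("ts")),
                "comm": header.group("comm"),
                "pid": int(header.group("pid")),
                "secs": int(header.group("secs")),
                "frames": [],
            }
            reports.append(current)
            continue
        frame = FRAME_RE.search(line)
        if frame and current is not None:
            reliable = frame.group("unreliable") is None
            current["frames"].append((frame.group("sym"), reliable))
    return reports
```

Filtering each report down to its reliable frames makes it easier to see whether different machines block on the same path (here io_schedule, sync_page, __wait_on_bit_lock, __lock_page) even when the unreliable entry at the top differs.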

Re: 2.6.35.7 and grsecurity-2.2.0-2.6.35.7-201010121028.patc

Posted: Tue Oct 19, 2010 7:54 am
by arekm
grsecurity-2.2.0-2.6.35.5-201009241805.patch seems to be fine (applied here to 2.6.35.6)
grsecurity-2.2.0-2.6.35.6-201009281623.patch is problematic (applied here to 2.6.35.7)
grsecurity-2.2.0-2.6.35.7-201010121028.patch also problematic (applied here to 2.6.35.7)

Re: 2.6.35.7 and grsecurity-2.2.0-2.6.35.7-201010121028.patc

Posted: Tue Oct 19, 2010 8:14 am
by spender
Are you sure this is a grsecurity issue and not a 2.6.35.7 regression? I ask because there are a number of posts to be found about this problem on vanilla kernels. See:

https://bugzilla.redhat.com/show_bug.cgi?id=576749
http://comments.gmane.org/gmane.linux.r ... ral/370542
http://lkml.org/lkml/2010/10/15/264

-Brad

Re: 2.6.35.7 and grsecurity-2.2.0-2.6.35.7-201010121028.patc

Posted: Tue Oct 19, 2010 9:11 am
by arekm
The problem happens on a Dell Vostro notebook where neither mdadm nor any other RAID is used; the first two links describe RAID-related problems.

The third link (the mail by Pawel Sikora) is the case I'm describing; he is using the same kernels as I am. More people using the same kernel have hit the problem (4 or 5 so far).

At first we thought the problem was generic, but we weren't able to reproduce it on vanilla. We tested our (heavily) patched kernel with and without grsecurity. Only the grsecurity version causes problems.

To be sure that the source of the problem isn't some weird interaction between our other patches and the grsecurity patch, we are going to test vanilla+grsec today.

Were there any changes between the two mentioned grsecurity patches that could cause this?

Re: 2.6.35.7 and grsecurity-2.2.0-2.6.35.7-201010121028.patc

Posted: Fri Oct 22, 2010 3:48 am
by arekm
Pawel asked me to post this here. He was testing vanilla + the grsec patch and was able to reproduce the problem. Unfortunately there is no serial console, so there are only photos of the oopses:

http://pluto.agmk.net/kernel/

There is no softraid on this machine, so the previously mentioned URLs/bug reports about the mdadm problem don't apply here.

Re: 2.6.35.7 and grsecurity-2.2.0-2.6.35.7-201010121028.patc

Posted: Fri Oct 22, 2010 5:09 am
by pawels
hi,

i'd like to say that i don't have a deterministic testcase to trigger the mentioned rpcbind task blocking.
it just happens during normal worktime activity (compiling, walking through automounted nfs shares, etc.)
and memtest doesn't report any errors on either machine.

it happened yesterday during an nfs services restart:
http://pluto.agmk.net/kernel/accidental ... estart.txt

and this morning during machine shutdown (on smartd service shutdown):
http://pluto.agmk.net/kernel/*.jpg

finally, it could be a grsec bug or a vanilla bug exposed by grsec.
i'm not a kernel hacker to prove it from stacktraces :)

Re: 2.6.35.7 and grsecurity-2.2.0-2.6.35.7-201010121028.patc

Posted: Fri Oct 22, 2010 11:17 pm
by spender
I looked at the interdiff between the patch you said was working and the patch that wasn't working. There's nothing in there that would cause the problems you're exhibiting.

The changes between the two patches were: aio range checking (fixed upstream later), the correct Xen fix (which ended up in 2.6.35.7), some minor refcount changes, a compat mode bugfix (fixed upstream later), and some memsets for infoleaks in various drivers (fixed upstream later).
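
For anyone who wants a quick first look before reading a full interdiff, a rough sketch follows. It is not the interdiff tool and not necessarily how the comparison above was done: it only lists which files each of two unified-diff patches touches and reports the difference. The helper names `touched_files` and `compare_patches` are hypothetical, and the `strip` parameter assumes paths carry one leading directory component, as with `patch -p1`.

```python
import re

# "+++ path" header line of a unified diff. An added line inside a hunk
# whose content itself starts with "++ " would also match, so treat the
# result as approximate.
NEW_FILE_HEADER = re.compile(r"^\+\+\+ (?P<path>\S+)")


def touched_files(patch_text, strip=1):
    """Return the set of file paths a unified diff modifies or creates."""
    files = set()
    for line in patch_text.splitlines():
        match = NEW_FILE_HEADER.match(line)
        if not match or match.group("path") == "/dev/null":
            continue  # not a header, or the file is being deleted
        parts = match.group("path").split("/")
        files.add("/".join(parts[strip:]))
    return files


def compare_patches(old_text, new_text, strip=1):
    """Compare the file sets touched by two patches."""
    old = touched_files(old_text, strip)
    new = touched_files(new_text, strip)
    return {
        "only_old": sorted(old - new),
        "only_new": sorted(new - old),
        "both": sorted(old & new),
    }
```

This only narrows down where to look. Files that appear in both patches still need an actual content comparison, for which interdiff from patchutils is the proper tool.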

http://www.gossamer-threads.com/lists/l ... el/1290016
confirms it's not caused by grsecurity/PaX

-Brad

Re: 2.6.35.7 and grsecurity-2.2.0-2.6.35.7-201010121028.patc

Posted: Fri Nov 05, 2010 6:36 pm
by linkfanel
Hello,

I've got a similar problem:

[ 9120.104058] INFO: task rdnssd:2712 blocked for more than 120 seconds.
[ 9120.104064] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 9120.104067] rdnssd        D c248acd0     0  2712   1123 0x00000000
[ 9120.104072]  dcfc06c0 00000086 df402e00 c248acd0 00000002 00115503 c21313a0 00115526
[ 9120.104079]  dcfc06c0 defede90 c248acd0 00000002 0024fe52 defede88 0006644d 00250426
[ 9120.104085]  00000000 00066420 c1c2bca4 00000000 523cff34 dccd4700 00066408 00000002
[ 9120.104091] Call Trace:
[ 9120.104103]  [<00115503>] ? __generic_unplug_device+0x23/0x30
[ 9120.104107]  [<00115526>] ? generic_unplug_device+0x16/0x20
[ 9120.104112]  [<0024fe52>] ? io_schedule+0x22/0x40
[ 9120.104116]  [<0006644d>] ? sync_page+0x2d/0x40
[ 9120.104120]  [<00250426>] ? __wait_on_bit_lock+0x46/0x90
[ 9120.104123]  [<00066420>] ? sync_page+0x0/0x40
[ 9120.104126]  [<00066408>] ? __lock_page+0xa8/0xb0
[ 9120.104132]  [<0003fa10>] ? wake_bit_function+0x0/0x60
[ 9120.104137]  [<0007a63c>] ? handle_mm_fault+0x8dc/0x8f0
[ 9120.104144]  [<000f2800>] ? ext3_group_add+0x10a0/0x1750
[ 9120.104149]  [<0001fc9f>] ? do_page_fault+0x13f/0x400
[ 9120.104153]  [<0001fb60>] ? do_page_fault+0x0/0x400
[ 9120.104157]  [<00251830>] ? error_code+0x8c/0x94
[ 9120.104162]  [<00010202>] ? intel_pmu_enable_all+0x82/0xd0
[ 9120.104166]  [<002513fa>] ? restore_all+0x0/0x18


It happens overnight with grsec, but hasn't happened in a week on a vanilla kernel. It seems always to start with the same process: rdnssd doing a regular fork()+exec(), but the child getting stuck apparently before the exec() call completes. Then other processes like ps start getting randomly stuck too.
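
To make the pattern concrete, here is a minimal sketch of a plain fork()+exec() in Python. It is not rdnssd's actual code (rdnssd is written in C), and the function name `spawn_and_wait` is made up for illustration. The window between `os.fork()` returning in the child and `os.execvp()` replacing the process image is where the child appears to get stuck in the reports above.

```python
import os


def spawn_and_wait(argv):
    """Fork, exec argv in the child, and return the child's exit code.

    Returns 127 if the exec itself fails, and a negative signal number
    if the child was killed by a signal.
    """
    pid = os.fork()
    if pid == 0:
        # Child: replace the process image. execvp only returns (by
        # raising) on failure, in which case we exit without running
        # any of the parent's cleanup handlers.
        try:
            os.execvp(argv[0], argv)
        finally:
            os._exit(127)
    # Parent: block until the child terminates.
    _, status = os.waitpid(pid, 0)
    if os.WIFEXITED(status):
        return os.WEXITSTATUS(status)
    return -os.WTERMSIG(status)
```

On an affected system, a child hung in uninterruptible sleep (state D) at this point would leave the parent blocked in `waitpid` as well, which would fit with other processes piling up afterwards, though that part is a guess on my side.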

If it's not caused by grsec, then it sure is exposed by it :/