Kernel 2.4.31 + grsecurity 2.1.6 SMP server freezes.

PostPosted: Thu Sep 15, 2005 4:34 am
by andutt
Hi, we have experienced this problem ever since the 2.4.26 kernel, which was the last stable one for us.

We have hyperthreading enabled, and the problem seems to be most frequent on servers running Java with many IP requests. We have been able to capture an oops, and here is the stack trace run through ksymoops.

A related post that I made earlier can be reviewed here:

http://forums.grsecurity.net/viewtopic. ... highlight=

I hope this will help with debugging... let me know if I can assist with anything more...

Code:
Stack: ffffffff d46bc060 00000001 efe0d8f4 efe0d800 df092c00 eadf4000 c02c3529
00000002 00000002 00000282 00000020 3abc7df9 c4307ccc 00000001 efe0d82c
Call Trace: c02be3d7 c02c3529 c02a131e c02a3d4c c02a4125 c02a3d4c c028f483
f8906fae f890719f f8906fae c028f74f c016dff9 c0153a00 c01567f8 c0163b57
c014f2e0 c014f3ae c014d1ce
Code: f3 90 7e f5 e9 f5 e6 ff ff 80 3a 00 f3 90 7e f9 e9 22 e8 ff
Using defaults from ksymoops -t elf32-i386 -a i386


Trace; c02be3d7 <ip_rt_ioctl+6407/a080>
Trace; c02c3529 <rpc_restart_call+df9/3420>
Trace; c02a131e <tcp_read_sock+f20e/13f90>
Trace; c02a3d4c <tcp_read_sock+11c3c/13f90>
Trace; c02a4125 <tcp_read_sock+12015/13f90>
Trace; c02a3d4c <tcp_read_sock+11c3c/13f90>
Trace; c028f483 <ip_cmsg_recv+3213/5ea0>

Code;  00000000 Before first symbol
00000000 <_EIP>:
Code;  00000000 Before first symbol
   0:   f3 90                     repz nop
Code;  00000002 Before first symbol
   2:   7e f5                     jle    fffffff9 <_EIP+0xfffffff9>
Code;  00000004 Before first symbol
   4:   e9 f5 e6 ff ff            jmp    ffffe6fe <_EIP+0xffffe6fe>
Code;  00000009 Before first symbol
   9:   80 3a 00                  cmpb   $0x0,(%edx)
Code;  0000000c Before first symbol
   c:   f3 90                     repz nop
Code;  0000000e Before first symbol
   e:   7e f9                     jle    9 <_EIP+0x9>
Code;  00000010 Before first symbol
  10:   e9 22 e8 ff 00            jmp    ffe837 <_EIP+0xffe837>

CPU: 1
EIP: c02f26e1
EFLAGS: 00000086
eax: 00006803 ebx: f73b0000 ecx: f4e5e720 edx: f77bc480 esi: d4692000
ds: 0018 es:0018 ss: 0018
Process java (pid: 714, stackpage=f73b1000)
Stack: c0150708 53403d80 00870000 f73b0000 00002d92 d4692000 00000000 c0164b69
d4692000 f73b1f98 fffffffc fffffff2 00000005 43288560 0002f849 f73b0000
40027780 51d2eac0 534041bc c0151633 00002d92 00000000 534041d4 40027780
Call Trace: c0150708 c0164b69 c0151633
Code: 7e f5 e9 d0 f6 ff ff e8 d7 d9 e5 ff e9 ee f7 ff ff e8 cd d9


>>ebx; f73b0000 <_end+36fbd720/3850d780>
>>ecx; f4e5e720 <_end+34a6be40/3850d780>
>>edx; f77bc480 <_end+373c9ba0/3850d780>
>>esi; d4692000 <_end+1429f720/3850d780>

Trace; c0150708 <dump_stack+108/1860>
Trace; c0164b69 <__out_of_line_bug+2b9/740>
Trace; c0151633 <dump_stack+1033/1860>

Code;  00000000 Before first symbol
00000000 <_EIP>:
Code;  00000000 Before first symbol
   0:   7e f5                     jle    fffffff7 <_EIP+0xfffffff7>
Code;  00000002 Before first symbol
   2:   e9 d0 f6 ff ff            jmp    fffff6d7 <_EIP+0xfffff6d7>
Code;  00000007 Before first symbol
   7:   e8 d7 d9 e5 ff            call   ffe5d9e3 <_EIP+0xffe5d9e3>
Code;  0000000c Before first symbol
   c:   e9 ee f7 ff ff            jmp    fffff7ff <_EIP+0xfffff7ff>
Code;  00000011 Before first symbol
  11:   e8 cd d9 00 00            call   d9e3 <_EIP+0xd9e3>


1 warning and 1 error issued.  Results may not be reliable.
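
For anyone who wants to skim a decoded trace like the one above, here is a minimal Python sketch that pulls the "Trace;" lines apart into address, symbol, offset and span. The function names and the 0x2000 threshold are my own assumptions, not part of ksymoops. The idea is only that a frame like `tcp_read_sock+f20e/13f90` sits very deep inside a very large span, which may mean the nearest known symbol is not the real function, so such frames deserve extra caution (ksymoops itself warns that the results may not be reliable).

```python
import re

# Matches ksymoops lines such as: "Trace; c02be3d7 <ip_rt_ioctl+6407/a080>"
TRACE_RE = re.compile(
    r"^Trace;\s+([0-9a-f]{8})\s+<([^+>]+)\+([0-9a-f]+)/([0-9a-f]+)>"
)


def parse_trace(text):
    """Return (address, symbol, offset, span) tuples, one per 'Trace;' line."""
    frames = []
    for line in text.splitlines():
        match = TRACE_RE.match(line.strip())
        if match:
            addr, symbol, offset, span = match.groups()
            frames.append((int(addr, 16), symbol, int(offset, 16), int(span, 16)))
    return frames


def suspicious(frames, max_span=0x2000):
    """Hypothetical heuristic: keep frames whose symbol span exceeds max_span.

    The threshold is an arbitrary assumption chosen for illustration.
    """
    return [frame for frame in frames if frame[3] > max_span]
```

Feeding the ksymoops output to `parse_trace` and then to `suspicious` gives a short list of frames whose symbol names should not be taken at face value.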

PostPosted: Fri Sep 16, 2005 4:02 am
by SG
Not only Java. My servers running Resin/Java froze, and my DB servers running Sybase froze too.
I am now using 2.6.12.4 with the vserver 2.0 patch and all is OK. I will wait for a PaX/grsec patch for the kernel with the vserver patch, or will try to do it myself.
I don't have enough experience for that, so I will wait for other people with good kernel skills...

PostPosted: Fri Sep 16, 2005 5:22 am
by andutt
SG, are you using the tg3 network driver on your systems?

It has come to my attention that the problem may be related to that driver, which seems plausible since all the oopses have been IP related. There have been quite big changes in the tg3 driver from 2.4.26 to 2.4.27 (about 3500 lines) and later.

So we will try a clean vanilla kernel without grsec to see if we can reproduce the error. If so, maybe it's not a grsecurity issue. I will post our progress here...

/Andutt

PostPosted: Fri Sep 16, 2005 11:33 am
by SG
No. Intel 100/1000 and r8169.

I tried disabling SMP in the kernel and nothing froze. I am also using vanilla + the vserver patch, and nothing freezes with SMP enabled in the kernel.

This is not a bug in the vanilla kernel. I made my own grsec patch for the 2.4.28 kernel from the 2.4.26 version. With this patch my SMP boxes do not freeze!

PostPosted: Mon Sep 19, 2005 2:37 am
by andutt
OK SG, what code sections did you skip in your modified patch?

PostPosted: Mon Sep 19, 2005 3:23 am
by SG
I simply applied the patch grsecurity-2.0-2.4.26.patch to the 2.4.28 kernel. The 'patch' program rejected some parts of this patch and I applied them manually. I didn't skip any code. If you want, I can send grsecurity-2.0-2.4.28.4sg.tar.gz to you. But it is a patch for an old kernel and is not current anymore.

This proves there are some bugs with smtp in the grsec or PaX patch after the 2.4.26 patch...
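
When a patch is forward-ported like this, GNU patch saves every hunk it could not apply in a file ending in `.rej` next to the target file. Below is a minimal Python sketch for listing those leftovers before applying them by hand. The function name is my own, and the tree path is whatever kernel source directory you patched.

```python
from pathlib import Path


def find_rejects(tree):
    """Return a sorted list of *.rej files found anywhere under `tree`."""
    return sorted(p for p in Path(tree).rglob("*.rej") if p.is_file())
```

An empty list after the manual merge is a quick sanity check that no rejected hunk was forgotten, assuming the `.rej` files are deleted as each one is handled.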

PostPosted: Mon Sep 19, 2005 3:25 am
by SG
Sorry, not smtp :) SMP!

PostPosted: Mon Sep 19, 2005 5:36 pm
by spender
andutt: have you tried the latest patch in http://grsecurity.net/~spender ?

-Brad

PostPosted: Mon Sep 19, 2005 7:35 pm
by bani
Does the latest patch fix this bug?

PostPosted: Mon Sep 19, 2005 7:40 pm
by spender
It's not fixed yet, which is why 2.1.7 hasn't been officially released yet. The PaX team is working on debugging the problem, however, and a release will hopefully be forthcoming shortly.

-Brad

GRsec SMP freeze - possible cause

PostPosted: Tue Sep 20, 2005 1:14 am
by Kp
I can't tell from the wording of spender's message whether he expects 2.1.7 to resolve the hang, so please forgive me if you've already solved this. I looked through the 2.1.7 patch, and it still has the section which concerned me.

After SG mentioned that the 2.4.26 patch worked OK on a 2.4.28 kernel, I went back and checked the changelog for GRsecurity between those two versions. The biggest change I noticed was the switch from macros to functions for logging, which led me to look at the GRsecurity logging code (in /grsecurity/grsec_log.c). I'm a bit concerned about how many locks are being grabbed in the BEGIN_LOCKS macro. I don't doubt that they're needed to prevent oopses and other unpleasant events during logging, but grabbing that many locks at once could easily lead to deadlock if there are other parts of the code that don't acquire them in the same order.
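
To make the lock-ordering concern concrete, here is a minimal userland sketch in Python. It is not GRsecurity code: the lock names, `LOCK_ORDER` and the `OrderedLocks` helper are all hypothetical. It shows the usual remedy, which is that every code path takes the locks it needs in one agreed global order, so two paths that ask for the same locks in opposite orders cannot deadlock each other.

```python
import threading

# Hypothetical stand-ins for the locks a logging path might need.
LOCKS = {
    "lock_a": threading.Lock(),
    "lock_b": threading.Lock(),
    "lock_c": threading.Lock(),
}
# The single global order that every code path must respect.
LOCK_ORDER = ["lock_a", "lock_b", "lock_c"]


class OrderedLocks:
    """Acquire the named locks in the global order; release in reverse."""

    def __init__(self, *names):
        unknown = set(names) - set(LOCK_ORDER)
        if unknown:
            raise KeyError(f"unknown locks: {sorted(unknown)}")
        # Callers may list the names in any order; we sort them here.
        self.names = sorted(set(names), key=LOCK_ORDER.index)

    def __enter__(self):
        for name in self.names:
            LOCKS[name].acquire()
        return self

    def __exit__(self, exc_type, exc, tb):
        for name in reversed(self.names):
            LOCKS[name].release()
        return False


def worker(first, second, out, rounds=1000):
    """Repeatedly take two locks, asking for them in the caller's order."""
    for _ in range(rounds):
        with OrderedLocks(first, second):
            out.append(1)
```

Running `worker("lock_a", "lock_b", out)` and `worker("lock_b", "lock_a", out)` in two threads finishes normally here. With plain nested `acquire()` calls in those opposite orders, the same two threads could each end up holding one lock while waiting forever for the other.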

First, is anyone able to reliably reproduce the hang/oops quickly (minutes or hours, not days or weeks)? Second, to spender or any readers who feel comfortable hacking on the code: how much trouble would it be to rearrange the logging code so that it acquired locks just long enough to copy the required data into local variables, rather than the current design of holding all the locks for the entire logging function? I'm willing to tackle this if you're busy with other projects, but I must warn that I don't normally leave userland.
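
The "copy under the lock, then log without it" idea can be sketched in userland Python as below. Everything here is hypothetical (the `Task` record, its fields, the `sink` list), and it is only an illustration of the pattern. Whether it carries over to kernel code depends on constraints this sketch ignores, such as how much data fits in local variables on a small kernel stack.

```python
import threading
from dataclasses import dataclass, field


@dataclass
class Task:
    """Hypothetical record protected by its own lock."""
    pid: int
    comm: str
    lock: threading.Lock = field(default_factory=threading.Lock)


def snapshot(task):
    """Hold the lock only long enough to copy the fields we need."""
    with task.lock:
        return task.pid, task.comm


def log_event(task, message, sink):
    """Format and emit the log line after the lock has been released."""
    pid, comm = snapshot(task)
    sink.append(f"{comm}[{pid}]: {message}")
```

The slow part (formatting and writing) runs on the private copies, so the lock is held only for the two field reads.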

PostPosted: Tue Sep 20, 2005 7:38 pm
by spender
The locks that are acquired in BEGIN_LOCKS() were the same locks that were acquired in the security_alert() macros prior to the change to logging functions. Since some of the data we generate in the logging functions are pathnames, it's not possible to copy that data into local variables due to stack size limitations. Also, the problem andutt was reporting involves an oops, which you don't get with deadlocks. I'm still curious to see if anyone has these problems with the latest patch in ~spender.

BTW Kp, the problem I was referring to with my previous post was the boot-time problem with ld.so.

-Brad

PostPosted: Wed Sep 21, 2005 2:18 am
by andutt
Hi Brad

I have compiled in your latest patch and will give it some testing. I will be back with info as soon as I have something to share.

PostPosted: Thu Sep 22, 2005 2:44 am
by andutt
I should also say that we had to enable nmi_watchdog to be able to get a kernel oops; before that the system just froze, with nothing on the console. So maybe that points to a deadlock situation? We will see... I'm now testing 2 parallel scenarios and will be back with info.
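
For others trying to reproduce this: on kernels of this era the NMI watchdog is normally switched on with the `nmi_watchdog=` boot parameter (1 or 2 on x86, depending on the APIC mode used). A minimal Python sketch for checking whether a box was actually booted with it follows. The function names are my own, and reading `/proc/cmdline` assumes a Linux system with /proc mounted.

```python
def nmi_watchdog_mode(cmdline):
    """Return the value of nmi_watchdog= in a kernel command line, or None."""
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        if key == "nmi_watchdog" and sep:
            return value
    return None


def read_cmdline(path="/proc/cmdline"):
    """Read the command line the running kernel was booted with."""
    with open(path) as handle:
        return handle.read()
```

Calling `nmi_watchdog_mode(read_cmdline())` on each test server makes it easy to confirm that a silent freeze was not simply a machine booted without the watchdog.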

My experience in this

PostPosted: Fri Nov 04, 2005 6:28 am
by marcolinuz
Hello all,

I'm here because I recently had the same problem with the same configuration mentioned in this thread, and I would like to share my experience.

My server freezes with an average uptime of 4 weeks, and my auditing reveals that it mostly happens in the early morning... :O)

Mumble.. mumble.. In the early morning my server performs a mirror of itself for delivery reasons, and this operation makes heavy use of "wget".
My mirror script works in an aggressive mode over several instances of Jetty and Tomcat (both are Java processes).

In my opinion, this high connection stress on the JVM could sometimes cause the bug to jump out of the hat.. :O)
Now I will try to reduce the "aggressiveness" of wget by adding the flags "--limit-rate=200k --wait=1".
If the freeze comes again, then I could suppose that network load on the JVM is not what triggers the bug.
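
A minimal Python sketch of how a mirror script could build the throttled wget command follows. The two throttling flags are the ones quoted above; the `--mirror` flag, the function name and the example URL are assumptions for illustration, since the actual mirror script is not shown.

```python
import shlex


def build_wget_command(url, limit_rate="200k", wait_seconds=1):
    """Build a throttled wget invocation as an argument list."""
    return [
        "wget",
        "--mirror",                      # assumed: recursive mirroring mode
        f"--limit-rate={limit_rate}",    # cap the download bandwidth
        f"--wait={wait_seconds}",        # pause between successive requests
        url,
    ]


# Hypothetical target; replace with the real Jetty or Tomcat instance.
print(shlex.join(build_wget_command("http://localhost:8080/")))
```

Keeping the rate and wait values as parameters makes it easy to loosen them step by step and see at which load level the freeze comes back, if it does.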

Anyway, I look forward to the new release of grsecurity that fixes this important bug.

PS: Thanks to the grsecurity team for the great job.

By Marco.