full learning mode crashes box (2.0rc3)
Posted:
Mon Aug 18, 2003 4:46 am
by Sleight of Mind
On my router box I recently installed a kernel with the 2.0rc3 patch. I have used the grsec patch series for quite some time now but never fully exploited its ACL system, so I started experimenting with it.
Since the learning system sounded nice to me, I did not write any ACLs myself but started by enabling the full learning system as described in /etc/grsec/acl (from gradm2):
Code:
gradm -F -L /etc/grsec/learning.logs
While running, the learning mode has crashed my box 3 times now: once during a configure run and twice during an 'emerge rsync' (yes, this box is running Gentoo). Both times it crashed on rsync, random garbage (binary?!) was appended to my learning log file.
Might there be an error in the learning code that causes crashes under heavy load? I know both of the processes that crashed the box access loads of files; the rsync operation in particular accesses every file in the portage tree.
Please let me know if you need more information.
Posted:
Wed Aug 20, 2003 5:19 am
by spender
This is a known bug, but it may take a while for me to fix, since I don't know the cause of it. I know that part of the cause is sleeping in the gr_add_learn_entry() function in gracl_learn.c; however, I don't know for certain whether it's just a hardlock or an oops/panic (though I do not think it's the latter, as I've not yet seen messages indicating either). If it is a hardlock, I don't know which lock is causing the problem. I've ruled out the big kernel lock and interrupts being disabled as the problem. Apparently it happens on both UP and SMP, which also rules some things out. I've checked most of the usage around the kernel, except for every location where capable() is used (around 600 locations). Could you try disabling the call to security_learn in gracl_cap.c and see if you can still generate the crash? It would help a lot if you could debug it for me. I'm still in the process of moving and won't be able to do it myself for another week or so.
-Brad
Posted:
Wed Aug 20, 2003 6:39 am
by Sleight of Mind
I changed gracl_cap.c to:
Code:
if ((curracl->mode & GR_LEARN)
    && cap_raised(current->cap_effective, cap)) {
        /*
        security_learn(GR_LEARN_AUDIT_MSG, current->role->rolename,
                       current->role->roletype, current->uid,
                       current->gid, current->exec_file ?
                       gr_to_filename(current->exec_file->f_dentry,
                       current->exec_file->f_vfsmnt) : curracl->filename,
                       curracl->filename, 0UL,
                       0UL, "", (unsigned long) cap, NIPQUAD(current->curr_ip));
        */
        return 1;
}
I then recompiled and rebooted into the new kernel. After enabling full learning mode again I ran emerge sync, which went well because I had synced less than a day before and the number of files accessed was relatively small.
So I compiled some stuff to see if that would also hold. The configure and make went OK, but during the copying of files (make install) I got another hang (a hardlock AFAIK, since the keyboard just stops responding).
I guess this call is not the cause of the bug. Any more suggestions?
Posted:
Wed Aug 20, 2003 6:53 am
by spender
Look through the grsecurity directory and find the other uses of security_learn. There are only a handful of them. Remove them one by one until the problem disappears. Then add back all of the removed ones except the last one you removed. This should tell us what the problem is. (It's a relief that the problem isn't capable(); fixing the problem will be much easier if we can find the source.)
-Brad
Posted:
Thu Aug 21, 2003 5:50 am
by Sleight of Mind
All the uses of security_learn:
Code:
$ grep -n security_learn *
gracl.c:1387: security_learn(GR_LEARN_AUDIT_MSG, role->rolename, role->roletype,
gracl.c:2394: security_learn(GR_LEARN_AUDIT_MSG, current->role->rolename,
gracl_cap.c:58: security_learn(GR_LEARN_AUDIT_MSG, current->role->rolename,
gracl_ip.c:107: security_learn(GR_IP_LEARN_MSG, current->role->rolename,
gracl_ip.c:117: security_learn(GR_IP_LEARN_MSG, current->role->rolename,
gracl_ip.c:173: security_learn(GR_IP_LEARN_MSG, current->role->rolename,
Cutting the problem into two roughly equal parts, I started by commenting out all calls in gracl_ip.c. This seemed to solve the problem, so I added one call back, which made the problem return. Switching to the other calls in gracl_ip.c also made the problem return, so I commented all three out again and tried really hard to crash the system. It eventually happened.
Since I tested gracl_cap.c before, only the 2 calls in gracl.c remain. I will test those later this week/day. One thing that crossed my mind is that it was quite hard to crash the box when all 3 uses of security_learn in gracl_ip.c were commented out; I had to really stress the system. Maybe the cause of the bug is not a particular call to security_learn but the number of calls to it?
I'll post again once I've tested the 2 calls in gracl.c.
Posted:
Thu Aug 21, 2003 6:32 am
by spender
It's a bit of both. If you want to be able to crash it more reliably, change
#define LEARN_BUFFER_SLOTS 256
to:
#define LEARN_BUFFER_SLOTS 10
in both gracl_learn.c in the kernel and grlearn.c in gradm.
The lockup happens when the buffer is full at the moment a function that isn't supposed to sleep calls security_learn.
Note that when you remove a security_learn call from gracl.c, you'll have to compensate in your usage with more socket/IP access to generate the same amount of logs.
-Brad
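The failure condition described in this post can be modelled in user space. The following is only a minimal sketch and not the grsecurity code: `struct learn_buf`, `learn_try_add`, `learn_drain` and `LEARN_ENTRY_LEN` are hypothetical names, and only `LEARN_BUFFER_SLOTS` comes from the thread. It shows a fixed-slot buffer whose producer reports "full" instead of sleeping; a full buffer is exactly the state in which a caller that must not sleep would otherwise be forced to wait.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

#define LEARN_BUFFER_SLOTS 10  /* small value, as suggested for reproducing */
#define LEARN_ENTRY_LEN    64  /* hypothetical entry size */

/* Hypothetical fixed-slot buffer; NOT the real gracl_learn.c layout. */
struct learn_buf {
    char slots[LEARN_BUFFER_SLOTS][LEARN_ENTRY_LEN];
    int head;   /* index of the oldest entry */
    int count;  /* number of slots in use */
};

/* Producer: returns 0 on success, -1 if every slot is in use.
 * A sleeping implementation would wait here instead of returning. */
static int learn_try_add(struct learn_buf *b, const char *msg)
{
    int tail;

    assert(b != NULL && msg != NULL);
    if (b->count == LEARN_BUFFER_SLOTS)
        return -1;
    tail = (b->head + b->count) % LEARN_BUFFER_SLOTS;
    snprintf(b->slots[tail], LEARN_ENTRY_LEN, "%s", msg);
    b->count++;
    return 0;
}

/* Consumer: copies out the oldest entry; returns -1 if the buffer is empty. */
static int learn_drain(struct learn_buf *b, char *out, size_t outlen)
{
    assert(b != NULL && out != NULL && outlen > 0);
    if (b->count == 0)
        return -1;
    snprintf(out, outlen, "%s", b->slots[b->head]);
    b->head = (b->head + 1) % LEARN_BUFFER_SLOTS;
    b->count--;
    return 0;
}
```

With only 10 slots, a burst of entries fills the buffer long before a consumer can drain it, which matches the advice above to shrink the buffer in order to hit the full-buffer case more reliably.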
Posted:
Thu Aug 21, 2003 7:57 am
by Sleight of Mind
A simple solution might be hard-limiting the number of learns per time interval and just discarding calls once the limit is reached. This shouldn't be that hard to implement either, I think. I know this makes the learning process less than 100% accurate, but I (and I think many others with me) prefer a longer learning phase over a box that crashes while learning.
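The idea of capping learns per time interval could look roughly like the sketch below. It is a user-space illustration under stated assumptions, not code from grsecurity: `struct learn_limiter`, `learn_allow`, `LEARN_MAX_PER_INTERVAL` and `LEARN_INTERVAL_SECS` are all hypothetical names and values. The caller passes in the current time, and entries beyond the limit are counted and discarded.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

#define LEARN_MAX_PER_INTERVAL 100  /* hypothetical limit per window */
#define LEARN_INTERVAL_SECS    1    /* hypothetical window length */

/* Hypothetical fixed-window limiter state. */
struct learn_limiter {
    time_t window_start;    /* start of the current window */
    unsigned int used;      /* entries accepted in this window */
    unsigned long dropped;  /* entries discarded so far */
};

/* Returns 1 if the entry may be logged, 0 if it must be discarded. */
static int learn_allow(struct learn_limiter *l, time_t now)
{
    assert(l != NULL);
    /* Start a fresh window once the interval has elapsed. */
    if (now - l->window_start >= LEARN_INTERVAL_SECS) {
        l->window_start = now;
        l->used = 0;
    }
    if (l->used >= LEARN_MAX_PER_INTERVAL) {
        l->dropped++;  /* keep a count so the data loss is visible */
        return 0;
    }
    l->used++;
    return 1;
}
```

A call site would check `learn_allow(&limiter, time(NULL))` before logging. The `dropped` counter makes the cost of this approach explicit: every discarded entry is an access the generated ruleset will not know about.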
Posted:
Thu Aug 21, 2003 8:52 am
by spender
I'd rather fix the real problem though. Previously, we did just drop the learning logs if we reached a limit. However, the problem is that you lose lots of important data, which is critical when you're doing learning across the entire system. Otherwise, when you generate the learning logs, you won't get a working ruleset. Duplicating usage might not ever fix the problem either, if the access getting dropped always occurs after a large number of sequential accesses. If we find the function at fault, we can fix it without having to drop data.
-Brad
Posted:
Fri Aug 29, 2003 1:50 pm
by spender
Any news on this? If not, I'll begin my own debugging this weekend if I have spare time. I won't release an rc3 for 2.4.22 until it's fixed.
-Brad
Posted:
Fri Aug 29, 2003 2:04 pm
by Sleight of Mind
I didn't test much in the past week due to exams. Editing, recompiling and rebooting, then trying to crash the box takes a lot of time, especially when commenting out the most frequently occurring calls to security_learn.
After this weekend I'll have some spare time again and I am willing to help with debugging if you like, although I don't know the grsecurity code very well; I haven't really studied it. Just let me know if you think I can help with something.