
refcount overflow detected in ext4 ?

PostPosted: Tue Jun 25, 2013 11:13 am
by nbareil
Hello,


On a new AMD64 server running 3.2.47-grsec-vs2.3.2.16+, every time a task becomes I/O intensive, the kernel crashes with the following logs:

Code: Select all
Jun 25 16:46:27 pouic kernel: PAX: From x.x.x.x: refcount overflow detected in: imap:4578, uid/euid: 1000/1000
Jun 25 16:46:27 pouic kernel: CPU 0
Jun 25 16:46:27 pouic kernel: Pid: 4578, comm: imap Not tainted 3.2.47-grsec-vs2.3.2.16+ #1 HP ProLiant DL120 G7
Jun 25 16:46:27 pouic kernel: RIP: 0010:[<ffffffff810d537e>]  [<ffffffff810d537e>] kfree+0xce/0x120
Jun 25 16:46:27 pouic kernel: RSP: 0018:ffff8800edba9d28  EFLAGS: 00000886
Jun 25 16:46:27 pouic kernel: RAX: 0000000000000002 RBX: ffff8800ed8ac7c0 RCX: 0000000000000000
Jun 25 16:46:27 pouic kernel: RDX: ffff8801045ce000 RSI: 0000000000000080 RDI: ffff88010b000140
Jun 25 16:46:27 pouic kernel: RBP: ffff8800edba9d48 R08: 00000001e45d67aa R09: 0000000000000008
Jun 25 16:46:27 pouic kernel: R10: 000000000000002f R11: 0000000035383a32 R12: ffff88010b01f000
Jun 25 16:46:27 pouic kernel: R13: 0000000000000293 R14: 0000000000000000 R15: ffff8801062b9c80
Jun 25 16:46:27 pouic kernel: FS:  000003262d38b700(0000) GS:ffff88010bc00000(0000) knlGS:0000000000000000
Jun 25 16:46:27 pouic kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jun 25 16:46:27 pouic kernel: CR2: ffffffffff600400 CR3: 0000000001578000 CR4: 00000000000406b0
Jun 25 16:46:27 pouic kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Jun 25 16:46:27 pouic kernel: DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Jun 25 16:46:27 pouic kernel: Process imap (pid: 4578, threadinfo ffff88010b2010c0, task ffff88010b200cf0)
Jun 25 16:46:27 pouic kernel: Stack:
Jun 25 16:46:27 pouic kernel: ffff8800ea7570a8 ffff8800ed8ac7c8 ffff8800eb3ade00 ffff8800eb02b9c8
Jun 25 16:46:27 pouic kernel: ffff8800edba9d78 ffffffff81181e19 ffff8800eb3ade00 ffff8800ed59c468
Jun 25 16:46:27 pouic kernel: ffffffff810f6290 ffff880106664000 ffff8800edba9e18 ffffffff811821e7
Jun 25 16:46:27 pouic kernel: Call Trace:
Jun 25 16:46:27 pouic kernel: [<ffffffff81181e19>] free_rb_tree_fname+0x59/0xd0
Jun 25 16:46:27 pouic kernel: [<ffffffff810f6290>] ? filldir64+0x2b0/0x2b0
Jun 25 16:46:27 pouic kernel: [<ffffffff811821e7>] ext4_readdir+0xe7/0x5b0
Jun 25 16:46:27 pouic kernel: [<ffffffff810f6290>] ? filldir64+0x2b0/0x2b0
Jun 25 16:46:27 pouic kernel: [<ffffffff810f6290>] ? filldir64+0x2b0/0x2b0
Jun 25 16:46:27 pouic kernel: [<ffffffff810f684d>] vfs_readdir+0xcd/0x100
Jun 25 16:46:27 pouic kernel: [<ffffffff810f69eb>] sys_getdents+0xdb/0x1d0
Jun 25 16:46:27 pouic kernel: [<ffffffff81565910>] system_call_fastpath+0x18/0x1d


The imap process (a Dovecot server) is running inside a vserver instance.

How can I help to debug?


Re: refcount overflow detected in ext4 ?

PostPosted: Tue Jun 25, 2013 12:21 pm
by PaX Team
can you also upload vmlinux (the one in the build root dir) please?

Re: refcount overflow detected in ext4 ?

PostPosted: Tue Jun 25, 2013 12:36 pm
by nbareil
Sure, here it is: http://dl.free.fr/pO5bE4v1V

Thanks!

Re: refcount overflow detected in ext4 ?

PostPosted: Tue Jun 25, 2013 1:05 pm
by PaX Team
i think this will be some bug in vserver itself. the refcount overflow is detected in mm/slab.c:__cache_free() on vx_slab_free(cachep) which in turn does a
Code: Select all
atomic_sub(cachep->buffer_size, &vxi->cacct.slab[what]);
based on the asm+register dump, buffer_size was 0x80:
Code: Select all
ffffffff810d536a:       f0 29 b4 82 10 0b 00 00         lock sub %esi,0xb10(%rdx,%rax,4)
which then triggered the overflow detection logic inserted by the refcount protection. now whether the problem is that the atomic_t type is too small to reliably store whatever cacct.slab[] is collecting, or there's some asymmetry (read: bug) somewhere, i don't know, but you should definitely let the vserver folks know about this as it's not the usual false positive.
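
for illustration, here's a rough userspace sketch of the accounting pattern described above (the struct, field and function names are hypothetical, not taken from the actual vserver source), showing how a 32-bit counter can wrap when the alloc/free accounting is asymmetric or the totals get too large; PaX REFCOUNT instruments the atomic add/sub and kills the offending task when it detects such a wrap:
Code: Select all
/* sketch only: illustrative per-context slab byte accounting,
 * not the real vserver code */
#include <stdatomic.h>
#include <stdio.h>

struct vx_cacct {
    /* 32-bit signed counter: bytes currently charged to this context */
    atomic_int slab_bytes;
};

/* charged on every allocation from the cache */
static void vx_slab_alloc(struct vx_cacct *cacct, int buffer_size)
{
    atomic_fetch_add(&cacct->slab_bytes, buffer_size);
}

/* uncharged on every free; mirrors the atomic_sub() seen in the dump */
static void vx_slab_free(struct vx_cacct *cacct, int buffer_size)
{
    atomic_fetch_sub(&cacct->slab_bytes, buffer_size);
}

int main(void)
{
    struct vx_cacct cacct = { .slab_bytes = 0 };

    /* if a free is charged to the wrong context (or the running total
     * exceeds what a 32-bit atomic_t can hold), the counter wraps;
     * the refcount protection turns that wrap into a detected overflow */
    vx_slab_free(&cacct, 0x80);   /* free with no matching alloc */
    printf("slab_bytes = %d\n", atomic_load(&cacct.slab_bytes));  /* -128 */
    return 0;
}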

Re: refcount overflow detected in ext4 ?

PostPosted: Tue Jun 25, 2013 5:45 pm
by spender
I've reported this to the vserver developers and created patches to fix this and an information leak I found in the process. Both will be fixed in my next patch and in the next vserver patches upstream.

-Brad