A PaX-enabled kernel is crashing my system every night when the DSL provider resets the IPv4 connection.
I have an IPv6 tunnel to Hurricane Electric terminating inside a virtual machine running Gentoo Hardened. When the IPv4 connection is reset, the system with the IPv6 tunnel panics. If I start a ping to any IPv6 address and reset the PPPoE daemon on the IPv4 internet firewall, I can reproduce an instant panic every time.
I found the same problem analyzed here: http://forums.gentoo.org/viewtopic-t-1003804.html with instructions to report it to the PaX team, but no evidence that this has happened so far.
Digging around a bit, I was able to trace the panic to the size overflow protection in _decode_session6 for skb_network_header_len, and could even avoid it by disabling the size overflow protection for skb_network_header_len.
(This is probably related to thread https://forums.grsecurity.net/viewtopic.php?f=1&t=4033 and I used the instructions there for debugging.)
So what happens when the IPv4 connection is reset while I have a ping6 running over the IPv6 tunnel?
There is no PaX or other message at all; the system simply panics:
Kernel panic - not syncing: Aiee, killing interrupt handler!
CPU: 1 PID: 0 Comm: swapper/1 Not tainted 3.17.2-hardened-r1 #1
Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
0000000000000000 0000000000000000 ffff88011fc437e8 ffffffff91489ba4
ffffffff915ff982 ffff88011fc43868 ffffffff914860df 0000000000000008
ffff88011fc43878 ffff88011fc43810 0000000000000000 0000000000000000
Call Trace:
<IRQ> [<ffffffff91489ba4>] dump_stack+0x45/0x5c
[<ffffffff914860df>] panic+0xc8/0x211
[<ffffffff9103ce98>] do_exit+0xad/0x911
[<ffffffff91005047>] ? show_stack_log_lvl+0x104/0x119
[<ffffffff9103e594>] do_group_exit+0x4b/0x10d
[<ffffffff91100776>] report_size_overflow+0x41/0x41
[<ffffffff9146c43d>] _decode_session6+0x198/0x30e
[<ffffffff914310cf>] __xfrm_decode_session+0x41/0x58
[<ffffffff9145d71e>] icmpv6_route_lookup+0xcc/0x152
[<ffffffff91452255>] ? fib6_rule_lookup+0x37/0x45
[<ffffffff9145e0c1>] icmp6_send+0x5b1/0x7a7
[<ffffffff9148e7d1>] ? _raw_read_unlock_bh+0x28/0x31
[<ffffffff9144d73d>] ? ip6_pol_route_lookup+0x19d/0x1b5
[<ffffffff91452255>] ? fib6_rule_lookup+0x37/0x45
[<ffffffff9147693a>] icmpv6_send+0x40/0x4d
[<ffffffff91474409>] ipip6_err+0x20c/0x27c
[<ffffffff9142d1d7>] tunnel64_err+0x36/0x4f
[<ffffffff914150bf>] icmp_socket_deliver+0xc1/0xce
[<ffffffff91415329>] icmp_unreach+0x1cf/0x1ee
[<ffffffff91415f5f>] icmp_rcv+0x1c4/0x374
[<ffffffff913e6e79>] ip_local_deliver_finish+0x11b/0x1f2
[<ffffffff913e70bf>] ip_local_deliver+0x7c/0x86
[<ffffffff913e6d16>] ip_rcv_finish+0x28f/0x2d7
[<ffffffff913e73b8>] ip_rcv+0x2ef/0x361
[<ffffffff913af501>] __netif_receive_skb_core+0x628/0x677
[<ffffffff913af56c>] __netif_receive_skb+0x1c/0x71
[<ffffffff913af5fe>] netif_receive_skb_internal+0x3d/0x78
[<ffffffff913af64b>] netif_receive_skb+0x12/0x1a
[<ffffffff91365642>] virtnet_receive+0x646/0x6a3
[<ffffffff913656c7>] virtnet_poll+0x28/0x98
[<ffffffff913afca1>] net_rx_action+0x120/0x231
[<ffffffff9103f0f0>] __do_softirq+0x10e/0x213
[<ffffffff9103f39b>] irq_exit+0x40/0x8a
[<ffffffff91004581>] do_IRQ+0xc1/0xe0
[<ffffffff91490386>] common_interrupt+0x86/0x86
<EOI> [<ffffffff9100ad8c>] ? sched_clock+0x9/0x13
[<ffffffff9100ba5f>] ? hard_enable_TSC+0x21/0x21
[<ffffffff9102ef00>] ? native_safe_halt+0x6/0xe
[<ffffffff9100ba68>] default_idle+0x9/0x13
[<ffffffff9100c1ea>] arch_cpu_idle+0x17/0x1f
[<ffffffff9106b684>] cpu_startup_entry+0x107/0x201
[<ffffffff910284ed>] ? lapic_resume+0x2ae/0x2ae
[<ffffffff910267c5>] start_secondary+0x22f/0x23a
Kernel Offset: 0x10000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
---[ end Kernel panic - not syncing: Aiee, killing interrupt handler!
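To make the path through the trace easier to follow, here is a minimal sketch that extracts the function names from a few of the frames above and prints them in call order. It assumes the usual convention that frames marked with "?" are unverified stack entries and can be skipped; the regular expression and the helper name reliable_frames are only illustrative.

```python
import re

# A subset of the frames copied from the trace above, innermost first.
TRACE = """\
[<ffffffff91100776>] report_size_overflow+0x41/0x41
[<ffffffff9146c43d>] _decode_session6+0x198/0x30e
[<ffffffff914310cf>] __xfrm_decode_session+0x41/0x58
[<ffffffff9145d71e>] icmpv6_route_lookup+0xcc/0x152
[<ffffffff91452255>] ? fib6_rule_lookup+0x37/0x45
[<ffffffff9145e0c1>] icmp6_send+0x5b1/0x7a7
[<ffffffff9147693a>] icmpv6_send+0x40/0x4d
[<ffffffff91474409>] ipip6_err+0x20c/0x27c
[<ffffffff9142d1d7>] tunnel64_err+0x36/0x4f
[<ffffffff914150bf>] icmp_socket_deliver+0xc1/0xce
[<ffffffff91415329>] icmp_unreach+0x1cf/0x1ee
[<ffffffff91415f5f>] icmp_rcv+0x1c4/0x374
"""

# [<address>] optional "? " marker, then symbol+offset/size
FRAME = re.compile(
    r"\[<[0-9a-f]+>\]\s+(\?\s+)?([A-Za-z0-9_.]+)\+0x[0-9a-f]+/0x[0-9a-f]+"
)


def reliable_frames(trace):
    """Return symbol names of frames that are not marked with '?'."""
    frames = []
    for line in trace.splitlines():
        match = FRAME.search(line)
        if match and match.group(1) is None:
            frames.append(match.group(2))
    return frames


# The trace lists the innermost frame first; reverse it to get call order.
call_path = list(reversed(reliable_frames(TRACE)))
print(" -> ".join(call_path))
```

Read this way, an incoming ICMPv4 unreachable message (icmp_rcv, icmp_unreach) is handed to the tunnel error handler (tunnel64_err, ipip6_err), which tries to send an ICMPv6 error (icmpv6_send) and ends up in _decode_session6, where the size overflow check fires.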
I'm running Gentoo Hardened and have reproduced the issue with "linux-3.15.10-hardened-r1" and "linux-3.17.2-hardened-r1_fixed". The latter uses 4420_grsecurity-3.0-3.17.2-201411062034.patch.
With the debug code from ephox and the size overflow protection for skb_network_header_len disabled (to prevent the kernel panic so I can get the debug printk messages), I get this in dmesg instead of the kernel panic:
[ 541.338918] PAX _decode_session6: transport_header: 76, network_header: 4e
[ 541.385364] PAX _decode_session6: transport_header: 62, network_header: 7e
[ 542.387031] PAX _decode_session6: transport_header: 62, network_header: 7e
[ 543.389107] PAX _decode_session6: transport_header: 62, network_header: 7e
[ 544.390073] PAX _decode_session6: transport_header: 62, network_header: 7e
[ 545.391771] PAX _decode_session6: transport_header: 62, network_header: 7e
[ 546.393425] PAX _decode_session6: transport_header: 62, network_header: 7e
[ 547.394839] PAX _decode_session6: transport_header: 62, network_header: 7e
(To prevent the kernel panic I removed "skb_network_header_len" from "size_overflow_hash.data".)
It looks like _decode_session6 is called for each ping or, more likely, for each response:
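A minimal sketch of the arithmetic behind these numbers, under two assumptions that the debug output itself does not confirm: that the printed values are hexadecimal offsets, and that skb_network_header_len() computes transport_header minus network_header and returns the result as an unsigned 32-bit value. The Python function below is only a stand-in for that subtraction, not kernel code.

```python
def network_header_len(transport_header, network_header):
    """Return the difference as a signed value and as it would look
    after conversion to an unsigned 32-bit integer (assumed return type)."""
    signed = transport_header - network_header
    as_u32 = signed & 0xFFFFFFFF
    return signed, as_u32


# (transport_header, network_header) pairs from the dmesg lines above,
# interpreted as hexadecimal (assumption).
samples = [(0x76, 0x4E), (0x62, 0x7E)]

for transport_header, network_header in samples:
    signed, as_u32 = network_header_len(transport_header, network_header)
    print(
        f"transport=0x{transport_header:x} network=0x{network_header:x} "
        f"signed={signed} u32=0x{as_u32:08x}"
    )
```

Under these assumptions the first line gives 40, which matches the size of a fixed IPv6 header, while the repeated lines give -28, which wraps to 0xffffffe4 as an unsigned value. A transport header that sits before the network header would be the kind of wrap-around the size overflow protection is meant to catch.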
Running tcpdump, I see IPv4 packets in reply to the IPv6 pings. The first one is an ICMP type 3 code 13 (Communication administratively filtered) with a length of 94 bytes. For every following IPv6 ping sent out by the tunnel interface there is an ICMP type 3 code 3 (Port unreachable) with a length of 166 bytes. (Lengths include the Ethernet headers.)
Looking at the "translated" ICMPv6 messages, the first IPv4 ICMP reply is indeed lost in translation; there is no corresponding IPv6 message. But the first IPv6 ping that gets a port unreachable response receives two ICMPv6 type 1 code 3 (Address unreachable) replies, about 45 ms apart in the capture I have. (The sequence number makes that quite clear.)
I now assume that there is a bug in the Linux kernel that somehow mangles the "Communication administratively filtered" ICMPv4 packet and thereby triggers the PaX check.
So, any idea how to debug this further or get it fixed? I'm way out of my depth here and surprised I got this far at all...
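As a rough cross-check of those two lengths, the sketch below works out how much of the original tunnelled packet each ICMP error could be quoting. The header sizes are assumptions, not values read from the capture: a 14-byte untagged Ethernet header, 20-byte IPv4 headers without options (both the outer one and the quoted one), and an 8-byte ICMP header.

```python
ETH_HDR = 14   # assumed: untagged Ethernet II header
IPV4_HDR = 20  # assumed: IPv4 header without options
ICMP_HDR = 8   # ICMPv4 destination-unreachable header
IPV6_HDR = 40  # fixed IPv6 header size


def quoted_bytes(frame_len):
    """Bytes of the original datagram quoted inside the ICMP error."""
    return frame_len - ETH_HDR - IPV4_HDR - ICMP_HDR


def quoted_ipv6_bytes(frame_len):
    """Bytes left for the tunnelled IPv6 packet after the quoted IPv4 header."""
    return quoted_bytes(frame_len) - IPV4_HDR


for name, frame_len in [("type 3 code 13", 94), ("type 3 code 3", 166)]:
    inner = quoted_ipv6_bytes(frame_len)
    print(
        f"{name}: frame={frame_len} quoted={quoted_bytes(frame_len)} "
        f"ipv6_part={inner} full_ipv6_header={inner >= IPV6_HDR}"
    )
```

With these assumed header sizes, the 94-byte message would carry only 32 bytes of the tunnelled IPv6 packet, which is less than a full IPv6 header, while the 166-byte message would carry 104 bytes. Whether that difference has anything to do with the panic is not something the capture alone establishes.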