Checkpoint Crash: CPU stuck

How Checkpoint crashed with log: BUG: soft lockup - CPU#0 stuck for 10s! [vpnd]


1. Device died during policy installation


2. Message from console cable:

BUG: soft lockup - CPU#0 stuck for 10s! [vpnd:xxxx]

Pid: xxxx , comm:                 vpnd
EIP: 0060 CPU: 0
EIP is at fwlock_sharing_master_lock+0x38/0x2b0 [fw_0]
EFLAGS: 00000286    Tainted: P       (2.6.18-92cp #1)
EAX: cce30000 EBX: f3a57540 ECX: cce33ef8 EDX: ab77e7c0
ESI: 00000001 EDI: c0187ae6 EBP: cce33ef8 DS: 007b ES: 007b
CR0: 8005003b CR2: 77e9a004 CR3: 6f9a9200 CR4: 000006f0
fwdrv_ioctl_wrapper_do+0x7c/0x120 [fw_0]
fwdrv_ioctl_wrapper+0x2b/0x60 [fw_0]
cpkiomux_ioctl+0x3e/0x90 [fw_0]
fw__ioctl+0x84/0x200 [fw_0]
cpdrv_unlocked_ioctl+0x18/0x20 [fw_0]
cpdrv_unlocked_ioctl+0x0/0x20 [fw_0]
do_ioctl+0x2b/0x90
vfs_ioctl+0x5c/0x2b0
sys_ioctl+0x72/0x90
syscall_call+0x7/0xb
=======================


Unit stucked with this message. Device lost control with management station and with cluster member.


3. Cluster switched to secondary device

Unfortunatelly device stoped to process traffic.

4. Logs from secondary firewall


kernel: [fw4_0]; Assertion 0 failed in kiss_handles.c:557
kernel: [fw4_0];fwhandle_get: Invalid handle
kernel: [fw4_0];fwhandle_get(fwmspi.c:2977): Table kbufs - Invalid handle 237480f (bad entry)
kernel: [fw4_0];Assertion 0 failed in kiss_handles.c:557
kernel: [fw4_0];fwhandle_get: Invalid handle
kernel: [fw4_0];fwhandle_get(fwmspi.c:2977): Table kbufs - Invalid handle 237480f (bad entry)
kernel: [fw4_0];Assertion 0 failed in kiss_handles.c:557
kernel: [fw4_0];fwhandle_get: Invalid handle
kernel: [fw4_0];fwhandle_get(fwmspi.c:2977): Table kbufs - Invalid handle 237480f (bad entry)



kernel: [fw4_0]; fwhandle_get: Invalid handle
kernel: [fw4_0]; fwkbuf_leak_log_action: Failed to log action remove from table (00000169) on kbuf 2f1f006
kernel: [fw4_0];fwhandle_get(hashlong.c:1379): Table kbufs - Invalid handle 2f1f006 (bad entry)
kernel: [fw4_0];Assertion 0 failed in kiss_handles.c:557
kernel: [fw4_0];fwhandle_get: Invalid handle
kernel: [fw4_0]; FW-1: fwkbuf_free_with_destr_(hashlong.c 1379): kbuf id is not found: 2f1f006



kernel: [fw4_0];hi_kbufs_free_all: Error freeing kbuf 2f1f006 at index 0 in table MSPI_cluster_update(361)
kernel: [fw4_0];hi_kbufs_free_all: Table entry: <660 : 0 361>
kernel: [fw4_0];fwhandle_get(fwbuf.c:1364): Table kbufs - Invalid handle 2328000 - entry used for handle df328000 with value e75fcdf4
kernel: [fw4_0];Assertion 0 failed in kiss_handles.c:557
kernel: [fw4_0];fwhandle_get: Invalid handle - verifier


Seems to Checkpoint was affected with some bug and was unable to maintain firewall session table.

Solution

1. Unit with stucked cpu had to be rebooted manually by pulling plug off and on. Unit took master role and started to process traffic.
2. Unit with affected session table was rebooted.
3. Cluster failover was tested with success
4. Cluster was upgraded to 77.30 (previously in 77.20).
5. Cluster failover was tested with success