Checkpoint Crash: Exit routed

How Checkpoint crashed with log: Exit routed version routed-06.10.2014-23:11:13


1. Device died during policy installation

For 30 seconds there was no master firewall in cluster. That means that traffic was not processed. This is due extended timeouts. During that time there is no cluster member monitoring. This is "standard" checkpoint behavior.


2. Secondary device took master role

3. After 2 min. primary device came up and took master role

4. Logs from primary firewall

Some kind of memory leak:

kernel: [fw4_1];FW-1: [freeze_on_remote] freeze state on remote member 1 has changed from 0 to 1
kernel: [fw4_0];fwhandle_get(sas.c:844): Table kbufs - Invalid handle 303d007 (bad entry)
kernel: [fw4_0];Assertion 0 failed in kiss_handles.c:557
kernel: [fw4_0];fwhandle_get: Invalid handle
kernel: [fw4_0];fwhandle_get(fwbuf.c:1364): Table kbufs - Invalid handle 303d007 (bad entry)
kernel: [fw4_0];Assertion 0 failed in kiss_handles.c:557
kernel: [fw4_0];fwhandle_get: Invalid handle
kernel: [fw4_0]; fwkbuf_leak_log_action: Failed to log action remove from table (000000000000017a) on kbuf 303d007
kernel: [fw4_0];fwhandle_get(hashlong.c:1379): Table kbufs - Invalid handle 303d007 (bad entry)
kernel: [fw4_0];Assertion 0 failed in kiss_handles.c:557


routed daemon was restated:

routed[12580]: Exit routed[12580] version routed-06.10.2014-23:11:13
pm[12245]: Reaped: routed[12580]
pm[12245]: Scheduled routed for +4 secs
routed[12590]: Exit routed[12590] version routed-06.10.2014-23:11:13
pm[12245]: Restarted /bin/routed[13495], count=2
routed[13495]: rt_instance_init: routed manager id -1 initialized itself
routed[13495]: parse_instance_only: my_instance_id -1 parsing instance default
routed[13495]: Commence routing updates
routed[13501]: vrrp_set_fw: Command is fw_is_running_vrrp and value is 0
kernel: [fw4_0];Global param: set int fw_is_running_vrrp to '0'
kernel: [fw4_1];Global param: set int fw_is_running_vrrp to '0'
kernel: [fw4_2];Global param: set int fw_is_running_vrrp to '0'
kernel: [fw4_3];Global param: set int fw_is_running_vrrp to '0'
kernel: [fw4_4];Global param: set int fw_is_running_vrrp to '0'
kernel: [fw4_5];Global param: set int fw_is_running_vrrp to '0'
kernel: [fw4_6];Global param: set int fw_is_running_vrrp to '0'
kernel: [fw4_7];Global param: set int fw_is_running_vrrp to '0'
kernel: [fw4_8];Global param: set int fw_is_running_vrrp to '0'
kernel: [fw4_9];Global param: set int fw_is_running_vrrp to '0'
routed[13501]: vrrp_set_fw: Command is fw_is_running_on_cbs and value is 0
kernel: [fw4_0];Global param: set int fw_is_running_on_cbs to '0'
kernel: [fw4_1];Global param: set int fw_is_running_on_cbs to '0'
routed[13501]: vrrp_set_fw: Command is fwha_cbs_which_member_is_running_gated and value is 0
kernel: [fw4_2];Global param: set int fw_is_running_on_cbs to '0'
kernel: [fw4_3];Global param: set int fw_is_running_on_cbs to '0'
kernel: [fw4_4];Global param: set int fw_is_running_on_cbs to '0'
kernel: [fw4_5];Global param: set int fw_is_running_on_cbs to '0'
kernel: [fw4_6];Global param: set int fw_is_running_on_cbs to '0'
kernel: [fw4_7];Global param: set int fw_is_running_on_cbs to '0'
kernel: [fw4_8];Global param: set int fw_is_running_on_cbs to '0'
kernel: [fw4_9];Global param: set int fw_is_running_on_cbs to '0'
kernel: [fw4_0];Global param: set int fwha_cbs_which_member_is_running_gated to '0'
kernel: [fw4_1];Global param: set int fwha_cbs_which_member_is_running_gated to '0'
kernel: [fw4_1];FW-1: fwha_validate_member_running_gated: 0 is not a valid member id
kernel: [fw4_2];Global param: set int fwha_cbs_which_member_is_running_gated to '0'
kernel: [fw4_3];Global param: set int fwha_cbs_which_member_is_running_gated to '0'
kernel: [fw4_4];Global param: set int fwha_cbs_which_member_is_running_gated to '0'
kernel: [fw4_5];Global param: set int fwha_cbs_which_member_is_running_gated to '0'
kernel: [fw4_6];Global param: set int fwha_cbs_which_member_is_running_gated to '0'
kernel: [fw4_7];Global param: set int fwha_cbs_which_member_is_running_gated to '0'
kernel: [fw4_8];Global param: set int fwha_cbs_which_member_is_running_gated to '0'
kernel: [fw4_9];Global param: set int fwha_cbs_which_member_is_running_gated to '0'
kernel: [fw4_0];Global param: set int fwx_cluster_hide_for_dynamic_routing to '1'
kernel: [fw4_1];Global param: set int fwx_cluster_hide_for_dynamic_routing to '1'
kernel: [fw4_2];Global param: set int fwx_cluster_hide_for_dynamic_routing to '1'
kernel: [fw4_3];Global param: set int fwx_cluster_hide_for_dynamic_routing to '1'
kernel: [fw4_4];Global param: set int fwx_cluster_hide_for_dynamic_routing to '1'
kernel: [fw4_5];Global param: set int fwx_cluster_hide_for_dynamic_routing to '1'
kernel: [fw4_6];Global param: set int fwx_cluster_hide_for_dynamic_routing to '1'
kernel: [fw4_7];Global param: set int fwx_cluster_hide_for_dynamic_routing to '1'
kernel: [fw4_8];Global param: set int fwx_cluster_hide_for_dynamic_routing to '1'
kernel: [fw4_9];Global param: set int fwx_cluster_hide_for_dynamic_routing to '1'
routed[13501]: Commence routing updates


Cluster rebuilt:

kernel: [fw4_1];fwha_report_id_problem_status: Try to update state to FAILURE due to pnote Problem Notification (desc Synchronization)
kernel: [fw4_1];FW-1: fwha_update_local_state: Local machine state changed to FAILURE
kernel: [fw4_1];fwha_init_state: Update state of member id 1 to ACTIVE cause we are in all active mode
kernel: [fw4_1];FW-1: fwha_update_state: ID 1 (state FAILURE -> ACTIVE) (time 58.3)
kernel: [fw4_1];fwha_state_change_implied: Try to update state to ACTIVE because member is down and state might should be changed
kernel: [fw4_1];FW-1: fwha_update_local_state: Local machine state changed to FAILURE
kernel: [fw4_1];fwha_state_change_implied: Try to update state to ACTIVE because member is down and state might should be changed
kernel: [fw4_1];FW-1: fwha_update_local_state: Local machine state changed to FAILURE
kernel: Passive ARP successfully initialized!
kernel: [fw4_1];fwha_report_id_problem_status: Try to update state to ACTIVE due to pnote Problem Notification (desc Synchronization)
kernel: [fw4_1];FW-1: fwha_set_new_local_state: Setting state of fwha_local_id(0) to READY
kernel: [fw4_1];FW-1: fwha_update_local_state: Local machine state changed to READY
kernel: [fw4_1];FW-1: fwha_process_state_msg: Update state of member id 1 to STANDBY due to the member report message
kernel: [fw4_1];FW-1: fwha_update_state: ID 1 (state ACTIVE -> STANDBY) (time 97.0)
kernel: [fw4_1];FW-1: fwha_state_change_implied: Try to update local state from READY to ACTIVE because all other machines confirmed my READY state
kernel: [fw4_1];FW-1: fwha_set_new_local_state: Setting state of fwha_local_id(0) to ACTIVE
kernel: [fw4_1];FW-1: fwha_update_local_state: Local machine state changed to ACTIVE
routed[13501]: entering cpcl_vrf_master_init()
routed[13501]: leaving cpcl_master_init()
routed[13501]: cpcl_vrf_master_listen_accept(5479): entering cpcl_vrf_master_listen_accept
routed[13501]: cpcl_vrf_master_listen_accept(5565): leaving cpcl_vrf_master_listen_accept
routed[13501]: cpcl_vrf_recv_from_instance_manager(5304): instance 0 entering cpcl_vrf_recv_from_instance_manager
routed[13501]: cpcl_vrf_recv_from_instance_manager(5361): instance 0 received fd 26
routed[13501]: cpcl_vrf_recv_from_instance_manager(5452): instance 0 leaving cpcl_vrf_recv_from_instance_manager
kernel: [fw4_1];FW-1: [cul_load_freeze][CUL - Cluster] Setting CUL FREEZE_ON, high kernel CPU usage (82%) on local Member 0, threshold = 80%
kernel: [fw4_1];fwha_hp_periodic_run: Policy has ended 120 seconds ago. Returning to regular timeouts


Solution

1. Non action was taken after incident. Cluster rebuilt itself
2. Both firewalls was rebooted
3. Cluster was upgraded to 77.30 (previously in 77.20).
4. Cluster failover was tested with success