Network: Intermittent Packet Loss
Incident Report for 23M
Postmortem

Rough outline of the issue

On 27.02.2024 at about 09:35 UTC we noticed a large number of problem reports about packet loss in our network. A quick investigation showed that the rpd (routing protocol daemon) on one of our redundant gateway routers had started consuming an excessive amount of RAM. Rebooting the redundant gateways resolved the issue only for a couple of hours, and even after further debugging we could not find any cause for this on our side.
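
As an illustration of how such a memory rise can be spotted automatically, the sketch below polls the routing-engine memory utilization of a Junos device via Juniper's PyEZ library. This is not our actual monitoring setup: the host name, credentials and alert threshold are placeholders, and the exact XML field names can vary by platform and Junos version.

    # Minimal sketch: poll routing-engine memory utilization on a Junos device
    # using Juniper's PyEZ library (pip install junos-eznc).
    # Host, credentials and threshold are placeholders, not production values.
    from jnpr.junos import Device

    ALERT_THRESHOLD = 80  # percent memory utilization that triggers an alert (assumed value)

    with Device(host="gw1.example.net", user="monitor", passwd="secret") as dev:
        # Equivalent to "show chassis routing-engine"
        info = dev.rpc.get_route_engine_information()
        for re_entry in info.findall(".//route-engine"):
            slot = re_entry.findtext("slot", default="0")
            # Field names may differ slightly between platforms/Junos versions.
            mem = re_entry.findtext("memory-buffer-utilization")
            if mem is not None and int(mem) >= ALERT_THRESHOLD:
                print(f"RE{slot}: memory utilization {mem}% exceeds {ALERT_THRESHOLD}%")
            else:
                print(f"RE{slot}: memory utilization {mem}%")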

Here is a graph from our monitoring system showing the rapid rise in RAM usage:

As we could not resolve the situation on our own and suspected a software bug (a memory leak), we contacted Juniper support. After an initial investigation with Juniper we deactivated the second gateway and the situation normalized. The network was stable again, but we were missing a critical element in our backbone.

Over the following days and nights we had to gather debug information from the device in its faulty state. For this, we had to bring the device back into the problematic state and collect several log files and other debug data.

After several debug sessions, Juniper finally found a bug related to our problem (Link to the PR, which was just released today). In an EVPN-MPLS setup, a single MAC address with more than 300 IP addresses bound to it can cause high CPU and RAM load on the system. In our case this led to packet loss.
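
Finding out which MAC addresses exceed this limit only requires counting the IP bindings per MAC. The sketch below is a minimal illustration and assumes a whitespace-separated text export of the MAC-to-IP bindings (MAC in the first column, IP in the second); the file name and format are placeholders, while the 300-address threshold is the one from the PR.

    # Minimal sketch: count IP addresses per MAC from a text export of
    # MAC-to-IP bindings (e.g. dumped from the EVPN database or ARP/ND tables).
    # The file name and column layout (MAC in column 1, IP in column 2) are
    # assumptions about the export format.
    from collections import defaultdict

    THRESHOLD = 300  # per the PR, more than 300 IPs on one MAC triggers the bug

    ips_per_mac = defaultdict(set)

    with open("mac_ip_bindings.txt") as f:
        for line in f:
            fields = line.split()
            if len(fields) < 2:
                continue  # skip headers or blank lines
            mac, ip = fields[0].lower(), fields[1]
            ips_per_mac[mac].add(ip)

    # Report MACs over the threshold, largest first
    for mac, ips in sorted(ips_per_mac.items(), key=lambda kv: len(kv[1]), reverse=True):
        if len(ips) > THRESHOLD:
            print(f"{mac}: {len(ips)} IP addresses bound - affected, contact customer")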

In order to solve the problem we requested an action plan from Juniper. We had to:

  • contact the customers with more than 300 IP addresses bound to a single MAC address
  • have them reduce the number of bound IP addresses
  • gather further debug information at FPC level

We carried out the action plan immediately and were able to reintegrate the problematic gateway router into our network on 11.03.2024.

Implementation of the final fix

On 15.03.2024 we received the information from Juniper that the issue had been fixed, along with a list of software versions containing the fix. We scheduled a maintenance window for 18.03.2024 at 23:00 UTC.

Unfortunately, we had to bring the maintenance forward as the CPU load started to rise over the weekend. The final fix was implemented on 18.03.2024 between 10:00 and 12:00 UTC.

Here is a graph from our monitoring system showing the rise in CPU load:

Lessons Learned

Even though the PR was classified as minor, this incident shows that a minor bug can have a huge impact on a network. In our case we were hit by a software bug that we could not fix on our own - our hands were tied.

During this incident we noticed one point in particular: we will further optimize our communication with our customers, as it still has room for improvement.

Posted Mar 21, 2024 - 11:24 UTC

Resolved
We are not seeing any problems recurring. We are setting this issue to resolved and will continue to closely monitor the situation.

We are still in touch with Juniper to put together all their findings. We will send out a postmortem with further details this week.
Posted Mar 11, 2024 - 12:59 UTC
Monitoring
Following further information from Juniper, we have reintegrated the other gateway router into our network. The situation has been stable for 30 minutes and we do not see any of the problematic circumstances we saw before. We will continue to monitor the situation in detail.
Posted Mar 11, 2024 - 11:04 UTC
Update
We have received another update from Juniper. In addition to the existing PR1782710, another PR has been raised (PR1764487) which will introduce a new configuration knob to finally solve this problem.

In the meantime, we have made the adjustments with some customers that we were asked to make. We plan to reintegrate the other gateway in the coming week.
Posted Mar 08, 2024 - 11:18 UTC
Update
We had another call with Juniper this morning. The problem has now been clearly identified and is tracked under PR1782710, which is not yet publicly available. We have not yet seen any details from the PR, but we do have a brief explanation:
In EVPN-MPLS networks where many IP addresses are bound to a single MAC address, the rpd may get stuck under certain circumstances. This only happens when all-active multihoming is used and two or more PE devices are active. The problem is related to the l2ald process.

We now have an action plan and will contact certain customers who have been identified as possible triggers of this problem. A long-term fix will be provided by Juniper later, as their engineering team is now also involved.

In its current state the network is stable and we do not expect any issues.
Posted Mar 07, 2024 - 10:36 UTC
Update
We have received a new reply from Juniper. Our symptoms and the logs collected last night point to PR1782710. The PR is not public and we have not yet received further information. Engineering is currently cross-checking our issue.
Posted Mar 06, 2024 - 13:39 UTC
Update
We have received some further information about our issue. The live core dumps from our equipment do not point to any existing PR. We will have to schedule another debugging session for Tuesday to Wednesday night. During this timeframe we will have to bring our equipment back into the original error state. You will notice this as packet loss will occur. The debugging session will be announced separately.
Posted Mar 04, 2024 - 09:17 UTC
Update
According to the latest information, Juniper ATAC will further analyze the issue over the weekend. We expect a major update regarding this case on Monday.
Posted Mar 01, 2024 - 16:38 UTC
Update
We are currently waiting for the new case owner from Juniper ATAC. As soon as we receive new information, we will pass it on.
Posted Mar 01, 2024 - 13:07 UTC
Update
Juniper is still debugging the issue. Their last update points towards a software issue/bug within one of our routing instances used for EVPN.
Posted Mar 01, 2024 - 07:35 UTC
Update
We have provided further debug information upon request. So far it points to a memory leak in the rpd process. We will update again once we receive new information.
Posted Feb 29, 2024 - 18:35 UTC
Update
We have received some further information from Juniper. They are still debugging and checking for existing PRs.
Posted Feb 29, 2024 - 15:51 UTC
Update
We are continuing to work on a fix for this issue.
Posted Feb 29, 2024 - 11:12 UTC
Update
We have received an update from Juniper. They are still checking the debug information and will soon start to analyze the core dumps taken from rpd. They suspect that we have hit a serious memory leak.

Until the situation is resolved, we will not be able to activate any new customer services that rely on network changes to our platform.

The situation is stable.
Posted Feb 29, 2024 - 11:11 UTC
Update
Last night the vendor collected all necessary debug information. Due to the complexity of this issue they could not give us further details yet and will have to analyze all the collected data. Traffic remains re-routed until we receive further information. We do not expect any further outages.
Posted Feb 29, 2024 - 08:10 UTC
Update
Debugging with the vendor will resume at 23:00 UTC.
Despite our best efforts it will not be possible to avoid short periods of further packet loss. We apologize for any inconvenience.

Until then, traffic will remain re-routed and the network is stable.
Posted Feb 28, 2024 - 10:53 UTC
Update
We will reactivate the affected gateway router with a different configuration.
If the problem persists, packet loss could occur again.
Posted Feb 28, 2024 - 08:17 UTC
Update
We are working with the vendor to debug the issue; debugging will resume tomorrow.
Traffic is currently re-routed and you should no longer see any disruption.
Posted Feb 27, 2024 - 14:18 UTC
Update
We are continuing to work on a fix for this issue.
Posted Feb 27, 2024 - 11:23 UTC
Identified
We are investigating reports of intermittent packet loss to some IPs in our network and have identified high memory usage on one of our gateway routers as the source.
We will be re-routing traffic until the cause can be fixed.
Posted Feb 27, 2024 - 11:10 UTC
This incident affected: FRA01 - Telehouse (Network).