How BGP Routing Changes Can Impact Online Application Performance

While monitoring online banking services, we came across an interesting event that proves a point: if you really want to understand why things break, it’s not enough to look at what’s happening on the surface of the application layer.

Application View

For this test, we begin in the application view, shown below. We see a dip in availability from various locations around the world, with red circles marking the regions where availability issues occurred. To understand what happened and why, we need to dig into some of the other views.

Figure 1: Availability of online banking services drops to 64%.
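
To make the idea concrete, here is a minimal sketch of the kind of application-layer availability check behind this view. The target URL and probe count are placeholders; a monitoring platform would run checks like this from many geographically distributed agents and aggregate the results into the percentage shown above.

```python
import urllib.request

TARGET = "https://www.example.com/"   # placeholder for the monitored service
ATTEMPTS = 25

def check_once(url: str, timeout: float = 5.0) -> bool:
    """Return True if the service answers with a 2xx/3xx HTTP status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 400
    except OSError:                    # covers URLError, HTTPError, timeouts
        return False

successes = sum(check_once(TARGET) for _ in range(ATTEMPTS))
print(f"Availability over {ATTEMPTS} probes: {100.0 * successes / ATTEMPTS:.1f}%")
```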

Path Visualization View

Next, we move to the Path Visualization view to understand where the loss is occurring. The figure below shows the routes from an agent in Phoenix on the left to the node where the path ends on the far right. Interfaces with significant loss are circled in red. When an interface is selected, it is outlined with a dashed line, and the information our agents gathered is displayed in a box when you hover over it (Figure 2).

Figure 2: Routes from Phoenix to Ancestry.com terminate

The Phoenix agent’s probes all terminated at this single location, resulting in 100% packet loss inside the XO Communications network. During the test, we had multiple agents probing this site to look into the event in greater detail, and we saw an interesting pattern emerge. All five agent locations in this test exhibited the same behavior: 100% packet loss inside a single network, XO Communications.

Figure 3: All affected locations have routes terminating in XO Communications’ network
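
For readers who want to reproduce this kind of per-hop loss view by hand, here is a rough sketch that shells out to the open-source mtr tool and flags lossy interfaces. The target hostname, the loss threshold, and the parsing of mtr’s report output are assumptions that may need adjusting for your environment.

```python
import subprocess

TARGET = "www.example.com"   # placeholder for the monitored host
LOSS_THRESHOLD = 50.0        # arbitrary cutoff for "significant" loss

# Run 10 probe cycles and produce a plain-text report (one line per hop).
report = subprocess.run(
    ["mtr", "--report", "--report-cycles", "10", TARGET],
    capture_output=True, text=True, check=True,
).stdout

for line in report.splitlines():
    if "|--" not in line:            # skip header/summary lines
        continue
    fields = line.split()
    hop = fields[1]                  # hop address, or "???" when unresponsive
    loss = float(fields[2].rstrip("%"))
    if loss >= LOSS_THRESHOLD:
        print(f"High loss at hop {hop}: {loss:.0f}%")
```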

Why are all these interfaces inside the XO Communications network dropping packets? To answer this question, we look at another view, the BGP Route Visualization.

Figure 4: BGP routes are revoked between AS36175 and XO Communications AS2828

Before we get into what happened in this example, let’s go through what we’re looking at in the figure above. Each BGP Autonomous System (AS) is assigned a unique Autonomous System Number (ASN) used to route traffic across the Internet. In this example, we have three ASNs in this view: AS 2828, registered to XO Communications; AS 31993, American Fiber Systems, Inc.; and AS 36175, myfamily.com, Inc. You can hover over an individual ASN to get information about that AS.

Destination networks are shown in green in this view; in this case, that is myfamily.com (Ancestry.com), the site we were monitoring. Intermediary ASNs in the path from the origin to the monitors are shown as grey-shaded circles, with the network’s AS number inside the circle. In this case, the transit networks are AS 31993 (American Fiber Systems) and AS 2828 (XO Communications).

The smaller circles with location names represent BGP routers that export their BGP best paths to data collectors; we also call these routers monitors. The label “3” on the path between Vancouver and AS 2828 indicates three AS hops between that monitor and the AS 2828 network. Dotted red lines represent links that were used as the best path at some point during the 15-minute time window the topology refers to, but are no longer in use at the end of that window. In this case, the link to the upstream XO Communications (AS 2828) stopped being used in favor of AS 31993.
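
If you want to see which upstream ASes route collectors currently observe for a prefix, here is a hedged sketch that queries the public RIPEstat looking-glass endpoint, which aggregates routes seen by RIPE RIS collectors. The prefix below is a documentation placeholder, and the exact response layout is an assumption to verify against the live API.

```python
import json
import urllib.request

PREFIX = "192.0.2.0/24"   # documentation placeholder; use the monitored prefix
URL = f"https://stat.ripe.net/data/looking-glass/data.json?resource={PREFIX}"

with urllib.request.urlopen(URL, timeout=10) as resp:
    data = json.load(resp)["data"]

upstreams = set()
for rrc in data.get("rrcs", []):          # one entry per route collector
    for peer in rrc.get("peers", []):
        hops = peer.get("as_path", "").split()
        if len(hops) >= 2:
            upstreams.add(hops[-2])       # the AS just before the origin AS

print("Upstream ASes observed for the prefix:", sorted(upstreams))
```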

We can now understand why we saw 100% packet loss inside the XO Communications network. A BGP route change occurred, and as a result, there were no longer routes available via XO. However, due to BGP convergence delay, packets were still being forwarded to XO Communications, so traffic en route to the myfamily.com AS could not reach its destination.
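
To illustrate why that convergence gap blackholes traffic, here is a toy Python model (not a real BGP implementation): the control plane withdraws the path via AS 2828, but the forwarding table still points at it until the alternate path via AS 31993 is installed. The prefix is a documentation placeholder standing in for the monitored network.

```python
PREFIX = "203.0.113.0/24"            # placeholder for the monitored prefix

rib = {PREFIX: ["2828", "36175"]}    # control plane: current best AS path
fib = {PREFIX: "AS2828"}             # data plane: next hop actually installed

def forward(prefix: str) -> str:
    next_hop = fib.get(prefix)
    return f"forwarding via {next_hop}" if next_hop else "no route"

# 1. The route via XO Communications (AS 2828) is withdrawn by the control plane.
del rib[PREFIX]
# The FIB still points at AS 2828, which no longer has a route onward,
# so these packets are effectively blackholed.
print("during convergence:", forward(PREFIX))

# 2. Convergence completes: the alternate path via AS 31993 is learned and installed.
rib[PREFIX] = ["31993", "36175"]
fib[PREFIX] = "AS31993"
print("after convergence: ", forward(PREFIX))
```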

Conclusion

When things break, you want to understand where, when, and why the issue occurred. You need to look at forwarding paths, dig into BGP, and correlate both with actual application behavior to understand why that dip in availability happened in the first place.