Guest article
From time to time, we like to ask our partners to share their ITOps expertise and best practices with you. Today, Boris Rogier, Director of Business Development at Accedian, discusses TCP performance for your SaaS and cloud applications. Read on!
As Director of Business Development, Boris is responsible for leading innovation around Accedian’s network and application performance solutions for enterprise IT. He applies more than 15 years of IT operations, network, and application development experience to advise organizations across all verticals on best practices to optimize performance in multi-cloud, virtualized, and software-as-a-service (SaaS) infrastructure environments. Boris holds business law and economy & finance degrees from EDHEC Business School and Institut d’Etudes politiques de Bordeaux.
Troubleshooting TCP performance in complex IT environments that integrate SaaS and cloud-hosted applications can be quite challenging. These applications often degrade because of unhealthy TCP relationships (sessions) between clients and servers across physical, SaaS, and cloud infrastructure. The way TCP sessions are set up and torn down directly affects SaaS and cloud performance and the user experience, especially when there is reason to believe that hosts are overloaded and messages are being dropped. A persistent increase in the number of TCP zero window (0-Win) events and duplicate acknowledgements (DupAck) is typically a good indicator that end users are suffering from degraded performance. With a structured approach, detecting and solving poor TCP/IP performance that affects SaaS and cloud applications becomes straightforward and leads to quick resolution of network, server, and application degradations, keeping dysfunctional relationships from ruining your users’ day, and your own.
Trouble with your TCP performance? Find the root cause in just 6 easy steps.
Finding the root cause of TCP performance issues impacting SaaS and cloud applications can be challenging and time-consuming. The following 6 steps help you speed up this process and include some actionable pointers toward finding the “low-hanging fruit” when looking for ways to mitigate TCP performance issues and improve the SaaS and cloud application user experience.
Step 1. Start by ruling out an overloaded client or server side by taking a look at the number of 0-Win events. If these events are coming in rapidly, you may want to involve the respective desktop or system administrator(s) and have a look at the workload on these hosts.
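The check in Step 1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the packet records, their field names (src, dst, window, rst), and the addresses are hypothetical stand-ins for whatever your capture or monitoring tool exports.

```python
from collections import Counter

# Hypothetical, simplified packet records. In practice these fields would
# come from a packet capture or a monitoring tool's export.
packets = [
    {"src": "10.0.0.5", "dst": "10.0.1.9", "window": 0, "rst": False},
    {"src": "10.0.0.5", "dst": "10.0.1.9", "window": 0, "rst": False},
    {"src": "10.0.1.9", "dst": "10.0.0.5", "window": 29200, "rst": False},
    {"src": "10.0.1.9", "dst": "10.0.0.5", "window": 0, "rst": True},
]


def zero_window_counts(records):
    """Count zero-window advertisements per sending host.

    RST segments are skipped because they commonly carry a window of 0
    without indicating an overloaded receiver.
    """
    counts = Counter()
    for pkt in records:
        if pkt["window"] == 0 and not pkt["rst"]:
            counts[pkt["src"]] += 1
    return counts


counts = zero_window_counts(packets)
for host, events in counts.most_common():
    print(f"{host}: {events} zero-window events")
```

The host that advertises the zero window is the one that cannot keep up with incoming data, so it is the sender of these segments, not the receiver, whose workload deserves a closer look.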
Step 2. If the number of 0-Win events is close to zero, then most likely the TCP transmission problem is somewhere on the network path between the client and server side. If both are within the same subnet, it should be fairly easy to figure out where the delays and/or drops are coming from. A quick look at the MAC tables from the connected network devices should tell you which devices and interfaces are involved.
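Whether the client and server share a subnet can be checked quickly with Python's standard ipaddress module. The sketch below assumes you know each host's address and prefix length; the addresses shown are made-up examples.

```python
import ipaddress


def same_subnet(client_if, server_if):
    """Return True when both interfaces (address/prefix) sit in the same network."""
    client = ipaddress.ip_interface(client_if)
    server = ipaddress.ip_interface(server_if)
    return client.network == server.network


# Example addresses are illustrative only.
print(same_subnet("192.168.10.25/24", "192.168.10.80/24"))  # True
print(same_subnet("192.168.10.25/24", "192.168.20.80/24"))  # False
```

If the function returns False, at least one routing device sits between the two hosts.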
Step 3. If the client and server side are not within the same subnet, it means that one or more routers (or similar devices) are involved. Start by finding the intermediate subnets, devices, and interfaces by looking at the MAC addresses and routing tables of the designated gateway on the client and server side. This should tell you which other routers and interfaces are actively involved in sending and receiving messages.
Step 4. If it turns out that the gateway MAC addresses on the client and server side both point to the same routing device, then most likely that routing device has too many things to do besides routing messages. For example, maybe the device is actually a firewall with (too?) many policies. Perhaps it is a load balancer running CPU-intensive tasks such as intrusion detection and prevention (IDS/IPS), SSL offloading, or data compression. This is probably a good time to involve the system administrator of these devices.
Step 5. However, if the two MAC addresses point to different routing devices, then most likely one or more WAN connections are involved in reaching the cloud or SaaS application. If the WAN connections are redundant, check the load-sharing algorithm on the routers. Modern IP routers and switches support packet-based load sharing. While this is a very effective way of sharing load, it can have some unexpected side effects: packets of the same session may travel over different paths and arrive in a different order than they were sent. Such asymmetric network paths may require additional processing time on the hosts, which have to put the messages back in sequence.
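To get a feel for whether packet-based load sharing is reordering traffic, you can count segments that arrive with a lower sequence number than one already seen in the same direction of a session. The following is a rough sketch rather than a complete analysis: it assumes a list of TCP sequence numbers in arrival order for a single flow, ignores sequence number wraparound, and cannot tell a reordered segment from a retransmission.

```python
def count_out_of_order(seq_numbers):
    """Count segments that arrive after a segment with a higher sequence number.

    Assumes one direction of one TCP session, no sequence wraparound.
    """
    highest = None
    out_of_order = 0
    for seq in seq_numbers:
        if highest is not None and seq < highest:
            out_of_order += 1
        else:
            highest = seq
    return out_of_order


# Hypothetical arrival order: the segment starting at 3920 arrives late.
arrival_order = [1000, 2460, 5380, 3920, 6840, 8300]
late_segments = count_out_of_order(arrival_order)
print(f"{late_segments} segment(s) arrived out of order")
```

A consistently non-zero count on sessions that cross redundant WAN links is a hint to compare packet-based load sharing against a flow-based alternative, if your routers offer one.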
Step 6. Once you have an understanding of the devices and interfaces between the client and server side, start looking at things like CPU and memory utilization, frame drops, CRC errors, buffer overflows, and interface utilization. These are good indicators of what could have caused packet drops, which in turn cause additional delays due to retransmissions.
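Interface counters are usually cumulative, so the useful numbers come from the difference between two readings. The sketch below shows that arithmetic under stated assumptions: the counter names (in_octets, in_packets, crc_errors, drops) and the sample values are hypothetical, and counter wraparound between readings is not handled.

```python
def interface_health(before, after, interval_s, speed_bps):
    """Turn two cumulative counter readings into utilization and error rates (%)."""
    delta = {name: after[name] - before[name] for name in before}
    packets = delta["in_packets"]
    return {
        # Octets are bytes, so multiply by 8 to compare against link speed in bits.
        "utilization_pct": delta["in_octets"] * 8 * 100 / (interval_s * speed_bps),
        "crc_error_pct": 100 * delta["crc_errors"] / packets if packets else 0.0,
        "drop_pct": 100 * delta["drops"] / packets if packets else 0.0,
    }


# Hypothetical readings taken 60 seconds apart on a 1 Gbit/s interface.
before = {"in_octets": 1_000_000_000, "in_packets": 4_000_000, "crc_errors": 100, "drops": 1_000}
after = {"in_octets": 1_750_000_000, "in_packets": 5_000_000, "crc_errors": 600, "drops": 3_000}

health = interface_health(before, after, interval_s=60, speed_bps=1_000_000_000)
for metric, value in health.items():
    print(f"{metric}: {value:.2f}")
```

An interface that shows modest average utilization but a growing share of drops or CRC errors is a more likely culprit than one that is simply busy.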
How can a unified N/APM monitoring solution help you troubleshoot TCP performance?
When you need to perform these steps regularly, consider deploying a wire data analytics monitoring solution. Typically, the topology capabilities of such solutions automate device discovery between two hosts by translating the contents of MAC and routing tables into a topology map. They can also automate the analysis and reporting of TCP metrics for each session (SYN, SYN-ACK, RST, 0-Win, and more), which allows you to isolate problems quickly without having to perform manual packet analysis. Learn more on troubleshooting TCP with our 5 Steps to Troubleshoot SaaS Applications using TCP Analysis guide available here.