Network Health Checks for Full Scope Analysis

Written by Rachel Grant | Oct 4, 2021 8:52:42 PM

Measuring the Health of Your Network to Maximize Uptime

As with anything in your sphere of influence, you cannot manage what you do not measure. In today’s world of cloud and distributed architecture, it is critical to measure the system holistically, and that includes directly measuring the key elements of your infrastructure and network components.

Common issues (such as flaky network ports in data centers, under-provisioned load balancers, and globally distributed services) can cause outages and micro-outages that may only surface once systems are under moderate or peak load levels.

It is vital for Dev/Ops and Network Ops teams to have visibility into what is going wrong, even if the issue only lasts for a few seconds. The telemetry should be smart enough not only to detect issues, but also to show which hop along the network is causing high packet loss, excessive latency, jitter, or other network problems.
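To make those three signals concrete, here is a minimal sketch of how packet loss, latency, and jitter could be derived from a batch of probe results. The function name `summarize_probes` and the input shape (one round-trip time in milliseconds per probe, `None` for a probe that never came back) are assumptions made for illustration, not part of any product API. Jitter is computed here as the mean absolute difference between consecutive successful round trips, which is one common simplification; other definitions exist.

```python
from statistics import mean
from typing import Optional, Sequence


def summarize_probes(rtts_ms: Sequence[Optional[float]]) -> dict:
    """Summarize one batch of probes sent to a single hop or host.

    rtts_ms holds one entry per probe sent: the round-trip time in
    milliseconds, or None if the probe was lost (hypothetical input shape).
    """
    sent = len(rtts_ms)
    received = [r for r in rtts_ms if r is not None]

    # Packet loss: share of probes that never came back.
    loss_pct = 100.0 * (sent - len(received)) / sent if sent else 0.0

    # Latency: average round-trip time of the probes that did come back.
    avg_ms = mean(received) if received else None

    # Jitter (simplified): mean absolute change between consecutive
    # successful round trips. Needs at least two successful probes.
    diffs = [abs(b - a) for a, b in zip(received, received[1:])]
    jitter_ms = mean(diffs) if diffs else None

    return {"loss_pct": loss_pct, "avg_ms": avg_ms, "jitter_ms": jitter_ms}


# Four probes, one lost: 25% loss, 11 ms average, 1.5 ms jitter.
print(summarize_probes([10.0, 12.0, None, 11.0]))
```

A few seconds of lost probes shows up in a summary like this even when a single ping a minute later would look perfectly healthy.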

Classic Dev/Ops or Network Ops Process

Talk to enough teams and you’ll hear the following process:

1. The issue is detected with a simple ping measurement.
2. A team member stops what they are doing, or is interrupted by an alert, and logs into multiple hosts.
3. From different locations, they run MTR to get aggregate traceroute information and find the device or network port that is causing the failures.
4. If the problem is no longer happening, it becomes a waiting game until the problem happens again.

Now imagine the telemetry captured MTR results continuously, collapsing the process above into one simple, time-efficient step and significantly cutting MTTR (mean time to repair):

A team member is alerted to a network or node issue in the context of when the micro-outages happened, and immediately sees the MTR results from multiple locations, showing the network node that had the problem.

Blue Triangle’s Approach

Network Health Checks enable you to run ping and traceroute-style tests to monitor issues such as latency, packet loss, and jitter across your network infrastructure. But here at Blue Triangle, we’ve been busy taking Network Health Checks to the next level using My Traceroute (MTR). So, what does that mean exactly?

How Do Network Health Checks Work?

Network Health Checks are configurable synthetic monitors that can run with Internet Control Message Protocol (ICMP), Transmission Control Protocol (TCP), User Datagram Protocol (UDP), or Stream Control Transmission Protocol (SCTP). These protocols determine how data is communicated and what kind of packets are sent. This offers engineers flexibility in the types of tests sent through the network. You can configure additional monitor options such as max number of hops, domain/host to measure, report cycles, and error state tracking.
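As a rough illustration of what such a monitor definition might hold, the sketch below models the options named above as a small Python data class. Every name in it (`HealthCheckConfig`, `Protocol`, the field names, the default values, and the example host) is hypothetical and chosen only for this example; it is not Blue Triangle's actual configuration format.

```python
from dataclasses import dataclass
from enum import Enum


class Protocol(Enum):
    """Probe protocols listed in the article."""
    ICMP = "icmp"
    TCP = "tcp"
    UDP = "udp"
    SCTP = "sctp"


@dataclass(frozen=True)
class HealthCheckConfig:
    """Hypothetical monitor definition; field names are assumptions."""
    target: str                          # domain or host to measure
    protocol: Protocol = Protocol.ICMP   # how probes are sent
    max_hops: int = 30                   # stop tracing after this many hops
    report_cycles: int = 10              # probes sent per hop in each report
    track_error_state: bool = True       # remember failing state between runs

    def __post_init__(self) -> None:
        # Basic sanity checks so a bad monitor fails at creation time.
        if not self.target:
            raise ValueError("target must be a non-empty host or domain")
        if not 1 <= self.max_hops <= 255:
            raise ValueError("max_hops must be between 1 and 255")
        if self.report_cycles < 1:
            raise ValueError("report_cycles must be at least 1")


# Example: a TCP-based check against a placeholder host.
check = HealthCheckConfig(target="example.com", protocol=Protocol.TCP, max_hops=20)
print(check)
```

Validating at construction time means a misconfigured monitor is rejected before it ever sends a probe.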

Where Does MTR Fit Into This?

With typical latency or traceroute measurements, if there is an issue, a DevOps engineer will need to log in to see where in the route the packet loss is happening, which takes time and resources, and can also lead to false positives.

With our system using MTR, you can detect micro-outages and know which part of the network failed during that time period. Because of this, we get repeatable results that don’t require verification from network personnel, and we catch the culprit in the path: the node along the way that had the failure. Using MTR in this manner means more definitive results for understanding where the problem is in the network, and a faster path to resolution.
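One way to picture "catching the culprit in the path" is the sketch below. It assumes per-hop statistics have already been collected into a simple list, and it applies a common rule of thumb for reading MTR reports: loss that appears at one intermediate hop but not at later hops is often just that router deprioritizing probe replies, while loss that persists all the way to the destination usually starts at the first hop of that lossy run. The `HopStats` type, the `first_lossy_hop` function, the 5% threshold, and the sample hosts are all assumptions for illustration, not the product's actual logic.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class HopStats:
    """Per-hop summary from one MTR-style report (hypothetical shape)."""
    hop: int
    host: str
    loss_pct: float
    avg_ms: float


def first_lossy_hop(
    hops: Sequence[HopStats], threshold_pct: float = 5.0
) -> Optional[HopStats]:
    """Return the earliest hop of the lossy run that reaches the destination.

    Walks backward from the destination. As long as each hop shows loss at
    or above the threshold, it is part of the run; the first clean hop ends
    the walk. Returns None when the destination itself is healthy.
    """
    culprit = None
    for h in reversed(hops):
        if h.loss_pct >= threshold_pct:
            culprit = h
        else:
            break
    return culprit


# Sample report: hop 2 looks bad in isolation, but later hops are clean,
# so it is ignored. Loss from hop 4 onward persists to the destination.
report = [
    HopStats(1, "10.0.0.1", 0.0, 0.4),
    HopStats(2, "192.0.2.1", 40.0, 1.2),
    HopStats(3, "192.0.2.9", 0.0, 3.1),
    HopStats(4, "198.51.100.7", 12.0, 9.8),
    HopStats(5, "203.0.113.10", 15.0, 10.5),
]
culprit = first_lossy_hop(report)
print(culprit.hop if culprit else "no persistent loss")  # prints 4
```

Running the same check on reports gathered from several locations at the time of a micro-outage, then comparing which host each one points to, is what turns a vague "packet loss somewhere" alert into a specific node to investigate.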
