As with everything in your sphere of influence, you cannot manage what you do not measure. In today’s world of cloud and distributed architectures, it is critical to measure the system holistically, and that includes directly measuring the key elements of your infrastructure and network.
Common issues (such as flaky network ports in data centers, under-provisioned load balancers, and globally distributed services) can cause outages and micro-outages that may only surface once systems are under moderate or peak load.
It is vital for DevOps and Network Ops teams to have visibility into what is going wrong, even if the issue only lasts for a few seconds. The telemetry should be smart enough not only to detect issues, but also to indicate which hop along the network is causing high packet loss, excessive latency, jitter, or other network problems.
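To make those three signals concrete, here is a minimal Python sketch of how packet loss, average latency, and jitter could be derived from one cycle of probe round-trip times. The function name `summarize_probes` and the returned keys are hypothetical, and jitter is assumed here to be the mean absolute difference between consecutive replies; other definitions exist, and this is not a description of how any particular product computes it.

```python
from statistics import mean
from typing import Dict, Optional, Sequence


def summarize_probes(rtts_ms: Sequence[Optional[float]]) -> Dict[str, Optional[float]]:
    """Summarize one probe cycle. A value of None marks a probe with no reply."""
    sent = len(rtts_ms)
    replies = [r for r in rtts_ms if r is not None]

    # Packet loss: share of probes that never received a reply.
    loss_pct = 100.0 * (sent - len(replies)) / sent if sent else 0.0

    # Latency: average round-trip time over the probes that did reply.
    avg_ms = mean(replies) if replies else None

    # Jitter (assumed definition): mean absolute difference between
    # consecutive replies. Lost probes are skipped, not counted as zero.
    deltas = [abs(b - a) for a, b in zip(replies, replies[1:])]
    jitter_ms = mean(deltas) if deltas else None

    return {"loss_pct": loss_pct, "avg_ms": avg_ms, "jitter_ms": jitter_ms}


# Four probes, one of which was lost.
print(summarize_probes([10.0, 12.0, None, 11.0]))
```

A short-lived problem shows up as a spike in one of these values for a single cycle, which is why per-cycle summaries matter more than long-run averages when hunting micro-outages.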
Talk to enough teams and you’ll hear the following process:
Now imagine if the telemetry provided the MTR data all the time, streamlining the above process into one simple, time-efficient step and significantly cutting down MTTR (mean time to repair):
Network Health Checks enable you to run ping and traceroute-style tests to monitor issues such as latency, packet loss, and jitter across your network infrastructure. But here at Blue Triangle, we’ve been busy taking Network Health Checks to the next level by using My Traceroute (MTR). So, what does that mean exactly?
Network Health Checks are configurable synthetic monitors that can run over Internet Control Message Protocol (ICMP), Transmission Control Protocol (TCP), User Datagram Protocol (UDP), or Stream Control Transmission Protocol (SCTP). The protocol determines how data is communicated and what kind of packets are sent, which gives engineers flexibility in the types of tests sent through the network. You can also configure additional monitor options such as the maximum number of hops, the domain or host to measure, the number of report cycles, and error state tracking.
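As a rough illustration of those options, the sketch below models a monitor configuration in Python and translates it into arguments for the open-source `mtr` command-line tool. The class `HealthCheckConfig`, its field names, and its defaults are hypothetical and are not Blue Triangle’s actual configuration schema; the `mtr` flags used (`--report`, `--report-cycles`, `--max-ttl`, `--tcp`, `--udp`, `--sctp`) are the tool’s standard options.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class Protocol(Enum):
    ICMP = "icmp"
    TCP = "tcp"
    UDP = "udp"
    SCTP = "sctp"


# mtr sends ICMP echo requests by default; other protocols need a flag.
_PROTOCOL_FLAGS = {
    Protocol.ICMP: [],
    Protocol.TCP: ["--tcp"],
    Protocol.UDP: ["--udp"],
    Protocol.SCTP: ["--sctp"],
}


@dataclass(frozen=True)
class HealthCheckConfig:
    """Hypothetical monitor settings mirroring the options described above."""
    host: str                          # domain or host to measure
    protocol: Protocol = Protocol.ICMP
    max_hops: int = 30                 # maximum number of hops to probe
    report_cycles: int = 10            # probes sent per hop in each report
    track_error_state: bool = True     # no mtr equivalent; handled by the monitor

    def __post_init__(self) -> None:
        if not self.host:
            raise ValueError("host must not be empty")
        if not 1 <= self.max_hops <= 255:
            raise ValueError("max_hops must be between 1 and 255")
        if self.report_cycles < 1:
            raise ValueError("report_cycles must be at least 1")

    def to_mtr_args(self) -> List[str]:
        """Build an argument list for the mtr CLI (built here, not executed)."""
        return [
            "mtr", "--report",
            "--report-cycles", str(self.report_cycles),
            "--max-ttl", str(self.max_hops),
            *_PROTOCOL_FLAGS[self.protocol],
            self.host,
        ]


config = HealthCheckConfig(host="example.com", protocol=Protocol.TCP,
                           max_hops=20, report_cycles=5)
print(" ".join(config.to_mtr_args()))
```

Validating the settings up front keeps a misconfigured monitor from silently producing empty reports.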
With typical latency or traceroute measurements, if there is an issue, a DevOps engineer needs to log in and work out where along the route the packet loss is happening. That takes time and resources, and can also lead to false positives.
With our system using MTR, you can detect micro-outages and know which part of the network failed during that time period. Because of this, we get repeatable results that don’t require verification from network personnel, and we catch the culprit in the path: the node along the way that had the failure. Using MTR in this manner gives more definitive results for understanding where the problem is in the network, which leads to a faster resolution.
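To illustrate what “catching the culprit in the path” can look like, here is a minimal Python sketch that picks the suspect hop from per-hop MTR-style statistics. The names `HopStats` and `find_culprit_hop`, the sample data, and the 5% threshold are all hypothetical. The sketch assumes a common rule of thumb for reading MTR output: loss at an intermediate hop that does not carry through to the destination is usually a router rate-limiting its replies, so only loss that persists to the final hop is treated as real.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class HopStats:
    hop: int          # position in the path, starting at 1
    host: str         # address that answered at this hop
    loss_pct: float   # percentage of probes lost at this hop
    avg_ms: float     # average round-trip time to this hop


def find_culprit_hop(hops: List[HopStats],
                     threshold_pct: float = 5.0) -> Optional[HopStats]:
    """Return the first hop of the lossy run that reaches the destination.

    If a later hop recovers, earlier loss is discarded as likely rate limiting.
    """
    culprit: Optional[HopStats] = None
    for h in hops:
        if h.loss_pct >= threshold_pct:
            if culprit is None:
                culprit = h      # start of a possible lossy run
        else:
            culprit = None       # loss did not persist, so reset
    return culprit


# Illustrative data only (addresses come from documentation ranges).
path = [
    HopStats(1, "192.0.2.1", 0.0, 1.2),
    HopStats(2, "192.0.2.17", 40.0, 3.5),    # isolated loss, later hops recover
    HopStats(3, "198.51.100.4", 0.0, 9.8),
    HopStats(4, "198.51.100.9", 10.0, 48.1),  # loss starts and persists
    HopStats(5, "203.0.113.20", 12.0, 50.3),
]

suspect = find_culprit_hop(path)
if suspect is not None:
    print(f"Suspect hop {suspect.hop} ({suspect.host}): {suspect.loss_pct}% loss")
```

The same pattern could be extended to latency or jitter by comparing each hop against the hops after it instead of looking at a single value in isolation.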