Businesses of all sizes rely on their local ISP for a reliable connection to countless cloud applications, whether profession-specific, such as electronic health records for doctors, or general-purpose, such as cloud-hosted voice and video.
When network quality issues arise, all parties share the headache. The customer’s business is impacted. The SaaS provider (e.g., a hosted VoIP provider) is forced to defend its service. And depending on the customer, an MSP or IT solutions provider may be trying to juggle it all. Add in the ISP and the finger-pointing party is in full swing.
Unfortunately, despite being in an age of virtual reality and self-driving cars, troubleshooting Internet connections is still stuck in the 1990s, relegated to legacy point-in-time tools like ping and traceroute. And while enterprise solutions exist, they often cost many hundreds of dollars a month per site and come with a steep learning curve that demands in-depth technical knowledge.
At Firebind, we knew we could bridge the gap between the two extremes, ad hoc ping tests on one end and “luxury” (expensive and complicated) solutions on the other, at a price point within reach of virtually any business. Users can configure and deploy an agent in just a few minutes, whether they are a direct end user, a SaaS business or MSP deploying in partnership with an end customer, or even an ISP.
So How Does It Work?
Firebind is a monitoring solution that uses active traffic at regular intervals to paint a picture of network quality. All of our measurements are based on packets we put on the wire and then observe either by the agent that sent the traffic or a remote agent (such as one of our public targets) that received the traffic.
Firebind is not a passive solution. We don’t look at NetFlow data, we don’t use port mirrors to monitor packets on the wire, and we don’t pull stats from network devices or aggregate data from other applications.
There are countless passive solutions, and they all share one major limitation: they are only as good as the traffic on the wire. If there is no traffic, there is no monitoring. Getting per-call stats from a VoIP solution is great, but it needs to be complemented by an active solution like Firebind, because if no one is making calls from 9pm to 8am, there is nothing to passively monitor. Because Firebind runs its test suite every 5 minutes, 24 hours a day, we have the same detailed view at 3 in the morning that we do at 3 in the afternoon.
What Does Firebind Measure?
Firebind measures 11 different metrics every 5 minutes, generating 3,168 data points per day per agent.
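The daily total follows directly from the test interval. A quick check in Python (the variable names are illustrative, not part of Firebind):

```python
METRICS_PER_RUN = 11    # metrics measured in each test run
INTERVAL_MINUTES = 5    # one test run every 5 minutes

# Number of 5-minute intervals in a 24-hour day.
runs_per_day = 24 * 60 // INTERVAL_MINUTES

# Every run produces one data point per metric.
data_points_per_day = METRICS_PER_RUN * runs_per_day

print(runs_per_day, data_points_per_day)
```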
Measurements To One of Firebind’s Public Test Targets
- Upload and Download Packet Loss based on a simulated G.711 VoIP call (87 Kbps, 50 pps, 218 byte payload)
- UDP Client and Server Jitter
- UDP Round-Trip-Time (20 UDP packets)
- Average Ping Round-Trip-Time (20 ping packets)
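The 87 Kbps figure for the simulated call can be reproduced from the other two numbers quoted above. This is only arithmetic on the stated values; the constant names are illustrative:

```python
PACKET_BYTES = 218        # per-packet size quoted for the simulated call
PACKETS_PER_SECOND = 50   # packet rate quoted for the simulated call

# Bytes to bits, then multiply by the packet rate.
bits_per_second = PACKET_BYTES * 8 * PACKETS_PER_SECOND
kilobits_per_second = bits_per_second / 1000

print(bits_per_second, kilobits_per_second)
```

The result is 87,200 bits per second, which rounds to the 87 Kbps stated in the list.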
User Configured Measurements
- Average and Maximum Ping Round-Trip-Time to a user-specified destination
- HTTP Response time to a user-specified destination
- DNS Response time of a user-specified domain name, queried against:
  - The DNS server specified on the agent host machine
  - Google’s public DNS server (8.8.8.8)
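To give a feel for what a DNS response-time measurement involves, here is a minimal sketch that times a single UDP query. It is not Firebind's implementation: the function names, the fixed query ID, and the 2-second timeout are all assumptions made for illustration, and the reply is not validated.

```python
import socket
import struct
import time


def build_dns_query(domain: str, query_id: int = 0x1234) -> bytes:
    """Build a minimal DNS query for an A record (RFC 1035 wire format)."""
    # Header: ID, flags (recursion desired), 1 question, no other records.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    # QNAME: each label prefixed by its length, terminated by a zero byte.
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in domain.rstrip(".").split(".")
    ) + b"\x00"
    # QTYPE = 1 (A record), QCLASS = 1 (Internet).
    return header + qname + struct.pack("!HH", 1, 1)


def time_dns_lookup(server: str, domain: str, timeout: float = 2.0) -> float:
    """Return the time in milliseconds until `server` answers one query."""
    query = build_dns_query(domain)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        start = time.perf_counter()
        sock.sendto(query, (server, 53))
        sock.recvfrom(512)  # raises a timeout error if no reply arrives
        return (time.perf_counter() - start) * 1000.0
```

Calling `time_dns_lookup` once with the host's configured resolver and once with a public resolver, for the same domain, gives the two values that a chart like the DNS Response Time graph would plot for each interval.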
What Is Firebind's Architecture?
Firebind has 3 main components: the Firebind cloud interface (aka console), public test target agents that are managed by Firebind, and customer “initiator” agents that are deployed at remote sites.
All Firebind managed components are hosted at AWS. The console is hosted at US-East-1 in Virginia and leverages Amazon’s highly available and MySQL compatible Aurora database offering. Both the console and Firebind hosted test targets use the latest generation C4 compute optimized EC2 instances to ensure accurate packet loss measurements and eliminate false positives that could be generated by underpowered processors. Firebind test targets are currently available in US-East-1 (Virginia) and EU-West-1 (Dublin, Ireland). Future test targets will be added based on customer demand.
How Do I Know When There Are Network Quality Issues?
Firebind operates on a very simple premise: take measurements every 5 minutes to establish a performance baseline, so that when problems arise, they are easy to identify. Six different graphs represent the 11 measurements gathered every 5 minutes, each with a horizontal dashed line marking the user-configured alarm threshold. If a measurement exceeds that alarm threshold, the user can see it on the graph, and an email can also be sent to a list of recipients.
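The threshold check itself is simple to picture. The sketch below is hypothetical (the function name and sample data are made up, and no email is sent); it only shows the comparison of each 5-minute sample against a configured threshold:

```python
from typing import List, Tuple


def find_breaches(
    samples: List[Tuple[str, float]], threshold: float
) -> List[Tuple[str, float]]:
    """Return the (time, value) samples whose value exceeds the threshold."""
    return [(t, v) for t, v in samples if v > threshold]


# Made-up HTTP response times in ms, one per 5-minute interval.
samples = [("01:50", 140.0), ("01:55", 320.0), ("02:00", 180.0)]
breaches = find_breaches(samples, threshold=300.0)

print(breaches)
```

Only the 01:55 sample is reported, since it is the only one above the 300ms threshold.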
Packet Loss Graph
In the above 1-day chart of UDP packet loss we’re testing a simulated G.711 VoIP call for 25 seconds in each direction from a Firebind agent in Boston to the Firebind hosted target agent at AWS US-East-1 in Virginia. We see minimal packet loss until 9am, when upload loss steadily increases to over 10%, then drops to 0% at 12:20. Ideally this chart should show a flat line at 0% with at most only “blips” of transient loss a few times a day. Because the synthetic G.711 VoIP call sends 50 packets per second, this test can see loss events that are not visible at ping’s traditional rate of 1 packet per second. If ping is a magnifying glass, Firebind is a microscope. Note that while certain versions of ping can be configured to send many packets per second, network providers will generally throttle that increased-rate ping traffic, making it unreliable for troubleshooting purposes. And although this test uses UDP, the fact that UDP packets are being dropped indicates that TCP packets are likely being dropped as well, leading to TCP slowdowns due to retransmissions.
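The magnifying glass versus microscope comparison can be made concrete with a little arithmetic. The sketch below assumes loss is computed as the share of sent packets that never arrived, which is the usual definition but is not spelled out in this article; the names are illustrative:

```python
def loss_percent(sent: int, received: int) -> float:
    """Percentage of sent packets that were not received."""
    if sent <= 0:
        raise ValueError("sent must be positive")
    return 100.0 * (sent - received) / sent


TEST_SECONDS = 25

# Simulated call: 50 packets per second for 25 seconds in one direction.
voip_packets = 50 * TEST_SECONDS
# Traditional ping: 1 packet per second over the same 25 seconds.
ping_packets = 1 * TEST_SECONDS

# Smallest non-zero loss each test can report (a single lost packet).
voip_resolution = loss_percent(voip_packets, voip_packets - 1)
ping_resolution = loss_percent(ping_packets, ping_packets - 1)

print(voip_packets, voip_resolution)
print(ping_packets, ping_resolution)
```

With 1,250 packets per direction, a single lost packet registers as 0.08% loss. Over the same 25 seconds, a 1-per-second ping sends only 25 packets, so the smallest loss it can report is 4%.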
Jitter Graph
In the above chart we’re testing client and server jitter from a Firebind agent in Boston to the Firebind hosted target agent at AWS US-East-1 in Virginia. The performance is well within acceptable limits, with only 2 instances of client jitter exceeding the 6ms alarm threshold.
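This article does not say how Firebind calculates jitter. One widely used definition is the interarrival jitter estimator from RFC 3550 (the RTP specification), sketched below purely as an assumption about what such a measurement could look like. The function name and sample timestamps are made up:

```python
from typing import Sequence


def interarrival_jitter(
    send_times_ms: Sequence[float], recv_times_ms: Sequence[float]
) -> float:
    """Running jitter estimate in ms, following the RFC 3550 formula."""
    jitter = 0.0
    for i in range(1, len(send_times_ms)):
        # Change in transit time between consecutive packets.
        d = (recv_times_ms[i] - recv_times_ms[i - 1]) - (
            send_times_ms[i] - send_times_ms[i - 1]
        )
        # Smooth with a gain of 1/16, as RFC 3550 specifies.
        jitter += (abs(d) - jitter) / 16.0
    return jitter


# Packets sent every 20ms; the second one arrives 6ms late.
print(interarrival_jitter([0.0, 20.0], [30.0, 56.0]))
```

Evenly spaced arrivals produce a jitter of 0, and each late or early packet nudges the estimate upward by one sixteenth of its deviation.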
UDP Latency/Ping Round-Trip-Time Graph
In the above chart we’re measuring both ping and UDP-based RTT (round-trip-time) from a Firebind agent in Boston to the Firebind hosted target agent at AWS US-East-1 in Virginia. The reason for conducting this test using both ping and UDP is to see how each transaction is treated by the network. Since the UDP measurement is from Firebind agent to Firebind agent, and the ping measurement is from Firebind agent to network interface, we’d expect the ping measurement to be slightly faster by a few milliseconds since the Virginia target agent doesn’t play a role in the ping response. Instead, the ping RTT is not only greater than the UDP RTT, but the low-to-high range of the average ping RTT is 15+ milliseconds when it should ideally be 1 to 2ms. This is a telltale sign of a busy network connection since it’s clear that UDP is being prioritized over ping. This chart also shows us what appears to be a routing change whereby the round-trip-time roughly doubles from ~25ms to ~50ms.
Ping Round-Trip-Time To Customer Configured Target Graph
In the above chart we’re looking at the average and maximum ping RTT (round-trip-time) from a Firebind agent in Boston to a customer configured destination. In this case we’re pinging the first network element outside of the agent location, which happens to be the IP address of the CMTS (Cable Modem Termination System), the device that aggregates all of the ISP cable connections in this neighborhood. Firebind sends 20 total pings. In this case the maximum (orange line) should ideally track within a few milliseconds of the average (purple line). However, we can see an average of ~20ms to ~25ms but with maximum values approaching and exceeding 250ms, a 10x difference. This is characteristic of last-mile connections that are oversubscribed. Note the red dashed alarm threshold line, indicating that for the time period of this chart, a single alarm would have been generated when the average round-trip-time exceeded 50ms just before 18:00.
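A small example shows why both the average and the maximum are worth tracking. The 20 RTT values below are made up for illustration, and the function name is hypothetical:

```python
from typing import Sequence, Tuple


def summarize_rtts(rtts_ms: Sequence[float]) -> Tuple[float, float]:
    """Return (average, maximum) of a batch of ping round-trip times."""
    if not rtts_ms:
        raise ValueError("at least one RTT sample is required")
    return sum(rtts_ms) / len(rtts_ms), max(rtts_ms)


# 19 pings at 12ms and a single slow ping at 250ms (20 pings in total).
rtts_ms = [12.0] * 19 + [250.0]
average, maximum = summarize_rtts(rtts_ms)

print(average, maximum, maximum / average)
```

One slow ping out of 20 leaves the average under 25ms, yet the maximum is roughly ten times higher. An average-only view would hide exactly the kind of spike that this chart exposes.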
HTTP Response Time
In the above chart we’re looking at the HTTP response time from a Firebind agent in Boston to a customer configured HTTP destination. Most values fall between 100ms and 200ms, with 2 instances crossing the alarm threshold of 300ms. The HTTP destination can be something common like google.com or a URL for a service that the end user accesses from the site where the agent is deployed. It’s important to note that the range can vary widely, because the time the destination server takes to respond to the HTTP request is less predictable.
DNS Response Time
In the above chart we’re looking at DNS response time, the time it takes for a DNS query to complete. All computing instances rely on local DNS settings, and those settings are frequently configured to use the ISP’s DNS servers. Firebind does two DNS lookups every test interval: one to the agent host machine’s DNS server, and the other to Google’s DNS server at 8.8.8.8. Since DNS is such a critical part of the operation of almost any application, monitoring the performance of the user’s primary DNS server can provide valuable insight into whether network quality issues are related to congestion or to a performance issue specific to DNS resolution. This chart shows the local DNS server lookup to be consistently faster than the Google DNS server lookup, which is generally to be expected. However, just before 2am the local DNS lookup spiked to ~175ms, exceeding the Google lookup time and crossing the 150ms local DNS alarm threshold.