The Google Outage and Dreaded “404: Not Found” Error
When a website will not load, you may ask yourself, “Is this issue only happening on my device?” On Tuesday, November 16th, you were not alone.
Many prominent websites that rely on the Google Cloud Platform experienced a brief outage in which their web pages returned “404: Not Found” errors. This raises two important questions. How did this disruption happen, and what are the implications for online businesses and customers?
The Google Cloud Platform Outage
The Google Cloud Platform, with infrastructure available in over 200 countries and territories, is trusted by many of the world’s leading retail, financial services, healthcare, and media companies. It delivers over 90 information technology services, such as computing, networking, storage, and databases.
According to a statement from Google, the culprit was cloud networking interference. At a high level, cloud networking provides connectivity to and between applications and on-premises, edge, and cloud-based services. After a preliminary root cause analysis, Google determined that a latent bug in a network configuration service was triggered during routine system operations.
Let’s Explore in More Detail
Many popular websites and services, such as Snapchat, Etsy, and Home Depot, were down for approximately 1 hour and 49 minutes on Tuesday, November 16th, due to a Google cloud networking issue. According to Google’s statement, Google Cloud Networking experienced issues with Google Cloud Load Balancing (GCLB).
Load balancing is the methodical distribution of network or application traffic (incoming requests from client devices) across multiple backend servers, based on which servers are best able to fulfill those requests.
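To make the idea concrete, here is a minimal sketch of one common load balancing strategy, "least connections," which sends each request to the healthy server that is currently handling the fewest requests. It is an illustration only and not a description of how GCLB works internally; the Backend class and the pick_backend and route functions are hypothetical names invented for this example.

```python
from dataclasses import dataclass


@dataclass
class Backend:
    """A hypothetical backend server tracked by the load balancer."""
    name: str
    healthy: bool = True
    active_requests: int = 0


def pick_backend(backends):
    """Return the healthy backend with the fewest active requests."""
    candidates = [b for b in backends if b.healthy]
    if not candidates:
        # With nothing healthy to route to, the request cannot be served.
        raise RuntimeError("no healthy backends available")
    return min(candidates, key=lambda b: b.active_requests)


def route(backends, n_requests):
    """Assign requests one at a time and return the backend name chosen for each."""
    chosen = []
    for _ in range(n_requests):
        backend = pick_backend(backends)
        backend.active_requests += 1  # the request is now in flight
        chosen.append(backend.name)
    return chosen
```

In this sketch, an unhealthy server never receives traffic, and a busy server is skipped until the others catch up. Production load balancers weigh many more signals, such as geography and server capacity.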
With over 80 load balancing locations worldwide, GCLB supports more than 1 million queries per second, with the goal of delivering high performance and low latency. However, on Tuesday, that flow was disrupted.
How Were Online Businesses and Customers Impacted?
The GCLB service interruption impacted several downstream Google Cloud services. As a result, affected web pages returned 404 errors, indicating that the requested page was not available. In addition to experiencing this error, many website owners were unable to make changes to their website load balancing and observed a decrease in site traffic.
Blue Triangle observed that many sites experienced only a sporadic end user impact. For example, performance and revenue indicators during the incident were irregular and atypical of the normal sales process.
One-Off Incident or Frequent Trend?
In recent years, there has been global growth in cloud services and greater cloud adoption worldwide. There has also been no shortage of headlines on cloud outages and issues, and how they affect online businesses and their customers.
For example, in December 2020, the Google Cloud Platform also suffered a major outage, affecting the company’s technical support and ability to connect with customers externally. Just one month prior, Amazon Web Services (AWS), one of the most popular cloud computing services in the world, experienced a prolonged, large-scale outage. Like Google Cloud, AWS is the backbone of many websites and applications. The ripple effect was widely felt across the internet. In 2020, Microsoft users were also impacted by a series of problems that crashed Azure, the company’s cloud computing service for application management.
Relying on other providers’ services can help website owners offer content at a massive scale. However, if there is a problem, it may have far-reaching implications. This was also the case during the six-hour outage last month that severely impacted Facebook, now known as Meta Platforms. Read How Facebook Broke the Internet to learn more about protecting your site’s revenue from social media platform outages.
The Bottom Line
Companies that rely solely on cloud services for their infrastructure may be putting their online business at risk of lost revenue, security vulnerabilities, poor site performance, and a poor customer experience if service availability is compromised.
Through proactive anomaly detection and SLA thresholds, you may be better positioned to manage outages and mitigate risk. Be alerted to the problem before the customer journey is negatively impacted and customers begin reporting an unpleasant buying experience.
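As a rough illustration of how these two kinds of checks differ, the sketch below evaluates a site's error rate (for example, the share of requests returning 404) in two ways: against a fixed SLA threshold, and against its own recent baseline using a simple z-score. This is a simplified example under stated assumptions, not Blue Triangle's method; the check_error_rate function, the 5% threshold, and the z-score limit of 3 are all hypothetical values chosen for illustration.

```python
from statistics import mean, stdev


def check_error_rate(history, current, sla_max=0.05, z_limit=3.0):
    """Return a list of alert labels for the newest error rate sample.

    history: recent error rates (fractions between 0 and 1) from normal operation
    current: the newest error rate sample
    sla_max: assumed SLA ceiling for the error rate (hypothetical value)
    z_limit: assumed number of standard deviations that counts as anomalous
    """
    alerts = []

    # Fixed threshold: fires only once the agreed limit is exceeded.
    if current > sla_max:
        alerts.append("sla_breach")

    # Anomaly check: fires when the sample is far above the recent baseline.
    if len(history) >= 2:
        mu = mean(history)
        sigma = stdev(history)
        # Skip the check for a perfectly flat baseline, where sigma is 0.
        if sigma > 0 and (current - mu) / sigma > z_limit:
            alerts.append("anomaly")

    return alerts
```

With a baseline error rate of around 1%, a jump to 3% would be flagged as an anomaly even though it is still under the assumed 5% SLA limit. That early signal is what gives a team time to react before customers notice.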
As a best practice, it is important to monitor your domains in real time and deploy a tag governance approach that defers tags until later in the page load cycle to avoid site-blocking issues. It is also crucial to add first- and third-party tags to your web pages in a way that will not cause the site to shut down completely if a tag fails.
Learn more about the Blue Triangle Content Security Policy Manager to help proactively protect your site’s performance and revenue during an outage.