Google says that the global authentication system outage which affected most consumer-facing services on Monday was caused by a bug in the automated quota management system impacting the Google User ID Service.
This worldwide system failure prevented users from logging into their accounts and authenticating to all Cloud services.
As a direct result, users weren't able to access Gmail, YouTube, Google Drive, Google Maps, Google Calendar, and several other Google services for almost an hour on Monday, December 14th.
During the outage, users could not send emails via Gmail mobile apps or receive email via POP3 for desktop clients, while YouTube visitors were seeing error messages stating that "There was a problem with the server (503) - Tap to retry."
Outage impact and root cause
"On Monday 14 December, 2020 from 03:46 to 04:33 US/Pacific, credential issuance and account metadata lookups for all Google user accounts failed," Google explained. "As a result, we could not verify that user requests were authenticated and served 5xx errors on virtually all authenticated traffic.
"The majority of authenticated services experienced similar control plane impact: elevated error rates across all Google Cloud Platform and Google Workspace APIs and Consoles."
The root cause behind the outage was decreased capacity for Google's central identity management system due to a bug impacting the automated quota management system.
This led to issues verifying that Google user requests were authenticated, resulting in errors being displayed on all authentication attempts.
Global identity management system
The Google User ID Service at the root of Monday's major outage stores unique identifiers for all Google accounts and manages authentication credentials for both OAuth tokens and cookies.
It also stores user account data in a distributed database, which makes use of Paxos protocols to coordinate updates during authentication.
Since the User ID Service will reject requests when detecting outdated data for security reasons, all customer-facing Google services requiring Google OAuth access became unavailable right after the service started experiencing issues and began issuing outdated identifiers.
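The "reject when outdated" behavior can be pictured with a minimal Python sketch. It is an illustration only, not Google's implementation: the `AccountRecord` type, the `lookup_for_auth` function, and the 30-second threshold are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass


class StaleDataError(Exception):
    """Raised when account data is too old to trust for authentication."""


@dataclass
class AccountRecord:
    user_id: str
    last_synced: float  # time (seconds) the replica last confirmed this record


# Hypothetical freshness threshold, not a value from Google's report.
MAX_STALENESS_SECONDS = 30.0


def lookup_for_auth(record: AccountRecord, now: float) -> str:
    """Return the user ID only if the record is fresh enough to trust."""
    age = now - record.last_synced
    if age > MAX_STALENESS_SECONDS:
        # Fail closed: refusing the request is safer than authenticating
        # against credentials that may have changed since the last sync.
        raise StaleDataError(f"record for {record.user_id} is {age:.0f}s old")
    return record.user_id
```

Under this design, a replica that cannot refresh its data stops answering rather than risk serving revoked credentials, which is why stale reads surface to users as errors instead of successful logins.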
"Google uses an evolving suite of automation tools to manage the quota of various resources allocated for services," the company said in an issue summary report published today.
"As part of an ongoing migration of the User ID Service to a new quota system, a change was made in October to register the User ID Service with the new quota system, but parts of the previous quota system were left in place which incorrectly reported the usage for the User ID Service as 0.
"An existing grace period on enforcing quota restrictions delayed the impact, which eventually expired, triggering automated quota systems to decrease the quota allowed for the User ID service and triggering this incident."
Although safety checks are in place to prevent unplanned quota changes, they did not cover the scenario of zero reported load for a single service.
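The interaction between the grace period, the zero usage reading, and the missing safety check can be sketched in a few lines of Python. This is a simplified model built on assumptions: `QuotaState`, `next_quota`, the `HEADROOM` multiplier, and the `guard_zero_usage` flag are hypothetical and do not describe Google's actual quota tooling.

```python
from dataclasses import dataclass


@dataclass
class QuotaState:
    allocated: int           # units currently granted to the service
    reported_usage: int      # usage as seen by the quota system
    grace_expires_at: float  # enforcement is deferred until this time


# Hypothetical headroom applied on top of reported usage.
HEADROOM = 1.5


def next_quota(state: QuotaState, now: float, guard_zero_usage: bool) -> int:
    """Return the quota the automation would set at time `now`."""
    if now < state.grace_expires_at:
        # Grace period: keep the current allocation regardless of usage.
        return state.allocated
    if guard_zero_usage and state.reported_usage == 0 and state.allocated > 0:
        # A service holding an allocation but reporting zero load is more
        # likely a reporting fault than an idle service, so change nothing.
        return state.allocated
    # Shrink the quota toward reported usage plus headroom, never grow it.
    return min(state.allocated, int(state.reported_usage * HEADROOM))
```

With `QuotaState(allocated=1000, reported_usage=0, grace_expires_at=100.0)`, the function returns 1000 while the grace period lasts, drops to 0 once it expires if the guard is off, and stays at 1000 if the guard is on.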
"As a result, the quota for the account database was reduced, which prevented the Paxos leader from writing," Google added. "Shortly after, the majority of read operations became outdated which resulted in errors on authentication lookups."
Google said that this major outage also affected the company's internal users and tools, causing delays during the outage investigation and the reporting of status updates.
Gmail affected by a second outage within a single day
After the authentication issues were resolved on Monday, Gmail was hit by a second outage lasting a combined total of roughly 7 hours, during which a subset of Gmail users experienced email delivery issues.
"The error message indicated that the email address did not exist, and as a result, the impacted emails were never delivered," Google said in another report published today. "Affected senders may have received a bounce email generated by an intermediate SMTP service."
"In some cases, the full SMTP error message was quoted in the bounce email. The behavior of these messages depended on the external SMTP clients connecting to the Google SMTP service."
The cause of this second outage was an ongoing migration to update the underlying configuration system of Gmail's SMTP inbound service.
"A configuration change during this migration shifted the formatting behavior of a service option so that it incorrectly provided an invalid domain name, instead of the intended 'gmail.com' domain name, to the Google SMTP inbound service," Google said.
"As a result, the service incorrectly transformed lookups of certain email addresses ending in "@gmail.com" into non-existent email addresses.
"When the Gmail user accounts service checked each of these non-existent email addresses, the service could not detect a valid user, resulting in SMTP error code 550."