
Unexpected MySQL database meltdown fingered in GitHub's 24-hour website wobble

Days since last TITSUP (Total Inability To Support Users' Pulls) reset to zero

Updated Programmers, your snow day is well and truly over: GitHub's website has finally emerged from its 24-hour outage, and the biz reckons everything is operating normally again.

Last year, CEO Chris Wanstrath said the company was shooting for zero downtime, or at least five-nines of uptime. That works out to roughly five minutes of non-service a year, and it's safe to say GitHub has blown well past it.
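
For the curious, that five-nines figure is simple arithmetic. Here's a quick back-of-the-envelope check in Python (our sums, not anything GitHub publishes):

# "Five nines" means 99.999 per cent availability.
MINUTES_PER_YEAR = 365.25 * 24 * 60            # about 525,960 minutes
availability = 0.99999
downtime_budget = MINUTES_PER_YEAR * (1 - availability)
print(f"Allowed downtime: {downtime_budget:.2f} minutes a year")   # roughly 5.26
print(f"This outage: {24 * 60} minutes")                           # 1,440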

GitHub, Microsoft's hoped-for $7.5bn acquisition, first reported “elevated error rates” on its website at 4pm US Pacific Time on Sunday, followed by intermittent service until an “all clear” message arrived almost exactly 24 hours later at 4pm PT on Monday. For UK users, that was an all-of-Monday outage, since it ran from 11pm Sunday to 11pm Monday; Australian users barely had time to log in on Monday before the site tripped over at 10am.

As we reported yesterday, the backend git services were working, but the website was frozen in time, serving out-of-date code repos and ignoring submitted material, Gists, and bug reports.

The collapse was attributed to a data storage system that died, understood to be one or more MySQL database servers. Now that things have returned to normal, GitHub's incident report has explained the problems, which started with “a network partition and subsequent database failure resulting in inconsistent information being presented on our website.”

To stop the errors propagating to repositories, GitHub said it decided to pause webhook events “and other internal processing systems.” That much worked, at least: the incident report claimed the outage “only impacted website metadata stored in our MySQL databases, such as issues and pull requests. Git repository data remains unaffected and has been available throughout the incident.”
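
GitHub hasn't published the machinery involved, but the pattern is a familiar one: webhook deliveries queue up behind an operator-controlled pause switch, so nothing questionable gets pushed to the outside world while the databases are suspect, and nothing is lost either – hence the backlog to chew through afterwards. Here is a minimal sketch of the idea in Python, with entirely hypothetical names and no claim to resemble GitHub's actual code:

import queue
import threading
import time

# Hypothetical sketch of pausing webhook fan-out during an incident.
webhook_queue = queue.Queue()         # events keep arriving and piling up here
delivery_paused = threading.Event()   # operators set this flag to halt deliveries

def enqueue_event(event):
    """Producers keep recording events even while delivery is paused."""
    webhook_queue.put(event)

def delivery_worker():
    while True:
        if delivery_paused.is_set():
            time.sleep(1)             # back off rather than push possibly stale data
            continue
        event = webhook_queue.get()
        send_webhook(event)           # stand-in for the real HTTP delivery

def send_webhook(event):
    print(f"delivering {event}")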

An hour after the databases started playing up, admins tried failing over to a backup data storage system, but that didn't work. Three hours after the site went off the rails, the status message changed from announcing a migration of the data storage systems to “we continue to work to repair” the knackered storage backend.

The restoration of its databases took “longer than we anticipated,” and after that, the repair work had to be validated, and a huge backlog of events – Pages builds and webhooks, for instance – had to be processed.

The last “error” message was posted at 1523 PT on Monday (2223 UTC, 0923 AEDT Tuesday), and the welcome “everything operating normally” arrived 40 minutes later. ®

Updated to add

The MySQL database meltdown was sparked by a dodgy network link, which left data stores in inconsistent states.
