
Got this email just now.

- - - -

Hello,

We’re contacting you about an ongoing outage with the Mandrill app. This email provides background on what happened and how users are affected, what we’re doing to address the issue, and what’s next for our customers.

What happened
Mandrill uses a sharded Postgres setup as one of our main datastores. On Sunday, February 3, at 10:30pm EST, 1 of our 5 physical Postgres instances saw a significant spike in writes. The spike in writes triggered a Transaction ID Wraparound issue. When this occurs, database activity is completely halted. The database sets itself in read-only mode until offline maintenance (known as vacuuming) can occur.

The database is large—running the vacuum process takes a significant amount of time and resources, and there’s no clear way to track progress.

Customer impact
The impact to users could come in the form of not tracking opens, clicks, bounces, email sends, inbound email, webhook events, and more. Right now, it looks like the database outage is affecting up to 20% of our outbound volume as well as a majority of inbound email and webhooks.

What we’re doing to address this
We don’t have an estimated time for when the vacuum process and cleanup work will be complete. While we have a parallel set of tasks going to try to get the database back in working order, these efforts are also slow and difficult with a database of this size. We’re trying everything we can to finish this process as quickly as possible, but this could take several days, or longer. We hope to have more information and a timeline for resolution soon.

In the meantime, it’s possible that you may see errors related to sending and receiving emails. We’ll continue to update you on our progress by email and let you know as soon as these issues are fully resolved.

What’s next
We apologize for the disruption to your business. Once the outage is resolved, we plan to offer refunds to all affected users. You don’t need to take any action at this time—we’ll share details in a follow-up email and will automatically credit your account.

Again, we’re sorry for the interruption and we hope to have good news to share soon.
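
(For what it's worth on the "no clear way to track progress" point: on Postgres 9.6 and later, pg_stat_progress_vacuum gives at least phase-level visibility into a running VACUUM, assuming you can still connect and the vacuum is running in a regular backend rather than single-user mode. A minimal sketch using psycopg2, with a placeholder connection string:)

  # Check progress of any running VACUUM via pg_stat_progress_vacuum (Postgres 9.6+).
  import psycopg2

  conn = psycopg2.connect("host=localhost dbname=postgres user=postgres")  # placeholder DSN
  conn.autocommit = True  # so repeated polls see fresh data
  cur = conn.cursor()
  cur.execute("""
      SELECT relid::regclass AS table_name,
             phase,
             heap_blks_scanned,
             heap_blks_total
      FROM pg_stat_progress_vacuum;
  """)
  rows = cur.fetchall()
  if not rows:
      print("no vacuum currently running")
  for table_name, phase, scanned, total in rows:
      pct = 100.0 * scanned / total if total else 0.0
      print(f"{table_name}: {phase}, {scanned}/{total} heap blocks scanned ({pct:.1f}%)")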




If you're looking for a good series of blog posts about xid wraparound in Postgres, check out these posts by Josh Berkus:

http://www.databasesoup.com/2012/09/freezing-your-tuples-off...

http://www.databasesoup.com/2012/10/freezing-your-tuples-off...

http://www.databasesoup.com/2012/12/freezing-your-tuples-off...

And this more recent one by Robert Haas:

https://rhaas.blogspot.com/2018/01/the-state-of-vacuum.html

As Josh states at the end of the third post, the current best practices for dealing with this are really workarounds, and as Robert states, it requires monitoring and management. Postgres is an amazing piece of software and managing this is doable, but IMHO this is one of Postgres' worst warts. It would be awesome if someone could donate some funding to improve this.
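
Roughly, the workaround those posts describe is pre-emptive freezing: find the tables whose rows are oldest in XID terms and VACUUM (FREEZE) them during quiet hours, so autovacuum never has to do an emergency anti-wraparound pass. A rough sketch (psycopg2; the connection string and threshold are placeholders):

  # Pre-emptive freezing sketch: freeze the oldest tables before autovacuum is forced to.
  import psycopg2

  conn = psycopg2.connect("host=localhost dbname=app user=postgres")  # placeholder DSN
  conn.autocommit = True  # VACUUM cannot run inside a transaction block
  cur = conn.cursor()

  # age(relfrozenxid) = how many XIDs old the oldest unfrozen row in each table is.
  cur.execute("""
      SELECT c.oid::regclass AS table_name, age(c.relfrozenxid) AS xid_age
      FROM pg_class c
      WHERE c.relkind = 'r'
      ORDER BY age(c.relfrozenxid) DESC
      LIMIT 5;
  """)
  for table_name, xid_age in cur.fetchall():
      if xid_age > 150_000_000:  # arbitrary threshold, below autovacuum_freeze_max_age's default
          print(f"freezing {table_name} (xid age {xid_age})")
          cur.execute(f"VACUUM (FREEZE, VERBOSE) {table_name};")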


My admittedly very superficial understanding of this issue is that the most common way to run into the xid wraparound problem is tuning autovacuum in the wrong direction. You notice that vacuum is taking up a lot of your server's resources, so you decrease its frequency. Or you notice that it can't really keep up, but don't tune it to be more aggressive or give it enough resources to do its job. Or you don't monitor this at all, which is a pretty bad idea if you do billions of transactions (with fewer you can't really hit this issue).
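
To make the "right direction" concrete: the usual fix is to let autovacuum do more work per cycle and to trigger it sooner on hot tables, not to back it off. A sketch of what that can look like (psycopg2; the table name, DSN, and numbers are illustrative, not recommendations):

  # Make autovacuum MORE aggressive instead of throttling it.
  import psycopg2

  conn = psycopg2.connect("host=localhost dbname=app user=postgres")  # placeholder DSN
  conn.autocommit = True  # ALTER SYSTEM cannot run inside a transaction block
  cur = conn.cursor()

  # Cluster-wide: let each autovacuum round do more I/O before it sleeps.
  # These settings are reloadable, so pg_reload_conf() is enough.
  cur.execute("ALTER SYSTEM SET autovacuum_vacuum_cost_limit = 2000;")
  cur.execute("ALTER SYSTEM SET autovacuum_vacuum_cost_delay = '2ms';")
  cur.execute("SELECT pg_reload_conf();")

  # Per table: vacuum a hot, write-heavy table after a smaller fraction of dead rows.
  cur.execute("""
      ALTER TABLE events SET (
          autovacuum_vacuum_scale_factor = 0.02,
          autovacuum_analyze_scale_factor = 0.01
      );
  """)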

This is also a problem that gets far harder to fix once you've run into it. If you have enough transaction volume to potentially hit this, you need to monitor autovacuum and make adjustments early, before you get close to wraparound. If you don't, you suddenly have to do all of the vacuum work at once, and writes are blocked until it's done.
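
A minimal sketch of the monitoring half of that (the threshold and connection string are placeholders; the point is just to watch age(datfrozenxid) long before it approaches the ~2 billion hard limit):

  # Early warning: how close is each database to transaction ID wraparound?
  import psycopg2

  WARN_AT = 500_000_000  # arbitrary alert threshold, well below the ~2 billion hard limit

  conn = psycopg2.connect("host=localhost dbname=postgres user=postgres")  # placeholder DSN
  cur = conn.cursor()
  cur.execute("""
      SELECT datname, age(datfrozenxid) AS xid_age
      FROM pg_database
      ORDER BY age(datfrozenxid) DESC;
  """)
  for datname, xid_age in cur.fetchall():
      status = "WARN" if xid_age > WARN_AT else "ok"
      print(f"{status}  {datname}: oldest unfrozen XID is {xid_age:,} transactions old")
  conn.close()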


Why does one shard being down take out all of their inbound functionality? I'm struggling to understand the purpose of sharding if you can't pull a node offline and replace it while you deal with the wraparound. Is it part of Postgres that if one shard has an issue, the entire cluster goes into read-only mode?


> I'm struggling to understand the purpose of sharding if you can't pull a node offline and replace it while you deal with the wraparound.

This is not usually the purpose of sharding, though. Having a replica of each node or each block of data (and a good failover system) is what would allow you to pull a node offline with no impact. It's worth pointing out, though, that even if they had a replica of this node, the replica would probably have hit XID wraparound at the same time, so that wouldn't have helped.

Sharding usually means partitioning the data so that different data goes to different nodes. In this case that's consistent with 20% of outbound emails being affected if 1 of the 5 shards is down.
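
To make the 20% arithmetic concrete, here's a toy illustration of that kind of routing (the shard list and key are made up). Each key hashes to one node, so losing one of five nodes affects roughly a fifth of the keys, and it's a replica plus failover, not the sharding itself, that would let you swap the bad node out:

  # Toy hash-based shard routing across 5 Postgres nodes.
  import zlib

  SHARD_DSNS = [  # placeholder connection strings
      "host=pg-shard-0 dbname=mail",
      "host=pg-shard-1 dbname=mail",
      "host=pg-shard-2 dbname=mail",
      "host=pg-shard-3 dbname=mail",
      "host=pg-shard-4 dbname=mail",
  ]

  def dsn_for(key: str) -> str:
      """Pick a shard deterministically from a key (e.g. a message or account id)."""
      return SHARD_DSNS[zlib.crc32(key.encode()) % len(SHARD_DSNS)]

  # The same key always lands on the same node, so if one node is down,
  # only the ~20% of keys that hash to it are affected.
  print(dsn_for("message-12345"))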

There are definitely some red flags in how they're using it, though. Ideally only ~20% of inbound email and webhook events would be affected as well, but they said a majority are. Ideally you couldn't get into a situation where one shard receives far more writes than the others in the first place. And of course, ideally you'd be monitoring XID age and able to respond well in advance. I'd be interested to read a more detailed writeup, though based on some of the comments here about their lack of transparency it seems unlikely that one will be released.



Ha! For once it’s not MongoDB but Postgres. I wonder why sending is affected though. Can’t they run their service with an empty database in the meantime?


This seems to be the same issue as the one discussed in another front-page article: https://andreas.scherbaum.la/blog/archives/970-How-long-will... / https://news.ycombinator.com/item?id=19082944


I use Mandrill and haven't received any status email from them.


If you care about scalability and availability simultaneously, I'm not sure in these modern times why you would use a relational database. When they fail, they fail catastrophically and are difficult to recover, as this failure event (and the never-ending stream of failure events posted to HN) demonstrates.

Don't get me wrong--I love relational databases and they are amazing pieces of technology. But they are incredibly hard to "do right" at scale while maintaining availability SLAs.

edit:

I would appreciate it if downvoters would explain their decision to downvote, so that if I'm incorrect I can at least update my beliefs. My position is based on years of experience watching relational databases maintained by professional DBAs fail catastrophically in strange ways and then take a long time to recover, causing complete blackouts. And I have yet to see such failures in managed NoSQL DBs like DynamoDB.



What is your point? It was a 6-hour brownout, not a 30+ hour blackout, and it is very unlikely that this kind of outage will happen again for DynamoDB. How likely is it that someone else will run into transaction ID wraparound again? If it's such a well-known issue, then presumably it keeps happening to a lot of people.


Transaction ID wraparound is a very well-known issue and easy to avoid with autovacuuming.

Relational databases are tried and true; we have learned from their failures and have only made the technology better.

There are many use cases where, from a data modeling perspective, a relational DB makes more sense than NoSQL, and you really have to understand the trade-offs around consistency and durability too. There will always be a place for both technologies, and it's not a question of either/or but rather what makes sense for your application in terms of not only system scalability but data scalability.


I'm not saying that you should never use relational databases. But if you are running at a large scale and have tight availability SLAs...then consider not using relational databases.

The fact that transaction ID wraparound is so well known is itself a red flag--apparently a lot of people have run into this issue, and yet it keeps being an issue. The blast radius is very large and the recovery is painful, as shown here by Mandrill. You should think twice before accepting that risk if you value your uptime.

If you want to become an expert on all these pitfalls and caveats of running relational databases at scale, at the expense of your availability and customer satisfaction--then by all means continue using relational databases. For many use cases, there are better options with better failure resiliency and recovery stories.



