It's been an interesting morning

js10@reddthat.com · 1 year ago

merc@sh.itjust.works · 1 year ago

you can achieve a lot with a live/live system, or a 3 node system with a master election, or…

“A lot”, sure, but not, say, 5 nines. 99.9% (about 8.8 hours of downtime per year) is reasonable. That’s enough time to fire up an instance in another location if that turns out to be necessary.

99.99% (about 53 minutes of downtime per year) is harder. It means you need automatic systems doing the switchover, geographical separation, and people on call 24/7 to diagnose and fix any issue in minutes.

99.999% is only 5 minutes of downtime per year. At that rate, you can’t even afford for someone on call to respond. You do still want them on call to verify the automated systems did the work, but you need to rely on automated systems fully handling any possible emergency. The system needs to fail over perfectly without any human intervention. For that, a 3 node system isn’t enough. You need geographical redundancy, as well as redundancy within each geographic region. You need to be able to do software upgrades without affecting that redundancy, so you need at least a secondary 3-node system so that you can do a blue/green deployment, testing out handing over traffic to the new system with the ability to instantly roll back if something doesn’t work.

Each “nine” you add reduces the “error budget” by a factor of 10, so as you start getting above 4/5 nines, you really do start to need specialized engineering which tends to come with high cost and complexity.
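To make the arithmetic concrete, here is a rough sketch of the yearly downtime budget per number of nines. The function name is made up for illustration, and it assumes a 365.25-day year and that "N nines" means an availability of 1 - 10^-N.

```python
# Assumption: a year is 365.25 days, i.e. 525,960 minutes.
MINUTES_PER_YEAR = 365.25 * 24 * 60


def downtime_minutes_per_year(nines: int) -> float:
    """Allowed downtime per year, in minutes, for an availability of `nines` nines."""
    unavailability = 10 ** -nines  # e.g. 3 nines -> 0.001 (99.9% available)
    return MINUTES_PER_YEAR * unavailability


if __name__ == "__main__":
    for n in range(2, 6):
        minutes = downtime_minutes_per_year(n)
        print(f"{n} nines: {minutes:10.2f} min/year ({minutes / 60:7.2f} h/year)")
```

Under those assumptions, 3 nines comes out to a bit under 9 hours a year, and each extra nine divides the budget by 10.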

For a typical Lemmy instance, 3 nines is probably good enough. 2 nines might even be acceptable if people aren’t paying. But, for something like Netflix, 8 hours of downtime per year is far too much. For something like a high frequency trading platform, 8 nines might not even be enough. For them, the custom engineering and obscene cost of chasing 7+ nines is worth it because every second of downtime could cost millions.

wim@lemmy.sdf.org · 1 year ago

Agreed, but for many services 2 or 3 nines is acceptable.

For the cloud storage system I worked on, it wasn’t, and that had different setups for different customers, from a simple 3 node system (the smallest setup, mostly for customers trialing the solution) to a 3 geo setup with at least 9 nodes across 3 different datacenters.

For the financial system, we run a live/live/live setup, where we’re running a cluster in each of 3 different cloud operators, and the client is expected to know all of them and do the failover. That obviously requires a little more complexity on the client side, but in many cases developers or organisations control both sides anyway.
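As a minimal sketch of what "the client knows all of them and does failover" can look like: the endpoint URLs and the `send` callable below are hypothetical placeholders, not the actual system, and a real client would also want timeouts, backoff, and some care around retrying non-idempotent requests.

```python
from typing import Callable, Sequence

# Hypothetical endpoints, one per cloud operator (illustrative only).
ENDPOINTS = [
    "https://a.example.com",
    "https://b.example.com",
    "https://c.example.com",
]


def call_with_failover(
    send: Callable[[str], str],
    endpoints: Sequence[str] = ENDPOINTS,
) -> str:
    """Try each endpoint in order and return the first successful response.

    `send` is whatever performs the request against one endpoint; it is
    expected to raise ConnectionError or TimeoutError when that endpoint
    is unreachable.
    """
    errors = []
    for endpoint in endpoints:
        try:
            return send(endpoint)
        except (ConnectionError, TimeoutError) as exc:
            # Remember the failure and move on to the next cluster.
            errors.append(f"{endpoint}: {exc}")
    raise ConnectionError("all endpoints failed: " + "; ".join(errors))
```

Passing the transport in as `send` keeps the failover loop independent of any particular HTTP or RPC library.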

Netflix is obviously at another scale. I can’t comment on what their needs are or how their solution looks, but I think it’s fair to say they are an exceptional case.