What is the failure rate of cloud instances at Amazon, Azure, Google?
I have looked for specific numbers – but so far found just aggregate “nines” for cloud regions or availability zones. So my anecdotal response is “for the most part, a REAL long time”. It is not unusual for us to find customers’ Cohesive network controllers running for years without any downtime. I think the longest we have seen so far is six years of uptime.Â
So – with generally strong uptimes for instance-based solutions, and solid HA and recovery mechanisms for cloud instances – how much premium should you spend on some of the most basic “cloud platform solutions”?
Currently cloud platforms are charging a significant premium for some very basic services which do not perform that differently, and in some cases I would argue less well than instance-based solutions; either home-grown or 3rd-party vended.Â
Let’s look at a few AWS examples:
- NAT-Gateway 4.5 cents per hour plus a SIGNIFICANT data tax of 4.5 cents per gigabyte
- Transit Gateway VPC Connection 5 cents per hour for each VPC connection plus a HEALTHY data tax of 2 cents per gigabyte
- AWS Client VPN $36.50 per connected client (on a full-time monthly basis), $72 per month to connect those VPN users to a single subnet! (AWS does calculate your connected client costs at 5 cents per hour, but since we should all basically be on VPNs at all times, how much will this save you?)
NOTE: The items I call “data taxes” are on top of the cloud outbound network charges you pay (still quite hefty on their own).Â
If you are using cloud at scale, depending on the size of your organization, the costs of these basic services get really big, really fast. At Cohesive we have customer’s that are spending high six figures, and even seven figures in premium on these types of services. The good news is for a number of those customers it is increasingly “were spending”, as they move to equally performant, more observable, instance-based solutions from Cohesive.
Here is a recent blog post from Ryan at Cohesive providing an overview of Cohesive NATe nat-gateway instances versus cloud platforms. For many, a solution like this seem to meet the need.Â
Although – I think Ryan’s post may have significantly underestimated the impact of data taxes. https://twitter.com/pjktech/status/1372973836539457547
So you say “Yes, instance availability is really good, but what about [fill in your failure scenarios here] ?”
Depending on how small your recovery windows need to be, there are quite a range of HA solutions to choose among. Here are a few examples:
- Protect against fat-finger termination, automation mistakes with auto-scale groups of 1, and termination protection
- Use AWS Cloud Watch and EC2 Auto Recovery to protect against AWS failures
- Run multiple instances and add in a Network Load Balancer for still significant savings
- Use Cohesive HA plugin allowing one VNS3 Controller instance to take over for another (with proper IAMs permissions)
Overall, this question is a “modern” example of the “all-in” vs. “over-the-top” tension I wrote about in 2016 still available on Medium. More simply put now, I think of the choice as being when do you run “on the cloud” and when do you choose to run “in the cloud”, and ideally it is not all or none either way.
In summary, given how darn good the major cloud platforms are at their basic remit of compute, bulk network transport, and object storage, do you need to be “in the cloud” at a significant expense premium, or can you be “on the cloud” for significant savings at scale for a number of basic services?