Cloud Costs and the “Economically Defined Architecture”​

FinOps, Cloud Costs, Repatriation, and most recently the in-depth post by Martin Casado and Sarah Wang from a16z on the cloud/cost paradox are all topics in the news reflecting that we have reached the “Holy Cow, I am spending a ton on cloud!” inflection point.

As I discussed in previous posts, there are key distinctions between running “in the cloud” and “on the cloud”; the ability to tune, tweak and manage costs is one of them. At the end of the day the Cloud Service Providers are businesses out to maximize revenue, not missionaries proselytizing a new design center. Yes, cloud is the newest and increasingly dominant design center, but never forget that cloud service providers are businesses out to maximize revenue.

They are experts at presenting to their customers the “economically defined architecture”. Through their documentation, training, certifications, SKU-ing, and feature segmentation they guide the cloud customer to acceptable deployment architectures that are recognizable as best practice. In fact, due to the consistency of the aforementioned assets (training, docs, etc.) they have most likely raised the bar in terms of overall deployment quality for many enterprises. All that said, the amalgam of capabilities and the manner in which they are deployed ALSO maximize their revenue. This is not surprising.

When you create a deployment architecture guided by cloud service provider user interface wizards and devops automation template best practices, it is quite possible you are getting a number of constructs your application doesn’t need, despite the fact that some applications might need them. Is this a global scale application and you hope to be the new Netflix? Or is this a supply chain application shared by 12 partners in the midwestern automated sprinkler space?

The cloud service providers are making specific architectural recommendations that easily convert to IT deployment decisions and that are integrated with their pricing strategies. It is a level of alignment probably not seen in the previous design cycle, where you would hire a consultancy to help with a data center deployment of vendor product XYZ.

Cohesive has provided cloud connectivity and security, helping customers get to, through and across the clouds, since 2008. As such we have seen a lot of both customer migrations and greenfield development. There is an array of capabilities many of these applications do NOT need. To put it another way, “not all of them need all of” cloud vendor-provided autoscale, load balancers, direct connects, transit gateways, NAT gateways, BGP, complex service roles/permissions, global network funnels, lambdas, etc. Yet, due to a combination of the path of least resistance, peer conformity and even “cool kid behavior”, your team might be reluctant to dispense with any of these constructs. Practically speaking, there is a strong chance that declining default practices and recommendations could create personal risk for them. What if there is an issue, and the initial blowback narrative of an outage is “well Bob said we didn’t need a cloud load balancer”? Even if the root cause analysis shows it wasn’t the load balancing or lack thereof that caused issues, does “Bob” ever get truly “cleared of the charges”?

At the end of the day cloud service providers are extracting billions with a “B” from their customers for what are without argument high quality deployment architectures. But those architectures are possibly more complex than what many customers need. Even if you set aside the runtime costs of those components, and some of the inherent complexity, there is another “gotcha”, which is “data taxes”. Increasingly, newer offerings carry both runtime costs (price per hour or price per invocation) and data taxes added on top of the existing cloud egress charges. This is where 3rd party offerings from vendors like Cohesive (certainly not limited to us) can make a substantial cost difference. Our view on the data taxes is that they are an awful lot of icing on some pretty thin pieces of cake.

I do want to note that our whole business has been built around cloud, and we ourselves use a lot of it, in fact too much of it from a pure costing point of view. We trade off speed for $$$, and while sometimes hard to quantify, we believe in the payoff. Cloud lets us move fast, and when you do so you leave bits of cost lying about via both neglect and opportunism. For the most part we do this with eyes wide open.

That to us is the crux of it. Controlling cloud costs requires a clear-headed view by IT management of what their options are. What is the cost of conformity to the cloud service provider's economically defined architecture? What is the cost of non-conformity? If you are trying to save money, increase your non-conformity. If you are trying to reduce a measure of both actual and perceived risk, increase your conformity. BUT, when you are trying to figure out how to cut cloud spend by seven figures (or a multiple of that), understand the structure of the tradeoffs you need to make.

Cloud Instance Quality vs. Cloud Platform Cost-at-Scale

What is the failure rate of cloud instances at Amazon, Azure, Google?

I have looked for specific numbers – but so far found just aggregate “nines” for cloud regions or availability zones. So my anecdotal response is “for the most part, a REAL long time”. It is not unusual for us to find customers’ Cohesive network controllers running for years without any downtime. I think the longest we have seen so far is six years of uptime. 

So – with generally strong uptimes for instance-based solutions, and solid HA and recovery mechanisms for cloud instances – how much premium should you spend on some of the most basic “cloud platform solutions”?

Currently cloud platforms are charging a significant premium for some very basic services which do not perform that differently from, and in some cases I would argue less well than, instance-based solutions, whether home-grown or 3rd-party vended.

Let’s look at a few AWS examples:

  • NAT-Gateway 4.5 cents per hour plus a SIGNIFICANT data tax of 4.5 cents per gigabyte
  • Transit Gateway VPC Connection 5 cents per hour for each VPC connection plus a HEALTHY data tax of 2 cents per gigabyte
  • AWS Client VPN $36.50 per connected client (on a full-time monthly basis), $72 per month to connect those VPN users to a single subnet! (AWS does calculate your connected client costs at 5 cents per hour, but since we should all basically be on VPNs at all times, how much will this save you?)

NOTE: The items I call “data taxes” are on top of the cloud outbound network charges you pay (still quite hefty on their own). 

If you are using cloud at scale, depending on the size of your organization, the costs of these basic services get really big, really fast. At Cohesive we have customers that are spending high six figures, and even seven figures, in premium on these types of services. The good news is that for a number of those customers it is increasingly “were spending”, as they move to equally performant, more observable, instance-based solutions from Cohesive.

Here is a recent blog post from Ryan at Cohesive providing an overview of Cohesive NATe nat-gateway instances versus cloud platforms. For many, a solution like this seems to meet the need.

Although – I think Ryan’s post may have significantly underestimated the impact of data taxes.

    So you say “Yes, instance availability is really good, but what about [fill in your failure scenarios here] ?”

    Depending on how small your recovery windows need to be, there are quite a range of HA solutions to choose among. Here are a few examples:

    • Protect against fat-finger termination and automation mistakes with auto-scale groups of 1 and termination protection
    • Use AWS CloudWatch and EC2 Auto Recovery to protect against AWS failures
    • Run multiple instances and add in a Network Load Balancer for still significant savings
    • Use the Cohesive HA plugin, allowing one VNS3 Controller instance to take over for another (with proper IAM permissions)

    Overall, this question is a “modern” example of the “all-in” vs. “over-the-top” tension I wrote about in 2016 (still available on Medium). More simply put, I now think of the choice as when to run “on the cloud” and when to run “in the cloud”, and ideally it is not all or none either way.

    In summary, given how darn good the major cloud platforms are at their basic remit of compute, bulk network transport, and object storage, do you need to be “in the cloud” at a significant expense premium, or can you be “on the cloud” for significant savings at scale for a number of basic services?

    NATe: A Tax-Free Alternative to Cloud NAT Gateways

    Whether you need to connect multiple cloud instances, communicate with the public internet from private resources, or directly connect to instances in local data centers, chances are you will be using Network Address Translation (NAT) to make that connection. All major cloud providers offer some product or service for NAT functionality, and some platforms even provide separate public and private variants. Because cloud instances running in private subnets are unable to access resources like time servers, webpages, or OS repositories without NAT functionality, most users find themselves relying on their cloud platform’s NAT offerings. By simply following their cloud providers’ recommended best practices, users are overpaying for an overcomplicated and inflexible service that performs a function a home cable modem does for free. So why pay so much for such a simple network function?

    If You’re Using Cloud Platform NAT Gateway(s), You’re Overspending on Cloud Deployments.

    Overspending of any kind in the wake of the economic disruption caused by the COVID-19 pandemic can be deadly for any business. Yes, some have fared better than others during this challenging time but all organizations have revisited projections and budgets in the face of uncertainty. According to Gartner, the pressure is on for budget holders to optimize costs.

    [Figure: share of enterprises that planned IT budget increases pre COVID-19 versus share of enterprises that expect IT budget decreases post COVID-19]

    Where to Start?

    Look to the sky! Your cloud bill is likely full of opportunities for savings, especially if your application relies on NAT functionality. Using AWS NAT Gateway pricing as an example, let’s start with the comparative base subscription costs:

                             AWS NAT Gateway    VNS3 NATe
    Subscription             $0.045 / hour      $0.01 / hour*
    Data Processing (TAX)    $0.045 / GB        $0.00 / GB

    * Price includes runtime fees (on-demand t3.nano $0.0052 / hr) + NATe subscription ($0.005 / hr)

    As you can see from this example, the standalone subscription cost of an AWS NAT gateway is more than the cost of a single t3.medium instance. The already low VNS3 NATe subscription cost will provide you even more savings when you consider the fact that you don’t have to create as many individual NAT gateways, each of which would be  accompanied by an additional AWS NAT Gateway subscription. The cost differential here makes NATe an obvious choice at any deployment scale and we even offer a free NATe license for smaller deployments.

    VNS3 NATe is also incredibly scalable because we don’t increase our data processing rates as your bandwidth needs scale.  Below is a pricing table that shows the total cost of running a single NAT Gateway vs a VNS3 NATe instance as the traffic throughput increases in a given month:

    GB / Month    AWS NAT Gateway    VNS3 NATe
    1             $32.45             $7.20
    10            $32.85             $7.20
    100           $36.90             $7.20
    1,000         $77.40             $7.20
    5,000         $257.40            $7.20
    10,000        $482.40            $7.20
    50,000        $2,282.40          $7.20
    100,000       $4,532.40          $7.20
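
For anyone who wants to plug in their own traffic numbers, here is a minimal sketch of the arithmetic behind the table above. It assumes a 720-hour (30-day) month, which is what reproduces the published figures, and uses the list prices quoted in this post; the function names are ours and purely illustrative.

```python
from decimal import Decimal, ROUND_HALF_UP

HOURS_PER_MONTH = 720  # assumption: 30-day month, which matches the table


def monthly_cost(hourly_rate, per_gb_rate, gb):
    """Full month of hourly charges plus the per-GB data processing charge."""
    total = Decimal(hourly_rate) * HOURS_PER_MONTH + Decimal(per_gb_rate) * gb
    return total.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


def nat_gateway(gb):
    # $0.045 / hour subscription plus $0.045 / GB data processing
    return monthly_cost("0.045", "0.045", gb)


def nate(gb):
    # $0.01 / hour (t3.nano runtime + NATe subscription), no per-GB charge
    return monthly_cost("0.01", "0.00", gb)


for gb in (1, 10, 100, 1_000, 5_000, 10_000, 50_000, 100_000):
    print(f"{gb:>7,} GB   NAT Gateway ${nat_gateway(gb):>9,.2f}   NATe ${nate(gb):,.2f}")
```

Neither column includes the normal cloud egress charges, which apply in both cases.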

    We also have customers who maintain 100s or 1000s of VPCs with NAT requirements of 1-100 GB per month. Those enterprise cloud customers at scale have typically seen costs drop to about 1/5 of what they would pay for AWS NAT Gateways. To illustrate these savings, take the example of one of our customers who has 1800 VPCs, each with a NAT Gateway. The data processed through each of these NAT Gateways is low, averaging 10 GB / month; deployments that pass more traffic out the NAT device have much more potential for savings.

                             AWS NAT Gateway    VNS3 NATe
    Monthly Runtime          $58,320            $12,960*
    Data Processing (TAX)    $810               $0

    * Price includes runtime fees (on-demand t3.nano $0.0052 / hr) + NATe subscription ($0.005 / hr)

    Total NATe savings in this case are roughly $46K per month, or $554K per annum.
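
The fleet numbers above can be checked with a few lines of arithmetic. This is a rough sketch, again assuming a 720-hour month and the list prices quoted in this post; `fleet_costs` is a made-up helper name, not part of any product.

```python
from decimal import Decimal


def fleet_costs(vpcs, gb_per_vpc, hourly, per_gb, hours=720):
    """Return (monthly runtime, monthly data processing) for one NAT device per VPC."""
    runtime = Decimal(hourly) * hours * vpcs
    data = Decimal(per_gb) * gb_per_vpc * vpcs
    return runtime, data


# 1800 VPCs, each passing about 10 GB / month through its NAT device
aws_runtime, aws_data = fleet_costs(1800, 10, "0.045", "0.045")
nate_runtime, nate_data = fleet_costs(1800, 10, "0.01", "0.00")

monthly_saving = (aws_runtime + aws_data) - (nate_runtime + nate_data)
annual_saving = monthly_saving * 12

print(f"AWS NAT Gateway: ${aws_runtime:,.0f} runtime + ${aws_data:,.0f} data")
print(f"VNS3 NATe:       ${nate_runtime:,.0f} runtime + ${nate_data:,.0f} data")
print(f"Saving:          ${monthly_saving:,.0f} / month, ${annual_saving:,.0f} / year")
```

Changing `gb_per_vpc` shows how quickly the gap widens once the per-gigabyte charge starts to dominate.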

    Of course, cost savings are not limited to just NAT Gateway spend. Other opportunities for savings include right sizing instances (latest generation instance families are typically less expensive), decommissioning unused services/resources (I’m looking at you, load balancers), and reviewing storage strategies (such as EBS).

    What is a NAT Gateway?

    A NAT Gateway is a network service that performs a simple network function: Network Address Translation for cloud-based servers running in a private network (private VPC subnet). Here is the AWS documentation detailing the NAT Gateway functionality. NAT Gateways perform a specific type of NAT called IP Masquerading, where devices in a private IP network use a single public IP associated with the gateway for communication with the public Internet.

    This is the same function that your home modem performs for free. You’re likely leveraging this NAT functionality as you read this post. Basically, the NAT functionality on a NAT Gateway or your home modem allows devices on a private network (computers, phones, TVs, refrigerators, toothbrushes, etc. in the case of your home network) to access the Internet and receive responses, but does not allow devices on the public Internet to initiate connections into your private network. All traffic sent from the private network to the public Internet uses the modem’s public IP address.

    NATe to the Rescue!

    In response to direct requests by our customers, we created a low-cost, instance-based alternative to NAT Gateways – VNS3 NATe.

    Available on AWS PM and Azure MP today:

    * No subscription premium, but total throughput is limited to 50 Mbps

    What is a NATe?

    NATe instances are drop-in replacements from Cohesive Networks for NAT Gateways. Simply launch in a VPC/VNET subnet with an Internet Gateway associated, Stop Src/Dst checking (enable IP forwarding), and update the Route Tables associated with the private Subnets to point destinations at the NATe instance-id.
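
On AWS, the last two steps can be scripted. The sketch below is one possible way to do it, not an official Cohesive procedure: it expects a boto3-style EC2 client and uses the `modify_instance_attribute` and `create_route` calls. The instance and route table IDs are placeholders, and the small `DryRunEC2` class is a stand-in we made up so the example runs without an AWS account.

```python
def route_private_subnets_via_nate(ec2, nate_instance_id, private_route_table_ids):
    """Stop Src/Dst checking on the NATe instance, then point each private
    route table's default route at it."""
    ec2.modify_instance_attribute(
        InstanceId=nate_instance_id,
        SourceDestCheck={"Value": False},
    )
    for route_table_id in private_route_table_ids:
        ec2.create_route(
            RouteTableId=route_table_id,
            DestinationCidrBlock="0.0.0.0/0",
            InstanceId=nate_instance_id,
        )


class DryRunEC2:
    """Hypothetical stand-in that records calls instead of contacting AWS."""

    def __init__(self):
        self.calls = []

    def modify_instance_attribute(self, **kwargs):
        self.calls.append(("modify_instance_attribute", kwargs))

    def create_route(self, **kwargs):
        self.calls.append(("create_route", kwargs))


# Placeholder IDs for illustration only.
client = DryRunEC2()
route_private_subnets_via_nate(
    client, "i-0123456789abcdef0", ["rtb-aaaa1111", "rtb-bbbb2222"]
)
for name, kwargs in client.calls:
    print(name, kwargs)
```

Against a real account you would pass `boto3.client("ec2")` instead of the stand-in. If a route table already has a default route (for example, to an existing NAT Gateway), `create_route` will be rejected and `replace_route` is the call to use instead.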


    NATe provides all the functionality of a NAT Gateway plus enterprise grade security and controls at a fraction of the cost. Some of the functional highlights of NATe include:

    • High Performance – run on the smallest instance sizes to maximize value or on larger instances for greater total throughput
    • Secure – access to a firewall to allow additional and orthogonal policy enforcement for traffic flows
    • Control – access logs, network tools like tcpdump, status information
    • Customize – leverage the Cohesive Networks Plugin system to add L4-L7 network services to the NATe instance like NIDS, WAF, Proxy, LB, etc.
    • Automate – fully automate the deployment of VNS3 NATe instances as part of your existing deployment framework leveraging the RESTful API to reduce implementation costs.
    • Failover – NATe can be configured in a number of HA architectures to provide the same level of insurance needed for critical infrastructure via instance auto recovery, auto scale groups, and Cohesive Networks’ own Peering and HA Container functionality
    • Upgrade – NATe is fully upgradeable to fully licensed VNS3 controllers deployed as a single application security controller or part of secure network edge mesh

    Still Not Convinced?

    Cohesive’s NATe offers a dramatically more cost-efficient solution to often critical NAT requirements in cloud deployments of all shapes and sizes. NATe is more flexible, more scalable, and easier to manage than first-party cloud NAT gateways that are charging you a premium for the functionality of a standard consumer modem. If you don’t believe us, we launched a free version of our NATe offering in both the AWS and Azure marketplaces so you can launch and configure them and see for yourself!

    Have questions about set-up or pricing? Please contact us.

    Managing DNS for Remote VPN Users in AWS Route53 with VNS3

    Managing DNS can be a fairly complex and daunting task. Installing and configuring BIND takes time and knowledge and requires maintenance. Infoblox is expensive and likely overkill for smaller projects. Cloud vendors like AWS have simplified offerings that are easy to use and scale with your needs. They offer public and private zone management with features like split horizon. Split horizon allows Domain Name Systems to provide different information based on the source address of the requestor. For example, if you are coming from the internet at large you would receive the public IP address of the named system you are looking up, but if you were in the same private subnet as that system you would receive its private IP address. This allows you to define how users get to systems depending on where they are.

    Let’s take the example of a remote VPN connection. With VNS3 People VPN you can easily connect your workforce to your cloud assets, be they across regions and/or vendors, giving you a secure entry point to your company’s computational resources. VNS3 makes it easy to push DNS settings to connected clients so that they are told that their DNS server is the address of the VNS3 security controller. So now we have connected clients making DNS calls to VNS3. But hold on, VNS3 isn’t a DNS server. Well, it can be through its plugin system, but that’s a different topic for another blog post. In this scenario we can divert all incoming DNS traffic through use of the VNS3 firewall.

    Cohesive Networks VNS3 Controller Connectivity
    Let’s say that our VNS3 overlay address space is <overlay-cidr>, which is what we are using for our remote VPN users, and our VPC is <vpc-cidr>. In this case there are two addresses that we care about: <vns3-virtual-ip>, the Virtual IP of the VNS3 security controller, and <vpc-dns-ip>, the AWS VPC Route53 Resolver or DNS endpoint. In AWS the DNS endpoint will always be the .2 address of your VPC address space. So our firewall rules will look like this:

    PREROUTING_CUST -i tun0 -p tcp -s <overlay-cidr> --dport 53 -j DNAT --to <vpc-dns-ip>
    PREROUTING_CUST -i tun0 -p udp -s <overlay-cidr> --dport 53 -j DNAT --to <vpc-dns-ip>

    Here we are saying that traffic coming in on the tun0 interface (overlay network) from <overlay-cidr> (overlay address space) bound for UDP and TCP port 53 (DNS) should be forwarded to <vpc-dns-ip> on UDP and TCP port 53 (AWS VPC DNS endpoint).
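
If you manage many controllers, the two rules are easy to generate. Here is a small sketch that derives the VPC DNS endpoint (the VPC base address plus two) from the VPC CIDR and prints both rules. The CIDR values are made-up examples, not the addresses used in this post, and the helper names are ours.

```python
import ipaddress


def vpc_dns_endpoint(vpc_cidr):
    """AWS reserves the VPC base address plus two for the VPC DNS resolver."""
    network = ipaddress.ip_network(vpc_cidr)
    return str(network.network_address + 2)


def dns_dnat_rules(overlay_cidr, vpc_cidr, interface="tun0"):
    """Build one DNAT rule per protocol, diverting overlay DNS to the VPC resolver."""
    target = vpc_dns_endpoint(vpc_cidr)
    return [
        f"PREROUTING_CUST -i {interface} -p {proto} -s {overlay_cidr} "
        f"--dport 53 -j DNAT --to {target}"
        for proto in ("tcp", "udp")
    ]


# Example values only: substitute your own overlay and VPC ranges.
rules = dns_dnat_rules("192.0.2.0/24", "10.0.0.0/16")
for rule in rules:
    print(rule)
```

For a VPC with more than one CIDR block, the resolver address comes from the primary block, so that is the one to pass in.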

    OK, so now that we have our remote VPN DNS requests being diverted to the VPC DNS endpoint, we need to configure our responses. In Route53 you can configure any zone name you want so long as it is private. For public zones you will need to own the domain name, but for private zones you can do what you want. This can be very useful where you might have a secure IPSec connection to a partner network and want to use DNS names that reflect your partner’s name and configure addresses across your tunnels. You can set up as many private zones as you want. Once they have been set up, it is just a matter of associating them with the VPC that your VNS3 security controller resides in. You will now have custom DNS naming for your remote workforce.

    Securely Federating Cloud Services with VNS3

    Service Endpoints are a great concept. They allow you to access things like S3 buckets in AWS from within your VPC without sending traffic outside of it. In fact, from a compliance perspective they are essential. Both Amazon Web Services and Microsoft Azure have them. One drawback in AWS is that they can only be accessed from the VPC in which they have been set up. But what if you wanted to access that S3 bucket securely from another region or from an Azure VNET? Perhaps you have an Azure SQL Data Warehouse that you want to access from your application running in AWS. Service Endpoints have their limitations. For many companies that are developing a multi-cloud, multi-region strategy, it’s not clear how to take advantage of this service. We at Cohesive Networks have developed a method that allows you to access these endpoints across accounts, regions and cloud providers. This blog post will discuss in detail how we achieve these ends.

    Using AWS Private Service Endpoints

    In order to interact with AWS Service Endpoints you need a few things. You need DNS resolution, which needs to occur from inside the VPC, and you need network extent, or the ability to get to whatever address your DNS resolves to. Both conditions are easy to achieve from any VPC or VNET using VNS3 configured in what we call a Federated Network Topology and utilizing the VNS3 Docker-based plugin system to run BIND 9 DNS forwarders. Let’s start by taking a look at the VNS3 Federated Network Topology.

    VNS3 Federated Architecture Diagram

    The core components that make up the Federated Network Topology are a Transit Network made up of VNS3 controllers configured in a peered topology and the individual Node VNS3 controllers running in the VPCs and VNETs. All controllers are assigned a unique ASN at the point of instantiation. ASNs, or Autonomous System Numbers, are part of what allows BGP networks to operate. We configure the Node controllers to connect into the Transit Network via route-based IPSec connections. By using route-based VPNs we can then configure each VNS3 controller to advertise the network range of the VPC or VNET it is in that it wants other networks to be able to get to. This route advertisement gets tied to its ASN, which is how other VNS3 controllers know how to get to its network. This gives us the network extent that we need, so that even in a complex network comprised of tens to hundreds to thousands of virtual networks spread across accounts, regions and cloud providers we have a manageable network with minimal complexity.

    VNS3 Transit Diagram explainer

    Setting up DNS Resolution

    The next component we need is a system to return the correct localized DNS response. AWS uses what is called split horizon DNS, where you will get a different response based on where you are making the request from. That is to say, if you make the DNS call from outside you will get the public-facing IP address, whereas if you make the call from inside you will get the private IP address. Say we are in an Azure VNET in US West and need to access an S3 bucket in us-east-1. You would install the AWS command line tools (CLI) on either your Linux or Windows virtual machine and run something like:

    aws s3 ls s3://my-bucket-name

    to get a current listing of objects in your s3 bucket. But you want this interaction to be routed across your secure encrypted network. How would you get the DNS to resolve to the private entry point inside of your sealed VPC in AWS us-east-1 rather than the AWS public gateway? The answer lies in DNS.

    VNS3 has an extensible plugin system based on the Docker subsystem. You can install any bit of Linux software that you want into a container and route traffic to it via its comprehensive firewall. So here we can install BIND 9, the open source full-featured DNS system, into the plugin system. We can configure BIND 9 to act as a forwarder filtered on certain DNS naming patterns. In this case we would be looking for the name patterns of the service, which we will configure to forward on to the plugin address of the VNS3 transit controller running in AWS us-east-1. That controller is configured to forward all of its incoming requests down to the AWS-supplied DNS at x.x.x.2 of its VPC. This will return the correct IP of the S3 Service Endpoint that has been configured in the transit VPC in AWS us-east-1.

    So when the AWS CLI makes the call for “s3://my-bucket-name”, the first action that takes place is a DNS call asking for the IP of the service. Next it will attempt to connect to that IP, which we have made possible as we have created the network extent to that address. From there you can do all the things that you need to do in regards to the bucket.

    There are some other configuration items that need to be put in place as well. You would need to configure either the whole VNET subnet that the virtual machine resides in or the individual network interface of your VM to point to the private IP of your VNS3 controller as its DNS source. The firewall of VNS3 will need to be configured to send incoming TCP and UDP port 53 traffic to the container running BIND 9. And you will need to set up routing rules for your subnet or network interface to point to the VNS3 controller, either for the explicit addresses of Service Endpoints in AWS or for all traffic. The latter has extra benefits, as VNS3 can act as your ingress/egress firewall, NAT and traffic inspector by adding further plugins like WAF and NIDS.
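
To make the forwarder idea concrete, here is a rough sketch that renders a BIND 9 forward-only zone stanza of the kind you might place in the plugin container's named.conf. The zone name and forwarder address are hypothetical placeholders; substitute the service naming pattern you need to match and the plugin address of your own transit controller.

```python
def forward_zone(zone, forwarders):
    """Render a BIND 9 forward-only zone stanza for named.conf."""
    servers = " ".join(f"{ip};" for ip in forwarders)
    return (
        f'zone "{zone}" {{\n'
        "    type forward;\n"
        "    forward only;\n"
        f"    forwarders {{ {servers} }};\n"
        "};\n"
    )


# Hypothetical values: a placeholder zone and a documentation-range IP.
stanza = forward_zone("service.example.internal", ["192.0.2.53"])
print(stanza)
```

With `forward only`, queries matching the zone are answered solely by the listed forwarder, which keeps resolution of those names pinned to the DNS inside the target VPC.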


    The above is an illustration of one use case, accessing an AWS S3 bucket from Azure across an encrypted network with full visibility and attestability. Other possibilities include the entirety of Service Endpoints offered by Azure and AWS, and the mechanics are the same whether AWS to Azure, vice versa, or across regions inside a single cloud provider. The takeaway is that VNS3 has powerful capabilities that allow you to create secure, extensible networks with reduced complexity and inject network functions directly into them, letting you take advantage of cloud provider services in an agnostic way.

    AutoRecovery in the Public Cloud

    “Everything fails, all the time.” – Werner Vogels

    While VNS3 is extremely stable, it is not immune to the underlying hardware and network issues that public cloud vendors experience. VNS3 provides a variety of methods to achieve High Availability and instance replacement. However all of that takes place above the customer responsibility line. What can you do for your cloud deployment to protect yourself from the inevitable failures that take place below the line?

    On top of the solutions offered by public cloud providers, Cohesive Networks offers a variety of methods for achieving instance and network recovery, whether it be BGP distance weighting, Cisco style Preferred Peer lists or our Management Server (VNS3:ms) which will programmatically replace a running instance or facilitate Active & Passive running of VNS3 instances. Keep an eye on this blog space for further discussions in these key areas.

    AutoRecovery in AWS via CloudWatch

    Amazon Web Services has perhaps the most comprehensive function for protecting yourself from underlying failures. They offer what they call a CloudWatch alarm action. This monitor is tied to your instance ID; should AWS status checks fail, your instance will be brought up on new hardware while retaining its instance ID, private IP, any Elastic IPs and all associated metadata. You get to set the periodicity of the check and the total number of failed checks that will kick off the migration. So if you need assurance that your instance will get moved to good hardware after as little as two minutes, you can set it as such. From a VNS3 perspective, this ensures that any IPSec tunnels will get reestablished, any overlay clients will reconnect and any route table rules pointing to the instance will maintain health once the instance has recovered. On top of all of this you can configure it to publish any alarm states to an SNS topic so that you receive notification should this occur. Cohesive Networks highly recommends that you set this up for all VNS3 controllers and Management Servers.
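
As a sketch of what such an alarm might look like, the snippet below builds the parameters for a CloudWatch `put_metric_alarm` call that watches the `StatusCheckFailed_System` metric and triggers the EC2 recover action after two consecutive one-minute failures. The instance ID, region, SNS topic ARN and alarm name are placeholders we made up, and the helper only builds the request, so it runs without AWS credentials.

```python
def recovery_alarm_params(instance_id, region, sns_topic_arn=None,
                          period_seconds=60, evaluation_periods=2):
    """Build keyword arguments for CloudWatch put_metric_alarm that recover
    the instance when the system status check keeps failing."""
    actions = [f"arn:aws:automate:{region}:ec2:recover"]
    if sns_topic_arn:
        actions.append(sns_topic_arn)  # optional notification on alarm
    return {
        "AlarmName": f"auto-recover-{instance_id}",  # illustrative naming only
        "Namespace": "AWS/EC2",
        "MetricName": "StatusCheckFailed_System",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Maximum",
        "Period": period_seconds,
        "EvaluationPeriods": evaluation_periods,
        "Threshold": 1.0,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        "AlarmActions": actions,
    }


# Placeholder identifiers for illustration.
params = recovery_alarm_params(
    "i-0123456789abcdef0",
    "us-east-1",
    sns_topic_arn="arn:aws:sns:us-east-1:123456789012:ops-alerts",
)
print(params)
```

To apply it for real, you would pass the result to `boto3.client("cloudwatch").put_metric_alarm(**params)`. A 60-second period evaluated twice gives the two-minute window mentioned above.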

    You can find out more about configuring AWS CloudWatch alarm actions in the AWS documentation.

    Service Healing in Azure Cloud

    The Microsoft Azure cloud has the concept of “Service Healing.” While it is not user configurable, it is not dissimilar from AWS in that Azure has a method whereby it monitors the underlying health of the virtual machines and hypervisors in its data centers and will auto recover virtual machines should they or their hypervisors fail. This process is managed by their Fabric Controllers, which themselves have built-in fault tolerance. As of now Azure does not provide any user controls over this process, nor notifications, and the process can take up to 15 minutes to complete, since the first action is to reboot the physical server that the virtual machines run on; failing that, it will proceed to migrate VMs to other hardware. Azure does state that they employ some level of deterministic methodologies for pro-active auto-recovery.

    Live Migration in Google Cloud

    The Google Cloud Platform has taken a fairly different approach. Over at the Google cloud all instances are set to “Live Migrate” by default. So should there be a hardware degradation and not a total failure, your VM will be migrated to new hardware with some loss of performance during the process. If there is a total failure your VM will be rebooted onto new hardware. This also applies to any planned maintenance that might affect the underlying hardware your VM is running on. As with AWS and Azure, all of your instance identity will transfer with the VM, such as IPs, volume data and metadata. Should you want to forgo the “Live Migration”, you can configure your instances to just reboot onto new hardware. All failed hardware events in GCP are logged at the host level and can be alerted on.