Another item to add to the “stupid bar fight” category: kernel-mode VMs under containers

I wrote up a document based on a couple of customer interactions over the last month (one customer wanted a formal write-up) where this debate has raged, and then I saw a post by a VMware colleague, Kenny Coleman: If Kubernetes is in, is vSphere out?  

I decided to post this publicly and not just keep it as Pivotal-internal dialog, for a couple of reasons:

  1. I think it’s good to debate and hash out the merits of points of view broadly – more eyeballs and points of view tend to be better.   INPUT WELCOME!
  2. I think it’s a good dialog to have OUTSIDE of the context of a VMware-centric PoV.   Don’t get me wrong, I’m sure that my PoV is influenced (after all, VMware is family, and I’ve been so close to them for so long), but Pivotal’s point of view on this naturally has a different center of gravity than VMware’s – everything we do at abstractions above the IaaS layer has to be multi-cloud.

I’ll note that this is a case where I’ve benefited from a lot of input and internal debate/dialog with my respected colleagues like the awesome Cornelia Davis (@cdavisafc) and many others.   Thanks all!   This is, however, my post – so any errors are purely my fault 🙂

So what’s this about?  What’s the bar-fight?  Containers, some say, render kernel-mode (hardware) virtualization unnecessary.

  • Why do you need virtual machines (VMs) now if you can run containers on host OS running on raw physical hardware?
  • With Kubernetes orchestrating your containerized applications, could it serve as THE virtualization layer in your IT infrastructure – why layer compute abstractions?
  • Put simply, some ask: “Should I simply run Kubernetes on bare metal?”   

The TL;DR version of the answer: for MOST enterprise K8s use cases, I think the answer is almost always ‘no’.

Combining containers and VMs taps the benefits of each technology.  It creates an organized stack with clean abstractions that streamlines the development, deployment, and management of enterprise applications and meets a set of requirements I’ll get into through this post.    

To be clear, the answer isn’t always no – there are cases where Kubernetes is being used for very specific, very focused use cases. An example is some HPC containerized workloads with very low-level grid scheduling engines (though in those cases it may be arguable whether you need Kubernetes vs. simply a container runtime).    

Saying something is absolutely universal is perhaps the most certain way to be absolutely wrong.   

But as a generalized answer, in my experience, building your Kubernetes solution on a virtualized infrastructure based on kernel-mode virtualization brings some material benefits to enterprises:

  • Optimizes resource utilization & reduces cost
  • Enhances operational agility & flexibility
  • Offers better security, especially in multi-tenant workloads

These are the internal points on why PKS has consistently chosen to prioritize clean infrastructure abstractions and to leverage broadly used underlying IaaS abstractions – including kernel-mode hypervisors on premises, and public cloud IaaS APIs.

Ok – now that we’ve hit it at a high level, read on for the full-on Virtual Geek non-TL;DR version 🙂

Resource Utilization

Resource utilization is the efficiency ratio of effective use to the raw physical compute, memory, network, and storage assets deployed. This ratio is determined by two factors – resource scheduling and abstraction overhead – both of which need analysis.

Scheduling containers or kernel-mode VMs onto raw physical resources is sometimes referred to as “bin packing”. The question of “does using a kernel-mode hypervisor improve or degrade bin-packing efficiency?” is too simplistic.

For example – for some single workloads – the answer may be “yes”. The clearest examples of these scenarios are common/consistent workloads deployed on very large worker nodes in Kubernetes clusters, particularly when they have very rapid resource scheduling intervals, and may also include many smaller container/pod instances (relative to the worker node resources).

However, there are even cases of a single workload that would have the opposite effect (reduced infrastructure efficiency, and lower degrees of bin packing) – resulting in an answer of “no”.  An example of this workload would be Spark, which favors a hard reservation model, and tends to want to use the full K8s worker and use its own scheduler. In these cases, it’s common to either not use Kubernetes, or, when using bare metal, to dedicate and design worker nodes for fit. This results in very poor infrastructure utilization on bare metal. Apache Spark 2.3.0 introduced the first example of Apache Spark scheduling that integrates with the Kubernetes scheduler, so this will improve over time – but this example is illustrative: even for a single workload, it is not always clear what will result in the “best bin-packing”.

A more realistic, common pattern in the enterprise is having mixed pod workloads of all types co-existing on shared resources. When many mixed workloads are used, simple resource scheduling methods often aren’t enough, and the result is lower overall resource utilization and worse “bin packing”.   It’s for these reasons that we often see large kernel-mode VM cluster resources in enterprises that mix PKS and non-PKS workloads.
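To make this concrete, here’s a toy Python sketch of the utilization math. All numbers and workload names are hypothetical, and this is NOT a model of the real Kubernetes or vSphere schedulers – just arithmetic illustrating how a hard-reservation workload strands capacity on a fixed bare-metal worker, while a hypervisor can fill the gap with other VMs:

```python
# Toy utilization sketch. "CPU units" are abstract and invented;
# workload shapes are hypothetical examples, not measured data.

HOST_CAPACITY = 40

def utilization(workloads, capacity=HOST_CAPACITY):
    """Fraction of a host's capacity actually doing work."""
    used = sum(workloads)
    assert used <= capacity, "over capacity"
    return used / capacity

# Bare metal: the host IS the worker node. A Spark-style job with a
# hard reservation wants the whole worker, so nothing else co-locates.
bare_metal = [24]                      # one hard-reserved workload

# Virtualized: a right-sized 24-unit worker VM for the same job, with
# the leftover capacity handed to other VMs (non-K8s workloads, or
# workers belonging to a different cluster).
virtualized = [24, 10, 6]              # worker VM + two other VMs

print(utilization(bare_metal))         # 0.6 -- 40% of the host idles
print(utilization(virtualized))        # 1.0 -- the hypervisor fills the gap
```

The point isn’t the specific numbers – it’s that when worker-node shape and workload shape don’t match, a fixed bare-metal worker has no way to reclaim the difference.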

YES, there are increasingly sophisticated ways of doing resource quota management natively in Kubernetes namespaces, commonly used for scheduling containers based on CPU and memory – but also extensible to other resources.  This is used by the cluster manager for initial pod scheduling on worker nodes. Kubernetes can leverage reservations and even limits during scheduling of pods. It was relatively recently (Kubernetes 1.9) that preemptive prioritization (eviction of pods to free up resources for pods with higher priority when performance is impacted) was added as an alpha Kubernetes feature (beta as of K8s 1.11). Pods do not survive eviction (unlike a live kernel-mode vMotion) and will be rescheduled by the Kubernetes cluster manager.
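To show the shape of what preemptive prioritization means, here’s a toy Python sketch. This is emphatically NOT the real kube-scheduler algorithm – pod names, sizes, and priorities are invented – but it captures the key behavior: lower-priority pods get terminated (not migrated) to make room, and must be rescheduled by the cluster manager:

```python
# Toy preemption sketch -- NOT the real kube-scheduler logic.
# A pod is (name, cpu_request, priority); names assumed unique.

CAPACITY = 10   # hypothetical worker-node CPU capacity

def schedule_with_preemption(running, new_pod, capacity=CAPACITY):
    """Try to fit new_pod on the node; evict lower-priority pods if needed.

    Returns (running_after, evicted). Evicted pods do not survive --
    unlike a live vMotion, they are terminated and left for the
    cluster manager to reschedule elsewhere.
    """
    _, cpu, prio = new_pod
    # Only pods with strictly lower priority are eviction candidates.
    victims = sorted((p for p in running if p[2] < prio),
                     key=lambda p: p[2])          # cheapest priority first
    free = capacity - sum(p[1] for p in running)
    evicted = []
    while free < cpu and victims:
        v = victims.pop(0)
        evicted.append(v)
        free += v[1]
    if free < cpu:                  # still no room -> new pod stays pending
        return list(running), []
    remaining = [p for p in running if p not in evicted]
    return remaining + [new_pod], evicted

node = [("batch-a", 4, 0), ("batch-b", 4, 0)]
node, evicted = schedule_with_preemption(node, ("payments", 6, 100))
print([p[0] for p in evicted])      # ['batch-a'] -- terminated, to be rescheduled
```

Note that nothing in this sketch is namespace-aware – which is exactly the point made next: priority crosses tenant boundaries within a cluster.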

Note that prioritization is a non-namespaced feature – one of many items that highlight that most customers with multiple different tenants and workloads will need multiple clusters (for example – multiple tenants whose workloads cannot be allowed to evict one another’s), even before you get to security considerations (of which there are many – discussed later in this post).

The ability to use a kernel-mode hypervisor to dynamically re-allocate resources to provide Kubernetes worker nodes – even if not oversubscribed – provides more flexibility in the resource bin-packing problem for mixed workloads.   While not so relevant RIGHT NOW, there’s also interesting R&D about how this could be used to bring even more elasticity to the worker nodes themselves, and even skinny down the stack further.

Furthermore, consider that in some cases, physical resource “adjacency” matters – something that cannot be done practically purely via Kubernetes resource scheduling on bare metal without contortions at the physical infrastructure layer, or depending on proprietary physical hardware. For example – imagine a workload where there is a very large (10’s of GBps) amount of networking bandwidth required between different pods, and the workloads are very latency-sensitive.   This is not a hypothetical case, BTW – it came up at a recent multi-vendor, multi-customer session we held in San Francisco together with Portworx, where we asked about real workloads (in this case containerized data service platforms).

In the scenario above, there can be massive optimization for the workload (and reduced infrastructure cost) by minimizing east-west physical network traffic – keeping inter-pod traffic on the same physical node. In some of these cases, the data pods must be in different Kubernetes clusters due to network, Kubernetes version support, API server configuration, or security/tenancy constraints (see more on this below). Without kernel-mode hypervisors – which enable multiple Kubernetes clusters to co-exist on physical hardware via virtual worker nodes – this configuration is manifestly impossible to support.

While the example I pointed out was one particular workload, it’s notable that this “eliminate the physical network cost” bonus is not uncommon in certain compute/data service scenarios, and overcomes the kernel-layer virtualization latency and CPU cycle costs of context switches (even though the moves over the last 5 years in the kernel-mode networking space have reduced this impact enormously – more on this below).  This is one of many examples where resource scheduling is more flexible using both the container abstraction and the kernel-mode virtual machine abstraction together.
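The adjacency math can be sketched in a few lines of Python. The pod names and bandwidth figures below are invented for illustration – the only real claim is the accounting rule: inter-pod traffic costs physical east-west bandwidth only when the endpoints land on different hosts:

```python
# Toy east-west traffic calculation; pod names and Gbps are hypothetical.
# A flow is (pod_a, pod_b, sustained Gbps between them).
flows = [("db-primary", "db-replica", 20),
         ("db-primary", "app",         5),
         ("db-replica", "app",         5)]

def cross_node_gbps(placement, flows):
    """Sum the flows whose endpoints sit on different physical hosts."""
    return sum(g for a, b, g in flows if placement[a] != placement[b])

spread    = {"db-primary": "host1", "db-replica": "host2", "app": "host3"}
colocated = {"db-primary": "host1", "db-replica": "host1", "app": "host1"}

print(cross_node_gbps(spread, flows))      # 30 -- every flow hits the wire
print(cross_node_gbps(colocated, flows))   # 0  -- traffic stays in the host
```

When those pods must live in different Kubernetes clusters, only worker-node VMs sharing a physical host can get you to the co-located column – which is the point of the paragraph above.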

Beyond resource scheduling – there is a question of “what is the kernel mode hypervisor raw performance impact?”.

To consider this purely from a Kubernetes standpoint, imagine a hypothetically perfectly scheduled set of pods which are containers living on virtual Kubernetes worker nodes that entirely dedicate a physical host as the worker node (“perfect bin-packing”). With this model, ask “what is the performance delta of passing through a kernel-mode hypervisor?” or more precisely “exactly how would performance change if we ran the same scenario without a kernel-mode hypervisor?”

The answer, as a general statement, is that the performance overhead of modern kernel-mode hypervisors is relatively minimal.  Even very high performance workloads – including those that generally suffer more materially from kernel-mode virtualization and relatively large amounts of context switches (storage, networking IO), or those with specific GPU needs – have strong options for near-native performance (using DirectPath I/O and SR-IOV for IO, and GPU pass-through as appropriate).

  • Third-party testing commonly found on the internet generally does not leverage these approaches (they don’t always configure the guests or kernel-mode hypervisors properly) – but even in those cases, performance is shown to be quite similar.
  • Very dated (2014) performance testing by IBM comparing container abstractions and the impact of KVM’s QEMU kernel-mode abstraction shows that the performance impact is very similar in many cases – and KVM using QEMU pales relative to more modern kernel mode hypervisors.
  • In cases more recently demonstrated by VMware using vSphere 6.5, performance of some containerized workloads is surprisingly better through a kernel-mode hypervisor. This is because the kernel-mode hypervisor scheduler is aware of modern Intel and AMD NUMA architectures, and leverages that to schedule compute adjacent to the memory it needs – which is lower latency and higher bandwidth.   With CPUs getting more and more NUMA every day (those cool new AMD Threadripper 2990WX CPUs are causing some $$ to burn a hole in my pocket), this is a real factor.

The answer to “what is the performance impact of the kernel-mode hypervisor?” is “it depends, but for most workloads, it’s a very small impact”. There may be some workloads where the impact (small as it may be) results in some additional cost – but the question needs to be framed against multiple considerations, including operational ones.

Operational Agility

While resource “maximization” is often the focus of the bare-metal vs. kernel-mode-hypervisor discussion in conjunction with Kubernetes, the latter topics of operational and security considerations have an ongoing, latent/hidden impact – often far, far more impactful than a performance consideration, which can be solved with simpler approaches.

Operational complexity results in a multitude of immediately apparent impacts, but also impacts that simply cannot be known – because future change cannot be predicted. This means that two principles are essential to reducing the cost of change overall:

  • Architectural choices that drive towards clean abstractions between parts of the technology stack that move/change independently of one another = minimizing the risk of any given unforeseen change.
  • Architectural choices that lower the cost of change generally = minimizing the cost of any given unforeseen change.

Let’s examine these in more detail, starting with “hardware abstraction = clean abstraction + lowering the cost of change”.

Without a common base hardware abstraction, there are two necessary dependencies that arise:

  1. Accepting the need for validation, testing, interoperability, and automation for the world of all possible/likely hardware variations – forever;
  2. or… tightly restricting the hardware variations to single vendors, and single platform choices.

Let’s examine an example.  If a storage controller on a Kubernetes worker node changes, and that worker node is a physical host, then the driver support/testing is dependent on the particular host OS. The automation/scripting/configuration management to update that storage controller will be dependent on a nearly infinite set of tools to update the driver/firmware.  This may seem like a trivial problem – but if you imagine all the updates for the full stack underneath the Kubernetes worker nodes, the network (and container network interface ecosystem), and storage (and the container storage interface ecosystem) over a multi-year period, it results in the fragile web of spaghetti and scripting known only by individuals (who in turn become a critical dependency) that holds most enterprise IT together.  

Yes, Linux OS driver support is very good – but consider what this does for the base host OS – it gets fat.  You end up doing more and more to either manage fleets of different host OSes (yuck, with all the security complexity that adds) or you end up with a single consistent image that is overweight (yuck, with all the security complexity that adds).

Conversely, by using a kernel-mode virtualization abstraction – all hardware below the abstraction becomes the problem of the virtualization provider, and all the hardware presentation above the abstraction is normalized. This results in “common configuration” with a much larger set of customers – and being “boring and common” in this case results in a simpler operational model.  It also lowers the complexity of the “standard OS image” of the Kubernetes worker node – because only a minimal driver set and automation tools need to be loaded – and it will always be the same, without binding to specific hardware.

The second point (“tightly restricting hardware variations”) is rooted in the fact that while industry standard (“commodity” – I’m not sure how true that word is) x86 hardware rules the market, there is absolutely no widely used industry standard for hardware-level APIs, tools, and update/lifecycle management. The way this is done is unique to each vendor. So – only by standardizing on single-vendor solutions (even subsets of those vendors’ portfolios, as very commonly there’s a different set of tools and APIs even within single vendors) is sustained automation even possible.   The key word there is sustained – it’s of course possible to automate anything… the question is how much fragility is built into the automation itself.  I would argue that (variation in what is being automated) = 1/(sustainability of said automation).

This is why another common driver of bare-metal dialog (avoiding “hypervisor lock-in”) doesn’t hold up to rigorous scrutiny IMHO.  

The reality is that every choice is a form of lock-in – the question is whether the benefit outweighs the cost, and whether the cost of change is lower than the benefit received. When these are true, the “lock-in” is a benefit. When they are not, the “lock-in” has a negative effect. If avoiding standardizing on a single hypervisor as a common hardware abstraction forces one to build one’s own automation (versus leveraging the prior art of hundreds of thousands of customers), and sustaining that automation, testing, validation, and driver/firmware work drives one to a single-vendor hardware ecosystem to keep the hardware APIs consistent – how can that possibly be a good form of lock-in? In that scenario, you’re binding yourself to the lowest point of value, but one of the highest capital costs: the underlying industry standard hardware.

This driver is one of the reasons the hyper-scale cloud providers have large “bare-metal as a service” operational functions – teams that sustain hardware standards, strip out all vendor-unique attributes… AND THEN use a lot of kernel-mode virtualization.   These “bare-metal as a service” functions are operational roles that even the largest enterprises cannot sustain or justify, because unlike the hyper-scale cloud providers, this isn’t their core business.

Furthermore, remember that even in the case of the hyper-scale public cloud providers with scaled hardware teams, they generally use very thin hypervisor abstractions widely to avoid creating hardware dependencies and security/tenancy considerations (more on this shortly).

Another example of the operational benefit of clear abstractions and APIs between independent layers is the reality of multi-cloud use cases. While some customers will leverage different K8s platforms on and off premises, including cloud providers’ native managed K8s services, many customers are realizing that while these share K8s API conformance, there are very, very material differences between the K8s services themselves – the control planes, and the operational models of these various services.

Those customers often want a single K8s platform that can run on premises and off premises.  Trying to do this without a common abstraction that interacts with each cloud via its IaaS API has proven to be unsustainable. Sustained operational success of a multi-cloud approach requires a clean infrastructure API abstraction – trying to sustain broad on-premises hardware variants, all with their own APIs and automation/tooling, at the same time as multiple public cloud APIs is simply not possible.

Since PKS has as one of its core design tenets to be a simple, consistent, common K8s platform across multiple clouds, a clear API abstraction that can cover the majority of the enterprise market is necessary – otherwise ongoing development, testing, and validation of PKS using vSphere, GCP (both available in PKS today), AWS (coming very soon), Azure, and the VMware Managed Cloud (5 different IaaS platforms we are targeting) would explode out to a nearly impossible set.  

People are starting to realize – not all K8s are the same.   Read this post here.   Like I pointed out publicly on the twitters – I’m sure the great team at Microsoft will be all over it, but there is a surprising number of material differences between things that all meet basic K8s conformance tests.   There’s a reason we think it’s important for PKS to stay very aligned with “keep as close to vanilla K8s as possible” and “test constantly, not just against K8s conformance, but with other projects like Istio and Knative”… these things matter.

Imagine for a moment if PKS sustained engineering on Dell EMC iDRAC Redfish, HPE Synergy APIs, Cisco UCS APIs, every networking variation and storage stack, as well as the public cloud IaaSes.   While not impossible, it would result in a lower rate of delivering the higher-order benefits of hardening K8s, the PKS control plane, observability, and developer value on top of PKS – all in multi-cloud operational models.  

If this is hard for scaled vendor partners (like us) – imagine choosing consciously to take on that task as an individual or a company.   Does it really benefit you?  Does it really make you do high value things better?   Or are you doing it for resume-generating engineering reasons (“because I can”)?

Security and Tenancy

In order to share resources in a way that meets enterprise requirements around resilience, security and many other attributes we have to establish tenancy domains. You might have separate tenancy domains for different workloads (i.e. core banking vs. wealth management), different customers (i.e. American Airlines card vs. Southwest Airlines), different lines of business (i.e. wireless, landline, cable, content), different functional organizations, and different stages of the app lifecycle. 

Tenancy domains establish a structure around which those shared resources are controlled – they are a place where we can implement policies; policies such as access control, network/physical isolation, different data security postures, different workload priorities, etc.

Multi-tenancy in Kubernetes remains quite “soft”. While basic RBAC can be tied to a namespace – the tenancy construct of Kubernetes – allowing individuals or groups only certain privileges within specific namespaces, many other resources are still shared with much softer boundaries: the compute capacity, an internal DNS, the network (unless using a full-featured SDN), and cluster configurations, including API server options.
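A toy sketch of what “tied to a namespace” means in practice may help here. This is a deliberate simplification (real RBAC uses Roles, ClusterRoles, and bindings; the users and resources below are hypothetical), but it shows where the boundary holds – and where it simply doesn’t exist:

```python
# Toy RBAC check -- a simplification of Kubernetes namespaced RBAC.
# A binding grants (user, verb, resource) within exactly ONE namespace.
bindings = [("alice", "create", "pods", "team-a"),
            ("bob",   "create", "pods", "team-b")]

# Some resources live outside any namespace entirely.
NON_NAMESPACED = {"nodes", "priorityclasses", "persistentvolumes"}

def allowed(user, verb, resource, namespace=None):
    if resource in NON_NAMESPACED:
        # No namespace boundary exists for these; in real Kubernetes any
        # grant must be cluster-wide (a ClusterRole) -- exactly where
        # tenancy gets "soft". This toy just denies by default.
        return False
    return (user, verb, resource, namespace) in bindings

print(allowed("alice", "create", "pods", "team-a"))   # True
print(allowed("alice", "create", "pods", "team-b"))   # False: boundary holds
print(allowed("alice", "create", "priorityclasses"))  # False: no namespaced grant exists
```

The namespace boundary works fine for what it covers – the problem is everything that falls outside it.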

Several examples are useful to illustrate this concretely:

Example 1 – resource management

Earlier in this post, it was noted that pod preemptive eviction is a non-namespaced feature, which leads to the practical benefits of both multiple Kubernetes clusters (for resource management that follows tenancy considerations) and also leveraging kernel-mode hypervisor resource management.

Example 2 – cluster configurations, including privileged containers

Another immediate example that is very focused on tenancy is privileged containers. Privileged containers have the host cgroup limitations removed, and can therefore do almost everything that the host can do.  While privileged containers are not inherently bad, there are practical cases (common in enterprise, multi-tenant, multi-workload scenarios) where they cannot be eliminated pragmatically. This leads to three very immediate implications: 1) enterprise customers will not practically be able to have “massive single Kubernetes clusters for all workloads”, since the Pod Security Policy is a cluster-level resource; 2) yes, you CAN do pod-level security context controls – but this workaround may not fit practically into other cluster-level RBAC controls; 3) so… there are many cases where a kernel-mode virtualization abstraction underneath the Kubernetes worker node enforces an additional security boundary.   Kubernetes is a “leaky” abstraction – one of the virtues/deficits of being so flexible.
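Here’s a toy admission-check sketch of why the cluster-scoped policy matters (pod names are hypothetical, and this is not the real admission controller – just the shape of the problem): because the privileged knob applies to the whole cluster, admitting one tenant’s legitimately-privileged workload opens the door for every tenant in that cluster.

```python
# Toy admission check. Pod Security Policy is a CLUSTER-level resource,
# so in this sketch allow_privileged applies to every namespace at once;
# there is no per-tenant toggle.

def admit(pod, cluster_allows_privileged):
    """Reject privileged pods unless the whole cluster permits them."""
    if pod.get("privileged", False) and not cluster_allows_privileged:
        return False
    return True

ml_pod  = {"name": "gpu-trainer", "privileged": True}   # needs host devices
web_pod = {"name": "storefront",  "privileged": False}

print(admit(ml_pod,  cluster_allows_privileged=False))  # False -- blocked
print(admit(ml_pod,  cluster_allows_privileged=True))   # True -- but now ANY
                                                        # tenant may run privileged
print(admit(web_pod, cluster_allows_privileged=False))  # True
```

Hence the pragmatic pattern: separate clusters per tenancy domain, each built on worker-node VMs that add a hard boundary underneath.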

Will multi-tenancy in Kubernetes get harder/stronger? Almost certainly. There is a working group and multiple Kubernetes Special Interest Groups (SIGs) focused on this problem and some really sharp people are part of those groups.

How long will it take? Years. The working group has been meeting since January and they are still most definitely “storming.”

Arguably, it took the better part of a decade for VMware and others to mature tenancy and security boundaries in kernel-mode virtualization.

To address the tenancy needs today – and while there are religious debates with fanatics on either side of “containers + virtualization” – there is a pragmatic solution staring you right in the face (if you can put fanaticism aside): use what exists in K8s (Linux container isolation, RBAC, cluster and pod namespace security policy where appropriate) in combination with the strong tenancy constructs that exist in virtualized infrastructure abstractions/hypervisors.

In practice what I’ve been seeing is that all of the above translates to multiple “single tenant” clusters instead of fewer (soft) multi-tenant clusters. This is the model that is enabled with the likes of all of the public cloud offerings (certainly GKE) as well as PKS. It’s the result we see at customer after customer as they build more K8s experience, particularly in multi-tenant, multi-workload use cases so common in the enterprise.

Even GKE On-Prem depends on virtualized infrastructure – Google has made it clear that when it GAs, they will target vSphere; bare-metal support is highly speculative at the moment.   It is instructive that Cluster API, the OSS project that GKE On-Prem depends upon, also notes that bare metal is theoretical and that providing support depends on real customer need (implying that the need has not yet been clearly demonstrated, to Google’s satisfaction at least – similar to what we’ve seen).

Conclusion – bringing this all home

In this very long-winded blog post, we’ve explored some of the essential elements when evaluating bare metal vs. kernel-mode hypervisors for use with Kubernetes. Ultimately I believe the benefits span three topics: 1) resource utilization considerations; 2) operational considerations; 3) security/tenancy considerations. These are the reasons why the clean, declarative infrastructure API abstraction enabled by kernel-mode virtualization is essential for sustained enterprise platform success with Kubernetes.

While there are always some workloads for which this guidance doesn’t hold true (perhaps because of the unique technical/business characteristics of that workload), I would encourage readers, customers, and competitors to be pragmatic and ask themselves a question.

Ask whether those workloads are the rule for all workloads (in which case go bare-metal), the exception (in which case if the exception is important – consider two approaches: bare-metal for the exception, kernel mode virtualization for the rest), or whether the whole debate is an excuse to pursue an intellectually curious path (in which case re-direct that curious intellectual energy burning in you towards something more productive – a better bar-fight).

So… What say YOU dear reader?   What has been your experience – in favor, or against these arguments?

