PKS 1.2 – What’s new, behind the scenes, and what’s next!

Last Wednesday, PKS 1.2 was released.  I have been slammed at Spring One Platform (which was awesome!), and then travelling all over, so this post is a little delayed – but, as usual, it’s verbose, filled with detail and “behind the scenes” info….

  • First, a huge, personal congratulations to the joint VMware/Pivotal product management and engineering team on PKS! I’ve seen how hard they’ve worked for months, and they should be proud of their baby hitting a huge milestone!
  • Second, don’t hesitate. You (or your CI/CD pipelines) can get the PKS 1.2 bits here.   As always – blog posts/powerpoints are worth little.  Code speaks.  Customers sing.
  • Third, docs are here.
  • Fourth, if you are a VMware SE and you or your customers want a feature, or need something fixed – the best route is the github repo here. If you’re a Pivotal Platform Architect, ask for help on the #pks-tech slack channel and via the AHA input here.
  • Fifth, read up on the official release blog here.

What’s new in PKS 1.2?   A ton.   More than I can list, but here are highlights:

  • Kubernetes 1.11.x
  • Expanded multi-cloud support (AWS joins vSphere and GCP; Azure coming soon) – without the need for complex, fragile scripting.
  • Multi-master/multi-AZ goes GA (no change in code, just hammering on process, failure conditions, etc)
  • Integrated namespace log sinks for devs (above and beyond platform ops observability)
  • Much more feature-rich K8s cluster creation options and post-deploy scaling
  • Network profile configurability (I’m very happy about this one)
  • Huge under the covers changes in NCP (the NSX-T container plugin)
  • Support for NSX-T 2.2.x (and 2.3.x will come in a very fast-follow update).  [UPDATED – 10/9/2018 @  11:48 – PKS 1.2 now supports NSX-T 2.3 – more here]
  • K8s API role binding with LDAP (see the sketch just after this list)
  • TLS termination for load balancers
  • … and a lot more!
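
To make the LDAP role-binding bullet above a little more concrete: once PKS is authenticating users against LDAP (via UAA), what a platform operator applies is plain Kubernetes RBAC. Here’s a minimal, hypothetical sketch using the stock Python Kubernetes client – the namespace, group DN, and role name are made up, and I’m assuming the LDAP group is already surfaced in the user’s token.

```python
# Hypothetical sketch: bind an LDAP-backed group to the built-in "edit" ClusterRole
# in a single namespace. Assumes PKS/UAA is already mapping LDAP groups into the
# Kubernetes identity, and that your kubeconfig points at the target PKS cluster.
from kubernetes import client, config

config.load_kube_config()  # e.g. the kubeconfig written by `pks get-credentials`

role_binding = {
    "apiVersion": "rbac.authorization.k8s.io/v1",
    "kind": "RoleBinding",
    "metadata": {"name": "dev-team-a-edit", "namespace": "dev-team-a"},
    "subjects": [{
        "kind": "Group",
        "name": "cn=dev-team-a,ou=groups,dc=example,dc=com",  # made-up LDAP group DN
        "apiGroup": "rbac.authorization.k8s.io",
    }],
    "roleRef": {
        "kind": "ClusterRole",
        "name": "edit",  # built-in aggregate role
        "apiGroup": "rbac.authorization.k8s.io",
    },
}

client.RbacAuthorizationV1Api().create_namespaced_role_binding(
    namespace="dev-team-a", body=role_binding
)
```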

My friend and colleague Naranyan does a great job summarizing what’s new here:

The headline (at least to me) is PKS is cooking and 1.2 is a shot of nitro.  

What do I mean?   Well – I mean this:

  • PKS 1.0 was a fresh baby – VMware and Pivotal as proud parents, but it was decidedly a 1.0 release. It was an MVP to enter the market and engage with customers for the initial fast feedback loops.
  • PKS 1.1 is when we had customers start to push on PKS for real (and the first few early customers going into production).
  • With PKS 1.2 and NSX-T 2.2.x (and NSX-T 2.3 in VERY short order) – I think we’re entering the period where hundreds and then thousands of customers will deploy PKS into production.

For some perspective: it’s still early days for PKS (heck, it’s still early days for Kubernetes in the enterprise for real) – but I’ve seen this before; it’s not my first rodeo 😊   The customer adoption ramp of PKS looks like vSAN, or NSX, or VxRail – but steeper.

Now that we’ve got that behind us – what can I add?   The “behind the scenes” of course!  

Read on for more!

I’m still learning a lot about the Pivotal/VMware machinery – but the “mouth to firehose” period is starting to wind down for me.   It’s FASCINATING – because now I can pause and look around.   What I’ve seen/experienced is an interesting contrast to the “infra” world I’ve inhabited for the last 14 years professionally.

Let’s start with cadence.

There have been 11 PKS releases by my count (including 3 majors) in the last 8 months.

Within Dell EMC, the release cadence for products is on roughly an annual calendar (anchored in Dell Tech World) – and major hardware platform refreshes are every 3 years.    Customers update their infra in multi-year cycles.

Within the world of most of VMware’s enterprise software (vSphere, vSAN), there’s likewise a major refresh calendar with a familiar tick/tock cadence on 12/18-month cycles (anchored in VMworld) – and of course patches between major releases.   Beyond the “major release cadence”, there’s historically a “6 month pause, then mass deploy”.  Ergo – VMware customers on their core platforms have often expressed to me that they use the 6-month window between a major (vSphere 6.5) and u1 (vSphere 6.5u1) to test and learn – all before putting it into production.

The world of Pivotal is different.   It’s been interesting to be up to my elbows with PAS and PKS customers for the last few months – the contrast is stark.    The OSS world moves on a much, much faster cadence, which means that the CI/CD principles used in the application domain simply HAVE to be part of how cloud-native platforms are maintained.

It’s been equally interesting to be immersed in how Pivotal tackles this problem.   There is Continuous Delivery religion here at Pivotal.    It’s applied to everything – from how we build products, to how we structure teams, to how we engage with customers.    I suppose it’s not surprising, as the company is built on a foundation of Extreme Programming and Acceptance Test Driven Development.

I can vouch for this.   As someone working intimately with our “Top 5” PKS customers, I can tell you that our backlog is loaded with their inputs (reflecting a never-ending “gap tracking” model) and that, as we iterate like crazy, we are on constant deep dives with them to drive requirements.    We push back like crazy internally on “we should do this because it’s cool” – instead working stuff from the top of the backlog.

I’m used to “listen to your customers” and “move fast” (values Dell EMC and VMware also share) – but this is a different universe.

The way we (VMware and Pivotal) build PKS is anchored in fast iteration, intent-led design, and constantly working that backlog.   We aren’t perfect, but biasing for decisions that lower the cost of change + rapid iteration = things get good surprisingly fast.

We were not the first to the enterprise K8s market.   But we’re starting to get PKS feedback that echoes what customers love about the Pivotal Application Service – it just works, and with the Concourse pipelines driving constant updates I get messages like this tweet.

Now, this happens to be from Matt Cowger, a friend and a fellow Pivot – but the experience is echoed by customers all over.

I’ll give you an example.   In talking to the team at Boeing last week, they told me that the PCF platform updates itself like clockwork every Wednesday.   1800 applications.   1000 developers.   Supported by 6 platform operations folks.    Whoa.  If you look at the Spring One Platform sessions – you’ll hear this echoed from a ton of our customers.

I challenge anyone working on their own OSS science experiments in the enterprise (K8s, OpenStack, you name it) to aim at that sort of goal: it’s not about doing what’s cool, it’s about making things that are fundamentally cool so effective they become boring.   I find that, invariably, they end up back-revved by years – they just cannot sustain that cadence.

That’s VERY important in the K8s universe.

Look at this cadence, and ask yourself: “can I really operationalize our platform with a cadence that does full platform updates every 1-3 months vs. every 12-18 months?”   If the answer is “no” (and honestly, this is what I see at almost all enterprises trying to use OSS to build their own platforms), give us a shot.   Pivotal’s religion around CD approaches to EVERYTHING means that platforms that update themselves are possible.

Sidebar: we are really, really good at this, but not perfect.   2 months ago, we had a customer whose CI/CD automation for PCF broke (and we broke it).   The part of the story that was the most fascinating (and gives you insight into how seriously we view this “platform automation thing”) is that the response was intense self-reflection, and a complete deep RCA on how we broke our own pipelines.  Fail fast, and fix fast.

While “modern” Pivotal was always like this – it’s just as interesting to see what’s happening in parts of VMware.   VMware is also making changes in this regard – in large part driven by their journey to being more “cloud like”.   The fast feedback loops inherent in more and more of their stuff being “SaaS”, and the forcing function of VMC, are also driving this cultural/operational change.   More on the linkage to NSX-T below.

I’m always amazed how customers try so hard, fight so valiantly to build and sustain this sort of thing themselves.

Of course, those Enterprises are filled with brilliant, passionate (and I would argue mis-directed) technologists building and sustaining their OSS-based IaaS/CaaS/PaaS/FaaS.

But if I ask them how many of them can keep up with the 3-month K8s core cycle – not to mention the things beside it (e.g. Prometheus), below it (a frothy CNI/CSI ecosystem), and on top of K8s (workloads like the data ISV ecosystem, and layers above the stack like Istio and Knative) – well, then the truth comes out, and they confide that they know it just cannot be sustained.

Advice: remember three mantras:

  1. “yes, you will need multiple abstractions (it’s an “and” not an “or”) – but constantly aim for the highest abstraction you can for any given workload…”
  2. “…regardless of the right abstraction – aim your efforts ABOVE the value line for that particular abstraction (a different place for each abstraction) – focus on what you consume vs. what you construct” (this is a familiar topic – and true at EVERY layer of the stack).
  3. “…focus on running the platform itself as a product or an app – with CD as your hallmark, and your time from idea->backlog->prod as a core metric.”

Next, let’s talk multi-cloud.

So… in PKS 1.2 we added AWS, and Azure support is right behind it, at which point PKS will natively support all the major IaaSes.

If you’re the video type – watch the below (thank you Dan Baskette!)

Aside – notice how much time is spent on the networking/load-balancing configuration steps.  This highlights how, with K8s, networking is a critical portion of the complexity.   Right now we use NSX-T to simplify this on-premises; eventually, native NSX-T support on public clouds other than VMC will be an interesting domain.   That’s in the “todo” pile (see the “what’s next” part of this blog for more on this).

We view the cloud IaaS support as an integral part of PKS – which means that PKS itself (including CFCR – the native K8s core) operates the same on every cloud, and that every cloud has the same observability/authentication integration points.   Common control, common behavior.    In each cloud, we move all the K8s and core OS patching/updating to “below the value line” – we take care of that.

Also, the multi-cloud integration with PKS is “batteries included” – meaning that we own not only making it work, but sustaining the integration.   This is not a case of “with a whole pile of custom scripting/automation it can work on any cloud” (which ends up all fragile and brittle = not sustainable).
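
One way to see the “common control, common behavior” point: the developer-facing Kubernetes API is identical regardless of which IaaS PKS is running on. Here’s a minimal, illustrative sketch (the namespace and app names are made up) – the same LoadBalancer Service request works whether NSX-T fronts it on vSphere or the cloud’s native load balancer fronts it on AWS/GCP; the plumbing differences stay below the value line.

```python
# Minimal sketch: the same Service-of-type-LoadBalancer request is issued against a
# PKS cluster regardless of the underlying IaaS; NSX-T (on-prem) or the cloud's own
# load balancer does the actual work underneath. Namespace/app names are made up.
from kubernetes import client, config

config.load_kube_config()  # kubeconfig for any PKS cluster, on any supported IaaS

service = {
    "apiVersion": "v1",
    "kind": "Service",
    "metadata": {"name": "web-lb", "namespace": "demo"},
    "spec": {
        "type": "LoadBalancer",
        "selector": {"app": "web"},
        "ports": [{"port": 80, "targetPort": 8080}],
    },
}

client.CoreV1Api().create_namespaced_service(namespace="demo", body=service)
```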

Now… the “behind the scenes” story on multi-cloud here is interesting…

… at the start of PKS, we weren’t sure if customers would want PKS to be multi-cloud.

Some originally figured “hey, K8s is the standard cluster management/scheduler, and docker images are the container standard – SURELY by the time we GA, GKE, AKS, EKS will be roughly equivalent and mature… so people may want PKS, but they are likely to use the native K8s services from the hyper-scale folks”.

Interesting.  I was in that camp.   I was wrong.

Yes, of course, many customers are using the native K8s managed services from the hyper-scale public clouds.

But now…

… people are realizing how different they all are.    GKE remains the gold standard.   AKS and EKS have a ways to go (Chad’s humble opinion) and they are iterating fast.   But – the ways they all deliver observability, authentication, and other services (networking certainly) are wildly different.

More than HALF the PKS customers want vSphere + another public IaaS (or two).   We learnt something (which is always priceless).

The reason customers want the Pivotal Application Service on multiple clouds (a common control layer – with all the observability/authentication things that go with it, and total guaranteed portability) exists for lower-level abstractions too, like PKS and K8s.   Not as much perhaps, but the need is the same.   It is also why, although I was initially skeptical about the VMware Managed Cloud, I’m seeing that people value this common control layer.

That said – there are many REAL needs to “punch out” and use native hyper-scale cloud services directly.   Like so much – it’s an “and” not an “or”.    When should you “punch out”?   When the benefits of going through a common abstraction layer are outweighed by the benefits of the native cloud service.

So – in our extended family, we do “punch out” with some control: via vRealize, with a focus on infrastructure constructs (including at the K8s level), and via the Open Service Broker, focused on higher-abstraction developer services (with new AWS integration there too – read here – joining the mature GCP OSB here and Azure here).
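
For the Open Service Broker “punch out”, the mechanics are just the OSB API: the platform asks a broker for its catalog, then provisions and binds services from it. A hedged sketch below – the broker URL and credentials are placeholders – of the catalog call that every OSB-compliant broker (AWS, GCP, Azure) exposes.

```python
# Sketch of the first step of any Open Service Broker interaction: fetch the catalog.
# The broker URL and credentials below are placeholders, not a real endpoint.
import requests

BROKER_URL = "https://osb.example.com"        # hypothetical broker endpoint
AUTH = ("broker-user", "broker-password")     # placeholder basic-auth credentials

resp = requests.get(
    f"{BROKER_URL}/v2/catalog",
    auth=AUTH,
    headers={"X-Broker-API-Version": "2.14"},  # required by the OSB spec
)
resp.raise_for_status()

# Print the services and plans this broker offers to the platform.
for svc in resp.json().get("services", []):
    plans = ", ".join(p["name"] for p in svc.get("plans", []))
    print(f"{svc['name']}: {plans}")
```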

What have we learnt about networking?

This has proven to be perhaps the biggest area of unexpected (or insufficiently understood) work and learning.   We knew we would need to think and work hard on day-2 operations.   We knew we were biting off a lot (and in ways not everyone agrees with) around “being the Kubernetes for Kubernetes” with CFCR and with on-demand PKS (the core idea of how we deliver fleet management for K8s clusters).   We knew we had a lot to do (and lots more to do!) around observability, around security, around authentication.

… But I don’t think we knew just how complex the networking domain would be.

One thing I take great delight in is the “whole product” philosophy we’re taking with respect to PKS.   While technically the core is CFCR (K8s + BOSH) + OD-PKS (On-demand PKS = control plane), how it integrates with Harbor, how it integrates with NSX-T, how we bring that all together – PKS is viewed by the combined VMware/Pivotal team as a whole, not a series of parts.

With PKS 1.2, I’m particularly excited about something customers should never have to see – not a “feature”, but how internal stuff should work in our family.

For the first time, the core NSX-T development team in VMware and the PKS teams have been truly operating the way we aim to.   The ATDD approaches have really started to cross-pollinate.   We completely re-architected the NCP lifecycle management for deeper integration.   The early Concourse pipelines for NSX-T which were “generated in the field” have been picked up by the NSX and PKS product teams.

Are we done?  Hell no.   I think this is a domain where we can do a lot together and delight a lot of customers.

The original foundational idea: “a Kubernetes platform is incomplete without an SDN – so making that included, integrated, and supported” has proven to be right on.   I’m proud of the progress we’ve made over 8 months, and I’m encouraged over the direction I see ahead.

Now… since we do so much with NSX-T, I’ve been asked multiple times “do we support non-NSX-T options?”

Short answer: YES.

Here’s the more complete answer – I assure you, the below is the official PKS position from VMware and Pivotal (though my blog is personal and not authoritative – I’ll get this published in a better place).   It’s also our set of recommendations and support statement regarding networking and storage providers:

  • The commercial PKS offering from Pivotal and VMware includes NSX-T as a tightly integrated and comprehensive container networking and security capability.
  • PKS has out-of-the-box, “batteries included” support for NSX-T and Flannel. Flannel can be used when NSX-T support is not available on the IaaS or a customer prefers not to use the tightly integrated and comprehensive capability of NSX-T.
  • PKS embraces the open ecosystem and includes low-binding pluggable interfaces (e.g. CNI and CSI), as do all native Kubernetes platforms (a concrete CNI example follows this list). VMware and Pivotal support these interfaces as part of PKS.  When using these interfaces with 3rd-party solutions, customers are responsible for obtaining support for the 3rd-party software.
  • Providers of 3rd party software are encouraged to join the VMware Technology Alliance Program to receive TSANet support from VMware to collaboratively resolve support issues.
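
To illustrate the CNI point in the list above: “pluggable” concretely means the kubelet on each worker reads a CNI network config from /etc/cni/net.d and loads whichever plugin it names – NSX-T’s NCP, Flannel, or a third party’s. A rough sketch follows, assuming the standard Flannel conflist layout; the file name, path, and values are illustrative, not PKS’s exact ones.

```python
# Illustrative only: what a CNI network config (conflist) roughly looks like.
# Swapping the CNI provider is, at this layer, swapping this file and the plugin binary.
import json

flannel_conflist = {
    "name": "cbr0",
    "cniVersion": "0.3.1",
    "plugins": [
        {"type": "flannel", "delegate": {"hairpinMode": True, "isDefaultGateway": True}},
        {"type": "portmap", "capabilities": {"portMappings": True}},
    ],
}

with open("10-flannel.conflist", "w") as f:  # normally lands in /etc/cni/net.d/
    json.dump(flannel_conflist, f, indent=2)
```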

So what does this mean?   We’re open.  We provide choice.   But we DO have an opinion, and it’s rooted in the fact that most customers ask for help using some form of “dammit, all I want is that this stuff all works” – and while they don’t say it, it is implied that: “…and I want it to work in a sustained way – I want the platform to be like a product”.

By biting off the networking domain (NSX in the SDDC for IaaS, in PKS for CaaS, in PAS for PaaS, in PFS for FaaS – and a lot of work around Istio at VMware and Pivotal – more on that later) as an integral part of the stack, we’re not closing off choice – we’re making the stack complete.

What’s Next for PKS?

The short answer is a lot – but the fundamentals aren’t changing.   The fundamental aim for PKS is simple, and can be said in a sentence: PKS = the best, enterprise-ready, fully integrated, “batteries included”, multi-cloud native K8s platform as a product… enterprise-ready K8s “dialtone”, if you will.

The other thing that won’t be changing is HOW we crank on PKS – we focus on iterating through our customer-generated (ergo real world), intent-led (ergo not a feature, but a use case) backlog.   We do this with a maniacal focus on sustainability.    This way of working has proven itself; iterating quickly through cycles can be frustrating, but it leads to happy places.

Areas of focus for us as a product team:

  • Constant compatibility.   This never stops.   K8s 1.12 is now out.   In line with the pattern we’ve shown an ability to sustain, customers should expect it to show up in PKS within a couple of months.  And, like with PKS 1.2, their pipelines will automatically update the platform.   Our focus on staying aligned, current, and consistent is important not only for PKS itself, but for the work which layers on TOP of K8s (Istio, Knative and more).   Customers are thanking us for this commitment.  (A small version-check sketch follows this list.)
  • Continuing to improve the enterprise production experience. These are the highest priority items on the backlog.   As PKS moves into production in some of the largest enterprises in the world, we’re rapidly learning and improving here.   The list is really, really long and, as one would expect, FILLED with topics focused on integration with customers’ availability, authentication, security, networking, and observability requirements.   Customers also need a ton of K8s cluster fleet management capability (think easy, sustainable scale up/down, worker node reconfig, and funky stuff like lifecycle management of different K8s versions).
    • Aside related to this topic: I’m often asked about GKE on-premises – GKE is the “gold standard” of Kubernetes from the public cloud, and while VMware and Pivotal have moved way up the list of K8s core contributors, Google still dwarfs us all. From a Pivotal perspective, Google is more of a partner (on K8s, Knative, Istio and much, much more) than a competitor.    We are also culturally so similar.   That said, a frank answer re GKE on-premises is a bit of both – coopetition.   So – how do I feel?   Speaking for myself, I’m relaxed on the topic.   Why?   The hyper-scale public cloud folks have a ton of strengths – but they struggle mightily (for understandable reasons) with how to tango with Enterprise customers.   We’re not perfect as Pivotal and VMware (and more broadly Dell Tech), but we’re pretty darn good at it.
  • Expanding and improving multi-cloud. There’s no doubt that customers will want managed Kubernetes services (GKE, AKS, EKS, and offers from VMware/Pivotal).  Equally, there’s no doubt that customers will want Kubernetes they control that is multi-cloud, and consistent regardless of whether it runs on-premises or on a public cloud – the latter is the focus of PKS.   People who hyperbolically posit that the world will be one or the other are getting into a “stupid bar fight” (I should trademark that) – the answer will be both.   So – that means that Azure support for PKS is a very high priority item in the backlog.  It’s also a never-ending effort – not only do we need to add Azure support, we need to keep improving the PKS experience on AWS and GCP (we can continue to improve here).
  • We know that the networking domain is an area of a lot of work and a lot of opportunity. This means continuing to work to stay in sync with NSX-T (2.3.x in very short order), improved deployment/lifecycle, improved scale – but other things too, including the Flannel use cases (see the networking position up above).  The backlog is rich with customer-requested feedback when it comes to networking.
  • More VMware integration. There’s already a ton of goodies in PKS 1.2 on this front.  Beyond Harbor and NSX-T, which are “batteries included” in PKS, there is deep existing PKS integration with Wavefront, vRealize Log Insight, and the VMware CSI tools.   With PKS 1.2 and vRealize Automation 7.5, there is now a TON of great integration with PKS around control, management, and policy controls for customers who want to extend vRealize into the K8s domain.   In ways that you don’t see at the surface, there are also many other examples (we share a lot of telemetry infrastructure).   That all said – we’re just getting started.   Expect more and more PKS vCenter integration in the near term.
  • More developer value.   The namespace log sink in PKS 1.2 was an important first step.   The fact that at Spring One Platform we started to show things like Istio, Knative + Spring Cloud RIFF, and the fact that open, cloud-native buildpacks are starting as a CNCF sandbox project (together with Heroku), are all the tip of the iceberg.   We are determined to keep bringing a great developer experience to K8s – but to do it in our open, clean-abstraction, multi-cloud way.
  • More recoverability.   With PKS 1.2 we crossed a critical milestone in enterprise readiness with the fundamentals of HA multi-AZ, multi-master support in PKS.  We also included the beginnings of recoverability stories – single-master backup and restore.   We’ll of course expand this to the more common multi-master use case quickly, but a more complete set of stories around K8s platform recoverability gets hairy fast.   After all – without fully integrating the networking and persistent volume domains… well, we’re not covering the full problem domain.   This is a space that shows how K8s will need to continue to mature (and we’re committed to help).
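
As promised in the “constant compatibility” item above, here’s a trivial, illustrative sketch of the kind of smoke check a customer’s pipeline might run after PKS rolls a cluster forward – asking the cluster which Kubernetes version it’s actually serving. The expected version is a placeholder, not a statement of what any PKS release ships.

```python
# Tiny post-upgrade smoke check: confirm the cluster reports the Kubernetes version
# we expect after the platform pipeline has rolled it forward. Values are illustrative.
from kubernetes import client, config

config.load_kube_config()

version = client.VersionApi().get_code()
print(f"cluster reports {version.git_version}")

EXPECTED_MINOR = "1.11"  # placeholder: whatever the current PKS release is built on
assert version.git_version.lstrip("v").startswith(EXPECTED_MINOR), "unexpected K8s version"
```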

Phew – a big post, for a big release.   Congrats again to the product team, thank you – and THANK YOU to our customers, we’re honored to listen to you, to push you (as you do us), and to serve you!

Would, as always, love YOUR feedback!
