Does anyone know the size at OpenAI? It used to run a 7,500-node cluster back in 2021: https://openai.com/index/scaling-kubernetes-to-7500-nodes/
I’m sure this work is very impressive, but these QPS numbers don’t seem particularly high to me, at least compared to existing horizontally scalable service patterns. Why is it hard for the kube control plane to hit these numbers?
For instance, Postgres can hit this sort of QPS easily, AFAIK. It's not distributed, but I'm sure Vitess could do something similar. The query patterns don't seem particularly complex either.
Not trying to be reductive - I’m sure there’s some complexity here I’m missing!
I am extremely Not A Database Person, but I understand that the rationale for Kubernetes adopting etcd as its preferred data store was more about its distributed consistency features and less about query throughput. etcd is slower because it's doing Raft consensus and flushing writes to disk.
Projects like kine allow K8s users to swap SQLite or Postgres in place of etcd, which (I assume, please correct me otherwise) would deliver better throughput since those backends don't need to perform consensus operations.
https://github.com/k3s-io/kine
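To make the kine idea concrete: it speaks the etcd v3 API to the kube-apiserver and translates each call into SQL against a regular database. A rough Go sketch of that shape (not kine's actual code; the table layout here is invented for illustration):

```go
// Rough sketch of the idea behind kine (NOT its actual code): serve the etcd v3
// KV gRPC API that the kube-apiserver expects, but persist to an ordinary SQL
// database instead of a Raft-replicated store. The table name and schema below
// are invented for illustration.
package kinesketch

import (
	"context"
	"database/sql"

	pb "go.etcd.io/etcd/api/v3/etcdserverpb"
	"go.etcd.io/etcd/api/v3/mvccpb"
)

type sqlBackend struct {
	db *sql.DB
}

// Range answers an etcd "get" by reading the latest revision of a key from SQL.
// No quorum read, no Raft round trip; just a single-node query.
func (b *sqlBackend) Range(ctx context.Context, req *pb.RangeRequest) (*pb.RangeResponse, error) {
	var (
		value []byte
		rev   int64
	)
	err := b.db.QueryRowContext(ctx,
		`SELECT value, id FROM kv WHERE name = $1 ORDER BY id DESC LIMIT 1`,
		string(req.Key)).Scan(&value, &rev)
	if err == sql.ErrNoRows {
		return &pb.RangeResponse{Count: 0}, nil
	}
	if err != nil {
		return nil, err
	}
	return &pb.RangeResponse{
		Kvs:   []*mvccpb.KeyValue{{Key: req.Key, Value: value, ModRevision: rev}},
		Count: 1,
	}, nil
}

// Put appends a new revision row; durability is whatever the SQL backend provides.
// A real implementation also needs Txn, Watch, compaction, leases, etc., and
// registers itself as the KV service the apiserver dials.
func (b *sqlBackend) Put(ctx context.Context, req *pb.PutRequest) (*pb.PutResponse, error) {
	_, err := b.db.ExecContext(ctx,
		`INSERT INTO kv (name, value) VALUES ($1, $2)`, string(req.Key), req.Value)
	return &pb.PutResponse{}, err
}
```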
You might not be a database person, but you’re spot on.
A well-managed HA PostgreSQL setup (active/passive) is going to run circles around etcd for kube control-plane operations.
The caveat here is an increased risk of downtime and a much higher management overhead, which is why it's not the default.
GKE uses Spanner as an etcd replacement.
But, and I'm honestly asking: as a GKE user you don't have to manage that Spanner instance, right? So you should in theory be able to just throw higher loads at it and Spanner should autoscale?
Yes, from the article:
> To support the cluster’s massive scale, we relied on a proprietary key-value store based on Google’s Spanner distributed database... We didn’t witness any bottlenecks with respect to the new storage system and it showed no signs of it not being able to support higher scales.
Yeah, I guess my question was a bit more nuanced. What I was curious about was whether they were fully relying on the normal autoscaling any customer would get, or manually scaling the Spanner instance in anticipation of the load. I guess it's unlikely we're going to get that level of detail from this article, though.
It's not really bottlenecked by the store but by the calculations performed on each pod schedule/creation.
It's basically "take the global state of node load and capacity, pick where to schedule it", and I'd imagine it's probably not running in parallel because that would be far harder to manage.
Not a k8s dev, but I feel like this is the answer. K8s isn't just scheduling pods round-robin or at random. There's a lot of state to evaluate, and scheduling pods becomes an NP-hard problem similar to bin packing. I doubt the implementation tries to be optimal here, but it feels like a computationally heavy problem.
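For anyone who hasn't looked at the scheduler, the loop being described is roughly "filter out nodes that can't fit the pod, score the rest, pick the best". A toy Go sketch of that shape (the structs and scoring rule are simplified assumptions, not kube-scheduler's real code):

```go
// Toy sketch of a filter-then-score placement loop. Node, Pod, and the scoring
// rule are simplified assumptions; the point is that every single placement
// walks the node list.
package main

import "fmt"

type Node struct {
	Name    string
	FreeCPU int64 // millicores left on the node
	FreeMem int64 // bytes left on the node
}

type Pod struct {
	Name       string
	RequestCPU int64
	RequestMem int64
}

// schedule picks a node for one pod: filter out nodes that can't fit it,
// then score the survivors and take the best. At 130k nodes this per-pod
// loop is the cost being discussed, before the parallelism and node-sampling
// tricks the real scheduler uses to keep it manageable.
func schedule(pod Pod, nodes []Node) (string, bool) {
	bestName, bestScore := "", int64(-1)
	for _, n := range nodes {
		// Filter phase: hard constraints (enough CPU and memory).
		if n.FreeCPU < pod.RequestCPU || n.FreeMem < pod.RequestMem {
			continue
		}
		// Score phase: toy "most free CPU after placement" heuristic.
		if score := n.FreeCPU - pod.RequestCPU; score > bestScore {
			bestName, bestScore = n.Name, score
		}
	}
	return bestName, bestScore >= 0
}

func main() {
	nodes := []Node{
		{Name: "node-a", FreeCPU: 500, FreeMem: 1 << 30},
		{Name: "node-b", FreeCPU: 4000, FreeMem: 8 << 30},
	}
	if name, ok := schedule(Pod{Name: "web-1", RequestCPU: 1000, RequestMem: 2 << 30}, nodes); ok {
		fmt.Println("scheduled on", name)
	}
}
```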
There is a doc about how to do this with 1M nodes: https://bchess.github.io/k8s-1m/#_why
So I guess the title is not true?
That's simulated using kwok, not real.
> Unfortunately running 1M real kubelets is beyond my budget.
This is a PoC, not backed by a reliable etcd replacement.
AWS and Anthropic did this back in July: https://aws.amazon.com/blogs/containers/amazon-eks-enables-u...
That is 100k vs 130k for Google’s new announcement. I can’t speak as to whether the additional 30k presented new challenges though.
They mention GCS FUSE. We've had nothing but performance and stability problems with it.
We treat it as a best effort alternative when native GCS access isn't possible.
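For context, "native GCS access" here means hitting the object API directly with the Cloud Storage client library instead of going through a gcsfuse mount. A minimal Go example (bucket and object names are placeholders):

```go
// Read an object straight from the GCS API rather than through a gcsfuse mount,
// where every read turns into FUSE traffic plus GCS API calls behind your back.
// "example-bucket" and the object path are made up.
package main

import (
	"context"
	"fmt"
	"io"
	"log"

	"cloud.google.com/go/storage"
)

func main() {
	ctx := context.Background()

	// One client, with retries/timeouts under your control; none of the
	// kernel/FUSE indirection a mounted bucket adds.
	client, err := storage.NewClient(ctx)
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	r, err := client.Bucket("example-bucket").Object("path/to/object").NewReader(ctx)
	if err != nil {
		log.Fatal(err)
	}
	defer r.Close()

	data, err := io.ReadAll(r)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("read %d bytes\n", len(data))
}
```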
FUSE-based filesystems in general shouldn't be treated as production-ready, in my experience.
They're wonderful for low-volume, low-performance, low-reliability operations (browsing, copying, integrating with legacy systems that don't permit native access), but beyond that they consume huge resources and do odd things when the backend isn't in its most ideal state.
AWS Lambda uses FUSE and that’s one of the largest prod systems in the world.
An option exists, but they prefer you use the block storage API.
> While we don’t yet officially support 130K nodes, we're very encouraged by these findings. If your workloads require this level of scale, reach out to us to discuss your specific needs
Obviously this is a typical Google experiment in running a K8s cluster at 130K nodes, but if there is a company out there that "requires" this scale, I must question their architecture and their infrastructure costs.
But of course someone will always claim that they somehow need this sort of scale to run their enterprise app. So once again, let's remind the pre-revenue startups talking about scale before they hit PMF:
Unless you are ready to donate tens of billions of dollars yearly, you do not need this.
You are not Google.
> You are not Google.
100% agree.
People at my co are horny to adopt k8s. Really, tech leads want to put it on their resume ("resume-driven development") and use a tool that was made to solve a particular problem we never had. The downside is that we now need to be proficient at it, know how to troubleshoot it, etc. It was sold to leadership as something that would make our lives easier, but the exact opposite has happened.
I work for a mature public company that most people in the US have at least heard of. We're far from the largest in our industry, and we run jobs on more machines than that almost every night. Not via k8s, though.
You have jobs running on more than 130k different machines daily??
Are they cloud based VMs, or your own hardware? If cloud based, do you reprovision all of them daily and incur no cost when you are not running jobs? If it's your own hardware, what else do you do with it when not batch processing?
They are provisioned on demand (cloud) and shut down when no longer needed.
>You are not Google.
It's literally Google coming out with this capability, so how is the criticism still "You are not Google"?
The criticism is at pre-PMF startups who believe they need something similar
Doing this at anything > 1k nodes is a pain in the butt. We decided to run many <100-node clusters rather than a few big ones.
Same here. Control-plane components that don't originate from the Kubernetes project itself start failing beyond a certain limit: your ingress controllers, service meshes, etc. So I don't usually take node numbers from these benchmarks seriously for our kind of workloads. We run a bunch of sub-1k-node clusters.
Same. The control plane and various controllers just aren't up to the task.
130k nodes...cute...but can Google conquer the ultimate software engineering challenge they warn you about in CS school? A functional online signup flow?
They could team up with Microsoft, because their signup flow is fine but the login flow is badly broken.
For what? Access to the control plane API?
In general... Try to sign up for their AI services...
Imagine a Beowulf cluster of these
The new mainframe.