A highly available Tezos baker on Kubernetes
Tezos is one of the major proof-of-stake cryptocurrencies. Our company, MIDL.dev, offers infrastructure services for Tezos bakers and builders.
We just released version 2.0 of Tezos-on-GKE, an open-source framework that we built and that we are using internally. We announced version 1.0 here in 2019. Several bakers have deployed our framework and are baking successfully. In this post, we want to cover a new feature of the framework: baker high availability.
Limitation of the previous setup
Version 1 of Tezos-on-GKE had two public (sentry) nodes and one baking node. This baker was configured as a Kubernetes Deployment, which means that Kubernetes would ensure that there is always one instance (and only one instance) of the baking node running at any point in time.
In addition, we leveraged Regional Persistent volumes to ensure that the node storage would be replicated across several cloud availability zones, providing more resiliency.
A pod termination is a normal occurrence. For example, Kubernetes nodes may be auto-upgraded. When this happens, the Kubernetes control plane restarts the pods on another node to bring the deployment back to a nominal count of one.
However, it is expensive to restart a Tezos baking node. It needs to initialize its storage, spin up its various components, establish peer-to-peer sessions to the sentry nodes, and catch up with the chain head. That can easily take minutes — especially when using compute instances with limited amounts of RAM and CPU. It may result in missed blocks or endorsements for a busy baker.
Enter the active-standby baking node
In the new model, we spin up two baking nodes as a StatefulSet and deploy a master election system to ensure only one bakes at a time.
The advantages of this approach are multiple:
- if one node has an issue, switching over baking operations to the other one is much faster than in the cold standby case
- automation is useful, but sometimes, you need to open a terminal into the baking node and perform manual operations (such as manual garbage collection). A switchover is an easy way to do maintenance without disrupting operations.
Here is what the new topology looks like:
When two baking nodes are alive, we need to select one to be the active baker. This problem can be solved with a distributed consensus protocol such as Raft. The good news is, Kubernetes already uses Raft and etcd natively, so we do not have to build and deploy a separate system. We can leverage native mechanisms to elect our baker. The master election pattern is a Kubernetes classic.
Importantly, the master election may only happen between availability zones within a cloud region. It is not possible to deploy this pattern across regions. Establishing a mutex pattern across long distances is complicated.
Kubernetes StatefulSets are a good fit for active-standby baker nodes. An individual node teardown is a normal event. It could be due to an auto-upgrade or a cluster scale-down. The Kubernetes scheduler will do its best to ensure at least one member of the set is operational at any point in time.
We are making the pods aware of the state of the StatefulSet by running a master-elector container and wrapping the baker and endorser processes into a supervisord script that will only start the daemons on the current master node.
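The gating logic can be pictured with a small sketch. The Python below is illustrative only and is not the script shipped with Tezos-on-GKE: the DaemonGate class, its reconcile method and the tezos-baker program name are assumptions made for this example (only tezos-endorser appears in the logs in this post). It shows the idea of comparing the elected leader with the pod's own name and starting the daemons only when they match.

```python
from dataclasses import dataclass, field


@dataclass
class DaemonGate:
    """Decides which daemons to start or stop based on the election result."""

    pod_name: str
    # Program names are assumptions for this sketch.
    daemons: tuple = ("tezos-baker", "tezos-endorser")
    running: set = field(default_factory=set)

    def reconcile(self, current_leader):
        """Return the (action, daemon) pairs needed to match the election."""
        actions = []
        if current_leader == self.pod_name:
            # We are the leader: start anything that is not running yet.
            for daemon in self.daemons:
                if daemon not in self.running:
                    self.running.add(daemon)
                    actions.append(("start", daemon))
        else:
            # Another pod leads, or the leader is unknown: nothing may sign.
            for daemon in sorted(self.running):
                actions.append(("stop", daemon))
            self.running.clear()
        return actions


if __name__ == "__main__":
    gate = DaemonGate(pod_name="tze-tezos-private-baking-node-self-0")
    print(gate.reconcile("tze-tezos-private-baking-node-self-0"))
    print(gate.reconcile("tze-tezos-private-baking-node-self-1"))
```

Treating an unknown leader the same as "someone else is the leader" is a deliberate choice in this sketch: when in doubt, the safer behavior for a baker is to stay idle.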
We are no longer using regional storage. Each node’s storage is local to one availability zone, which increases write performance. The overall storage usage and cost do not change, but CPU and RAM utilization goes up in this model.
The new active-standby mode can be activated with a Terraform variable named experimental_active_standby_mode. Set it to true in your terraform.tfvars file, then deploy a baker following our documentation.
Once deployed, the leader_elector container in baking-node-0 shows:
tze-tezos-private-baking-node-self-0 is the leader
And the baker and endorser logs show that supervisord has recognized the pod as the leader and started the daemons:
2021-01-22 23:01:25,454 INFO supervisord started with pid 6
We are now the leader, starting endorser
2021-01-22 23:01:55,234 INFO spawned: 'tezos-endorser' with pid 40
2021-01-22 23:01:56,266 INFO success: tezos-endorser entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
On tezos-baking-node-1, the baker and endorser logs show that they are not active.
We now simulate a failure of baking-node-0 by terminating the pod.
$ kubectl delete pod tze-tezos-private-baking-node-self-0 -n tze
pod "tze-tezos-private-baking-node-self-0" deleted
A few seconds later, the endorser on the other baking pod starts:
We are now the leader, starting endorser
2021-01-26 03:59:43,273 INFO spawned: 'tezos-endorser' with pid 333796
2021-01-26 03:59:44,309 INFO success: tezos-endorser entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
tezos-endorser: started
Node is bootstrapped.
Meanwhile, Kubernetes automatically restarts the node that we terminated. However, the master election is “sticky”. Baking-node-1 will remain the master and keep baking.
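The sticky behavior follows naturally from a lease-style lock. The sketch below is a toy, single-process model written for this post, not the code of the leader elector container: the Lease class, the try_acquire_or_renew function and the 10-second TTL are all assumptions. A real election stores the record in the Kubernetes API and relies on atomic updates, which this model leaves out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Lease:
    """A single shared lock record, standing in for the object kept in etcd."""

    holder: Optional[str] = None
    expires_at: float = 0.0


def try_acquire_or_renew(lease: Lease, candidate: str, now: float, ttl: float = 10.0) -> bool:
    """Return True if `candidate` is the leader after this attempt."""
    free = lease.holder is None or now >= lease.expires_at
    if lease.holder == candidate or free:
        # Either we already lead (renew) or the lease is up for grabs (acquire).
        lease.holder = candidate
        lease.expires_at = now + ttl
        return True
    # Someone else holds an unexpired lease: remain on standby.
    return False


if __name__ == "__main__":
    lease = Lease()
    print(try_acquire_or_renew(lease, "baking-node-0", now=0.0))   # node 0 takes the lease
    print(try_acquire_or_renew(lease, "baking-node-1", now=1.0))   # node 1 stays on standby
    # baking-node-0 is deleted and stops renewing; its lease runs out.
    print(try_acquire_or_renew(lease, "baking-node-1", now=11.0))  # node 1 takes over
    # baking-node-0 comes back, but the lease is held and still valid.
    print(try_acquire_or_renew(lease, "baking-node-0", now=12.0))  # node 0 stays on standby
```

Because a restarted pod can only take the lease once it is free or expired, the surviving node keeps leadership for as long as it keeps renewing.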
Double baking protection
The single biggest risk in proof-of-stake infrastructure is equivocation, which means producing or endorsing two different blocks at the same block height, essentially contradicting yourself. This is indistinguishable from malicious behavior and is therefore punishable by slashing of funds. A misconfigured active-standby setup is a frequent cause of equivocation.
For example, if two Kubernetes availability zones lose connectivity with one another (a split-brain scenario), there is a possibility that both baking nodes will think they are the master and start baking.
However, Tezos-on-GKE leverages remote signers. Your production setup should consist of a physical signer, attached to a hardware security module (such as the Ledger Nano), that connects to the cloud instances. The security module will refuse to sign several messages at the same block height and is therefore your last line of defense against double baking.
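The refusal rule is often described as a high watermark. The sketch below models it in Python under stated assumptions: the HighWatermarkSigner class is hypothetical, keeping one watermark per operation kind is a simplification chosen for this example, and the SHA-256 digest is only a placeholder for a real signature. It is not how any particular hardware module is implemented.

```python
import hashlib


class HighWatermarkSigner:
    """Refuses to sign twice at the same (or a lower) block height."""

    def __init__(self):
        # Highest level signed so far, per operation kind (an assumption).
        self._last_signed = {}

    def sign(self, kind: str, level: int, payload: bytes) -> str:
        last = self._last_signed.get(kind, -1)
        if level <= last:
            raise PermissionError(
                f"refusing to sign {kind} at level {level}: already signed level {last}"
            )
        self._last_signed[kind] = level
        # Placeholder only: a real signer would produce a cryptographic signature.
        return hashlib.sha256(payload).hexdigest()


if __name__ == "__main__":
    signer = HighWatermarkSigner()
    signer.sign("block", 100, b"block proposed by baking-node-0")
    try:
        # A second baker in a split-brain scenario asks for the same level.
        signer.sign("block", 100, b"block proposed by baking-node-1")
    except PermissionError as err:
        print(err)
```

The important property is that the check lives in the signer, so it still holds when both baking nodes believe they are the master.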
Remote signers can also be made redundant and highly available. However, the only interface to them is a single HAProxy instance, a very lightweight process that takes little time to load. It runs as a Kubernetes Deployment, which means that Kubernetes will ensure that only one runs at a time. HAProxy is configured to always forward signing requests to the first signer unless it is unreachable. The possibility of a double split-brain, where both baking nodes act as the master and send requests to two different signers, is very remote.
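The routing rule itself is simple enough to state in a few lines. The Python below only models the decision "first signer unless unreachable"; it is not HAProxy configuration, and the pick_signer function and the signer names are made up for this illustration.

```python
from typing import Callable, Optional, Sequence


def pick_signer(signers: Sequence[str], is_reachable: Callable[[str], bool]) -> Optional[str]:
    """Return the first reachable signer in priority order, or None."""
    for signer in signers:
        if is_reachable(signer):
            return signer
    return None


if __name__ == "__main__":
    signers = ["signer-0", "signer-1"]  # hypothetical names, in priority order
    print(pick_signer(signers, lambda s: True))             # primary is healthy
    print(pick_signer(signers, lambda s: s != "signer-0"))  # primary is down
```

Since every request passes through one selection point with a fixed priority order, two bakers asking at the same moment are routed to the same signer as long as the primary is reachable.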
Redundant silos separated by simple bottlenecks are the key to safe staking operations.
The active-standby feature is, however, still experimental. Omitting the configuration flag will deploy only one instance of the Tezos baker, as was the case in previous releases. You are invited to try out this feature on a testnet or private chain and do your own research before switching your production baker over to this model.
Tezos-on-GKE v2.0 is available here. We strive to make institutional-grade staking infrastructure available to you for free.
If you have any feedback or recommendations on how to improve the model, please contact us.