A Tezos Bakery on Kubernetes
With the Tezos-on-GKE project, you can deploy a complete baking infrastructure faster than ever before.
Update: we have now launched our staking-as-a-service company, MIDL.dev. We can help you deploy and manage the infrastructure described in this article. Contact us to become a baker!
There are many tools and instructions out there for baking on Tezos. We want to take it a step further. In this article, we present a simple way to spin up baking infrastructure in the cloud.
The Tezos network thrives when there are many baker nodes in the network. Babylon, the most consequential on-chain amendment to the Tezos protocol, went through a few hurdles when it went live a few days ago. If more people had joined Babylonnet (the testnet for the upgrade), more bugs would have been caught prior to launch. But it can be hard to deploy a replica of your baking setup for test purposes.
We are releasing this infrastructure-as-code to facilitate deployment of full baking nodes and increase baker efficiency. All code is released under the Apache 2 license.
It lets you spin up a new Tezos baking environment in a matter of minutes and have a synchronized node within hours! It requires only a handful of utilities (gcloud, docker, kubectl, terraform). There is no server to install.
Check it out! Review and clone the code at https://github.com/midl-dev/tezos-on-gke
Blockchains on Kubernetes
Kubernetes is not an obvious choice for blockchains. After all, it was started to help web-scale companies manage their deployments. But the ecosystem is maturing.
As Kubernetes becomes the standard way to deploy workloads in the cloud, it also becomes a natural place to host cryptocurrency nodes.
Tezos already builds and provides containers. We provide the YAML plumbing code to deploy a baking node in the cloud.
The resulting setup is highly available. The baking node has regional storage across two availability zones of a Google Cloud region. Should one zone go down, the node wakes up, already synchronized, in the other zone. Google promises four nines (99.99%) availability for a regional cluster. And the best part: it is all defined in code and deploys in minutes.
We take infrastructure as code to the extreme. There are literally no manual steps. You just need a Google Cloud account. Terraform creates everything for you: the private network, the load balancers and the cluster. It then builds and deploys your containers. When the deployment completes, your nodes are already syncing.
Your cloud baker can scale vertically. Unlike home baking, there is no need to rush a hardware upgrade when the requirements to run a node change. Just soup up the CPU, storage and RAM with a simple deploy command. It’s that easy!
The cloud is not good enough for the baking key
Cloud computing is great for deploying resources quickly and provides great uptime, but it brings a few security risks:
- the software may be exploited. In this setup, the node runs in a container hosted in a VM on cloud servers. Any layer may be vulnerable.
- you may accidentally leak your secrets (operator error)
- your credentials may get stolen
- the cloud provider may terminate your account
For these reasons, we have released provisioning code to help you host the signer in a location under your control rather than in the cloud. Cloud companies provide Cloud Hardware Security Modules that may be a good alternative (PRs welcome!), but they tend to be expensive, and you are still at risk of having your account terminated or getting locked out.
We recommend deploying two Raspberry Pi signers connected to one Ledger Nano each. These signers should have a UPS hat for resiliency against power outages and LTE dongles for resiliency against wired Internet access failure.
We provide Ansible source code to deploy the entire configuration on Raspbian OS, hosted at:
The hardware requirements to run a signing node are unlikely to change. Unlike the full node, which may require more CPU, RAM and storage as the network grows in popularity, signing on a Raspberry Pi and Ledger Nano combo is likely to keep on working for a long time.
High availability in Tezos comes with the risk of double-baking or double-endorsing. If you get into a state where two baking nodes are active at the same time, you may end up broadcasting two conflicting operations on the network, for which you get punished.
Let’s consider what needs to happen for that to be the case in the setup we are describing:
- the regional Kubernetes cluster gets into a split-brain situation where the two availability zones lose communication with each other and each one starts a baking node
- the load balancer fails over from signer 1 to signer 2. The Ledger app contains a protection against double baking: the high watermark increases after every operation, so two operations at the same block height are impossible on the same Ledger. The load balancer always targets signer 1 unless it goes offline. If signer 1 becomes unreachable, then signer 2 may sign an operation at a block height that signer 1 has already signed.
The combination of these two events within the span of a single block is unlikely. We believe the double-baking risk is mitigated in our setup.
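The interplay between the per-device high watermark and the failover rule can be made concrete with a toy model. The Python sketch below is illustrative only: the `Signer` class, the `route` function and the level numbers are hypothetical, and the real Ledger baking app applies more detailed rules than a single integer comparison.

```python
from dataclasses import dataclass


@dataclass
class Signer:
    """Toy model of a Ledger-backed signer that tracks its own high watermark."""
    name: str
    high_watermark: int = 0  # highest block level this device has signed
    online: bool = True

    def sign(self, level: int) -> str:
        # Per-device guard: refuse anything at or below the watermark.
        if level <= self.high_watermark:
            raise ValueError(
                f"{self.name}: level {level} is not above watermark {self.high_watermark}"
            )
        self.high_watermark = level
        return f"{self.name} signed level {level}"


def route(primary: Signer, backup: Signer) -> Signer:
    """Mimic the load balancer: always pick the primary unless it is offline."""
    if primary.online:
        return primary
    if backup.online:
        return backup
    raise RuntimeError("no signer available")


signer1 = Signer("signer1")
signer2 = Signer("signer2")

# Normal operation: signer1 signs level 100, and a repeat request is rejected.
first = route(signer1, signer2).sign(100)
try:
    route(signer1, signer2).sign(100)
    same_device_double_sign = True
except ValueError:
    same_device_double_sign = False

# Failover: signer1 drops out and signer2, whose watermark is still 0,
# accepts level 100 because watermarks are not shared between devices.
signer1.online = False
second = route(signer1, signer2).sign(100)
```

In this model a double signature requires both a second signing request at the same level (the split-brain cluster) and a failover between the two requests, which matches the two conditions listed above.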
Pay your rewards… on time
The baking infrastructure includes the awesome Backerei software from @cryptium. It runs in its own pod and sends payouts at the end of every cycle.
Once payouts are done, a Kubernetes cron job generates markdown payout files and deploys them to a static Jekyll website on a Google Storage Bucket, so delegators can look up their payouts.
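As a rough idea of what such a job produces, here is a minimal Python sketch that renders one payout page. It is not the code from the repository: the function name, the input format (a list of address and mutez pairs), the page title and the table layout are all assumptions made for illustration. The only conventions relied upon are that Jekyll pages start with a front matter block between `---` lines and that one tez equals 1,000,000 mutez.

```python
from typing import List, Tuple


def render_payout_page(cycle: int, payouts: List[Tuple[str, int]]) -> str:
    """Render a Jekyll markdown page listing payouts for one cycle.

    `payouts` is a hypothetical input format: (delegator address, amount in mutez).
    """
    lines = [
        "---",  # Jekyll front matter block
        f"title: Payouts for cycle {cycle}",
        "---",
        "",
        "| Delegator | Amount (XTZ) |",
        "| --- | --- |",
    ]
    for address, mutez in payouts:
        # Convert mutez to tez for display.
        lines.append(f"| {address} | {mutez / 1_000_000:.6f} |")
    return "\n".join(lines) + "\n"


# Placeholder address, not a real delegator.
page = render_payout_page(160, [("tz1-placeholder-a", 1_500_000)])
```

The resulting string can be written to a `.md` file and uploaded to the bucket that serves the static site.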
Nothing comes without an effort
Can anyone be a baker by typing a few commands now? No! You still need to understand how the process works. You need to maintain operational security. You need to pay attention to your delegations and avoid overdelegation. You need to communicate with your delegators. Most importantly, you need to do your duty as a baker and participate in the on-chain governance by voting.
But we are taking away the grunt work that every baker goes through. Just deploy a fully featured baking infrastructure on a testnet, learn how it works and focus on your operations.
It is critical to monitor your baking operation and receive alerts when something is not right.
We are working on releasing our monitoring code. There are two components to it:
- internal monitoring: is the baking node connected to two public nodes? Are the two remote signers ready to sign operations? Is the most recent block current?
Prometheus seems to be a good candidate to monitor these metrics. A Tezos Prometheus exporter exists and should be appropriate to run as a sidecar of the baking node.
Google Stackdriver has some alerting capabilities that may be useful to act upon these metrics.
- external monitoring: while cluster observability is essential, you must also roll out an external node to observe the behavior of the baker from the point of view of the network itself. We recommend deploying tezos-network-monitor from Polychain Labs.
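Two of the internal checks listed above can be sketched in a few lines of Python against the node RPC. This is not our monitoring code, only a sketch under stated assumptions: the RPC address, the function names and the freshness threshold are hypothetical, and the sketch assumes the node exposes `/chains/main/blocks/head/header` (with a `timestamp` field) and `/network/connections` (a list of peers).

```python
import json
from datetime import datetime, timedelta, timezone
from urllib.request import urlopen

NODE_RPC = "http://127.0.0.1:8732"  # assumption: node RPC reachable locally


def fetch_json(path: str):
    """GET a path on the node RPC and decode the JSON response."""
    with urlopen(NODE_RPC + path, timeout=5) as response:
        return json.load(response)


def head_is_current(header: dict, now: datetime, max_age: timedelta) -> bool:
    """True if the head block timestamp is no older than max_age."""
    # Block timestamps are expected in the form "2019-10-20T12:00:00Z".
    stamp = datetime.strptime(header["timestamp"], "%Y-%m-%dT%H:%M:%SZ")
    stamp = stamp.replace(tzinfo=timezone.utc)
    return now - stamp <= max_age


def enough_peers(connections: list, minimum: int = 2) -> bool:
    """True if the node has at least `minimum` open peer connections."""
    return len(connections) >= minimum


def check_node(max_age: timedelta = timedelta(minutes=5)) -> dict:
    """Run both checks against the live node and return their results."""
    header = fetch_json("/chains/main/blocks/head/header")
    connections = fetch_json("/network/connections")
    return {
        "head_is_current": head_is_current(header, datetime.now(timezone.utc), max_age),
        "enough_peers": enough_peers(connections),
    }
```

Calling `check_node()` from a sidecar or a cron job gives two booleans that can feed an alerting rule. Signer readiness is not covered here and needs a separate check.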
Hey, Tezos developers! Feature requests
There are a few improvements in the Tezos codebase that could make cloud baking infrastructure simpler, cheaper and more resilient.
Special storage mode for bakers
To compute payouts, bakers need to go back in the staking history, further than the “full mode” keeps. In this setup, the central baking node currently runs in archive mode.
Most of the archive data is useless to us, however. It would be great to have a storage mode that drops all the old data except the staking data of a given baking address, which is kept forever.
Bakers and archive nodes both provide a service to the network, but they need not be the same entities. This proposal should significantly reduce the space requirements on baking nodes, making them cheaper to operate.
We use HAProxy to probe signers for liveness. However, this probe only signals that the signer has the right address configured. It does not indicate whether the Ledger is unlocked, or whether the baking app is running and set up to bake for this address.
Locally on the signer, one can run the following command to perform these checks:
tezos-client get ledger authorized path for <account-alias-or-ledger-uri>
But there is no way to trigger this command remotely using the HTTP endpoint. If this were possible, it would increase the reliability of our system.
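To illustrate what the current liveness probe can and cannot tell, here is a minimal Python equivalent of it. This is a sketch, not our HAProxy configuration: the function names are hypothetical, and it assumes the signer answers `GET /keys/<public-key-hash>` over HTTP with a JSON object containing a `public_key` field.

```python
import json
from urllib.error import URLError
from urllib.request import urlopen


def response_is_healthy(status: int, body: str) -> bool:
    """True if the signer answered 200 with a JSON body holding a public key."""
    if status != 200:
        return False
    try:
        payload = json.loads(body)
    except ValueError:
        return False
    return isinstance(payload, dict) and bool(payload.get("public_key"))


def probe_signer(base_url: str, public_key_hash: str) -> bool:
    """Ask the signer for the public key matching the baking address.

    A True result only shows that the signer knows the address. It says
    nothing about whether the Ledger is unlocked or the baking app is open.
    """
    url = f"{base_url}/keys/{public_key_hash}"
    try:
        with urlopen(url, timeout=3) as response:
            body = response.read().decode("utf-8", errors="replace")
            return response_is_healthy(response.status, body)
    except (URLError, OSError):
        # Connection refused, timeout or HTTP error: treat the signer as down.
        return False
```

A remote equivalent of the `get ledger authorized path` check would let this probe fail as soon as the Ledger stops being ready to sign, which is the feature requested here.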
This code is still fresh, since we have only deployed our own baker with it. A lot of parameter combinations have not been tested, and a few features are missing.
For example, it is currently not possible to skip the remote signers and instead keep a hot baking private key stored as a Kubernetes secret. For Babylonnet or Zeronet, that may be useful.