By Luke Youngblood
Since our launch in 2018, Coinbase Custody has been driving innovation on behalf of our clients, who increasingly want to participate in the networks that they have invested in. These new forms of network participation extend beyond providing the most secure and trusted digital asset custody available, and include activities like voting in governance and staking. We have been actively staking, or participating in the creation of new blocks in the digital ledger, on the Tezos network since early 2019. This is sometimes referred to as “baking.”
Supporting New Financial Products
In Fall 2019, Amun launched the first Tezos Exchange Traded Product (ETP) on the SIX Swiss Exchange. The Amun Tezos ETP is exposed to the price performance of the Tezos/XTZ token, but also generates a yield due to staking (baking) performed by Amun on behalf of investors. Coinbase Custody not only provides custody of the underlying Tezos/XTZ tokens for Amun, it also stakes those tokens on their behalf. However, Amun needed this staking activity to take place in the EU to comply with regulations, and at the time the Coinbase Custody Tezos Bakery was operating in the US. This is the story of how we migrated our Tezos Bakery from the US to Ireland with only 1 minute of downtime.
Securing the Tezos Network
In proof of stake networks, unlike proof of work networks, security and operational skills take the place of raw compute power and electricity as the central mechanism that secures transactions in the digital ledger. Validators, which play the role that miners play in proof of work, provide secure and highly available cryptographic attestations by voting on whether or not transactions in the digital ledger are valid.
Voting and Signature Generation
In order to vote or produce a block that other validators will vote on, a validator must sign the vote or block with their private key. If we simply stored the private key on our validator node, any attacker that was somehow able to compromise our validator node would not only be able to compromise our security deposits, which are required to stake on the Tezos network, they might also be able to attack the network itself by voting on an alternate version of the digital ledger that censors transactions. In proof of work this is called a 51% attack, and in proof of stake the consequences would be similar. The ability to censor transactions could compromise the core tenets of these decentralized networks: censorship resistance and immutability.
To increase the security of these networks and deter validators from misbehaving and censoring transactions, validators must put up security deposits, sometimes called a bond. If any participant on the network discovers that a validator has voted twice at the same block height, that validator forfeits its security deposits and rewards. This is known as “slashing” and is an important mechanism that protects proof of stake networks. The Coinbase Custody bond serves as the security deposit that mitigates the risk of potential loss of client funds.
Prior to the launch of the Tezos betanet in June 2018, early developers in the community such as Blockscale, Nomadic Labs, and Obsidian Systems developed a novel solution to help secure proof of stake networks: the remote signer. Instead of storing private keys on a validator node itself, where they might be compromised by an attacker, private keys could be stored in a remote system, which might include a hardware security module (HSM) to prevent the exfiltration of private key material. In addition, a monotonically increasing counter could be incremented as each vote was signed, which for the first time enabled a critical safety feature: double signing protection. If a vote had already been signed at that height, any attempt to sign another vote at the same height would be rejected, protecting the validator from the risk of slashing.
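To make the double signing protection concrete, here is a minimal sketch of a signer that keeps a monotonically increasing height counter. It is illustrative only: the `RemoteSigner` and `DoubleSignError` names are hypothetical, and an HMAC stands in for the real signature scheme and for the HSM that would hold the key in a production signer.

```python
import hashlib
import hmac


class DoubleSignError(Exception):
    """Raised when a signature is requested at or below an already-signed height."""


class RemoteSigner:
    """Toy remote signer: the key stays inside this object (a stand-in for an HSM)."""

    def __init__(self, secret_key: bytes) -> None:
        self._secret_key = secret_key
        self._last_signed_height = -1  # monotonically increasing counter

    def sign(self, height: int, payload: bytes) -> str:
        # Refuse anything at or below the highest height already signed.
        if height <= self._last_signed_height:
            raise DoubleSignError(
                f"already signed at height {self._last_signed_height}"
            )
        # Advance the counter before producing the signature, so a failure
        # between the two steps can only cause a missed vote, not a double vote.
        self._last_signed_height = height
        message = height.to_bytes(8, "big") + payload
        # HMAC-SHA256 is an assumption used here in place of a real signature.
        return hmac.new(self._secret_key, message, hashlib.sha256).hexdigest()
```

A second call to `sign` at the same height raises `DoubleSignError`, even if the payload differs, which is exactly the case that would otherwise lead to slashing.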
Coinbase Custody Validator Security
Coinbase Custody deploys our validators in accordance with the best practices in validator security. We leverage a remote signing solution that uses a mutex, or mutually exclusive lock. A mutex is a concept in computer science that is widely used in electronic trading systems and markets where downtime is not an option, but executing a trade twice would also be unacceptable. Multiple computer systems can attempt to execute a trade, for example, but only a single system will be able to acquire the mutex and execute the trade. This ensures that a trade is executed exactly once. When one of our validators wants to sign a vote, it must first acquire the mutex, at which point it will update the monotonically increasing counter, which represents the block height at which the vote is signed, and finally generate the cryptographic signature necessary to vote. If a validator attempts to sign a vote for a block height that has already been voted on, the monotonically increasing counter will prevent this. If a validator attempts to update the counter, but another validator has already acquired the mutex, the update will fail and signature generation will be prevented.
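The interaction between the mutex and the counter can be sketched as follows. This is a simplified, single-process model and not our production design: `SharedSignerState` is a hypothetical stand-in for the strongly consistent store that holds both the lock and the counter, and `try_vote` only decides whether signing would be allowed.

```python
import threading
from typing import Optional


class SharedSignerState:
    """Stand-in for a strongly consistent store holding the mutex and counter."""

    def __init__(self) -> None:
        self._guard = threading.Lock()       # protects the fields below
        self._holder: Optional[str] = None   # validator holding the mutex, if any
        self._height = -1                    # highest block height voted on so far

    def try_acquire(self, validator_id: str) -> bool:
        with self._guard:
            if self._holder is None or self._holder == validator_id:
                self._holder = validator_id
                return True
            return False

    def release(self, validator_id: str) -> None:
        with self._guard:
            if self._holder == validator_id:
                self._holder = None

    def advance(self, validator_id: str, height: int) -> bool:
        """Move the counter forward; fails without the mutex or for a stale height."""
        with self._guard:
            if self._holder != validator_id or height <= self._height:
                return False
            self._height = height
            return True


def try_vote(state: SharedSignerState, validator_id: str, height: int) -> bool:
    """Return True only if this validator may sign a vote at `height`."""
    if not state.try_acquire(validator_id):
        return False  # another validator holds the mutex
    return state.advance(validator_id, height)
```

In this model, a second validator calling `try_vote` while the first still holds the mutex is refused, and the holder itself is refused if it asks for a height that has already been voted on.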
Testing and Safety
When designing and implementing highly secure distributed systems such as the remote signing solution, testing and safety are of paramount importance. For example, early in our development process, we discovered an edge case that might have led to double signing in special circumstances. To minimize these risks, we not only write comprehensive unit tests for every portion of the remote signer, we also continuously run more than one validator on Tezos testnets, with all of them attempting to sign messages at all times. This gives us continuous assurance that our double-signing protection is working. We release new software updates to the testnet validators first, and let them operate for a period of time, before releasing updates to mainnet. By testing this critical safety feature continuously, we can minimize the risk of a software defect that leads to slashing.
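The idea of running several validators that all try to sign at once can be illustrated with a small race test. This is a sketch under simplifying assumptions: threads stand in for validators, and the hypothetical `GuardedCounter` stands in for the remote signer's height check. It is not our actual test suite.

```python
import threading
from collections import Counter


class GuardedCounter:
    """Minimal height check: at most one successful sign per block height."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._height = -1

    def try_sign(self, height: int) -> bool:
        with self._lock:
            if height <= self._height:
                return False
            self._height = height
            return True


def run_race(num_validators: int, heights: range) -> Counter:
    """Let every validator attempt every height; count successes per height."""
    counter = GuardedCounter()
    signed: Counter = Counter()
    signed_lock = threading.Lock()

    def validator() -> None:
        for height in heights:
            if counter.try_sign(height):
                with signed_lock:
                    signed[height] += 1

    threads = [threading.Thread(target=validator) for _ in range(num_validators)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return signed
```

A call such as `run_race(4, range(200))` should never report a height with more than one successful signature, no matter how the threads interleave.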
The Challenge: Migrating to Ireland
With that background on how our remote signing solution protects against double signing and slashing, it is easier to see why migrating our validator from the US to Ireland presented a unique challenge. The mutex lock we leverage requires strong consistency, and offers high availability by operating in multiple data centers in close geo-proximity to each other (less than 100 miles). We would need a separate mutex lock in Ireland, because light can’t travel fast enough across the Atlantic Ocean to provide the same strong consistency across such a long distance. Having a separate mutex lock means the two are no longer mutually exclusive, and the double signing risk during a migration is greatly increased.
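A rough back-of-the-envelope calculation shows why distance matters here. The figures below are assumptions for illustration: light in optical fiber is taken to travel at about two thirds of its vacuum speed, and 5,000 km is a round number for a US East Coast to Ireland path, not a measured route. The result is a physical lower bound; real network round trips are slower.

```python
SPEED_OF_LIGHT_KM_S = 299_792.458   # speed of light in vacuum
FIBER_FACTOR = 2 / 3                # assumed slowdown of light in optical fiber
KM_PER_MILE = 1.609344


def min_round_trip_ms(distance_km: float) -> float:
    """Lower bound on round-trip time over fiber, ignoring routing and processing."""
    one_way_seconds = distance_km / (SPEED_OF_LIGHT_KM_S * FIBER_FACTOR)
    return 2 * one_way_seconds * 1000


local_km = 100 * KM_PER_MILE        # the "less than 100 miles" bound from the text
transatlantic_km = 5_000            # assumed round figure, not a measured route

local_ms = min_round_trip_ms(local_km)
transatlantic_ms = min_round_trip_ms(transatlantic_km)

print(f"nearby data centers: at least {local_ms:.1f} ms per round trip")
print(f"transatlantic:       at least {transatlantic_ms:.1f} ms per round trip")
```

Under these assumptions the transatlantic bound is roughly thirty times the local one, and a strongly consistent lock typically needs at least one such round trip for every update.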
Another challenge we faced is that Tezos validators require local access to the blockchain storage of a full Tezos node to create blocks. At the time of writing, this required over 300GB of storage that is tightly coupled to the validator. At Coinbase Custody, we leverage an automated deployment tool called Codeflow that rehydrates this data from secondary storage, and starts the node and validator. This process is fully automated to increase security and reduce the chance of human error, but due to the large amount of data, it can take about an hour to complete.
Given our constraints, we considered two migration options:
- Completely stop the validator in the US, then start a deployment of the Ireland validator. This would result in about an hour of downtime while the Ireland validator rehydrated data from secondary storage, before the validator process was started, but it would ensure that we didn’t risk double-signing.
- Start the validator in Ireland and attempt to stop the US validator around the time that the Ireland validator had completed rehydrating data. This was deemed too risky because if there was any overlapping time where both validators were operating, we could potentially double sign and expose ourselves to slashing risk.
Both of these options seemed sub-optimal to us. In addition, our infrastructure in Ireland was brand new and had not been completely tested in production end-to-end, so the risk of a failed migration was great. We might have to roll back to the US infrastructure, which would require an additional hour of downtime. We began to consider a third, more creative option: decoupling the endorser.
Decoupling the Endorser
A Tezos validator has two main components: the baker, which produces new blocks that include transactions in the digital ledger, and the endorser, which votes on blocks that other validators produce. The baker is tightly coupled to a local Tezos node, and requires access to the local storage to create new blocks. The endorser, by contrast, is a lightweight, stateless client that only needs to communicate with a Tezos node over RPC, and with the remote signer to generate signatures. The fundamental insight we gained was that by decoupling the endorser from the full Tezos node required to run a validator, we could perform an almost downtime-free migration. Large Tezos validators typically only produce or bake new blocks every few minutes or hours, but they need to vote or endorse almost every minute. We could find a window where we didn’t have to produce blocks for a couple of hours and migrate the baker during that time period, and we could migrate the endorser separately.
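Finding a window with no scheduled baking can be sketched as a simple gap search over upcoming baking slots. The function name and the hard-coded schedule in the usage example are hypothetical; in practice the upcoming slots would come from a Tezos node rather than a list.

```python
from datetime import datetime, timedelta
from typing import Optional, Sequence, Tuple


def find_quiet_window(
    now: datetime,
    baking_slots: Sequence[datetime],
    needed: timedelta,
) -> Optional[Tuple[datetime, datetime]]:
    """Return the first (start, end) gap of at least `needed` before a baking slot.

    Only gaps that end at a known slot are returned, because nothing is
    known about the schedule beyond the last slot in the list.
    """
    start = now
    for slot in sorted(baking_slots):
        if slot < now:
            continue              # ignore slots that have already passed
        if slot - start >= needed:
            return (start, slot)
        start = slot              # the next gap begins once this block is baked
    return None
```

For example, with slots at 30, 75, 200, and 230 minutes from now and a 2-hour requirement, the first usable gap runs from the 75-minute slot to the 200-minute slot.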
Completing the Migration
First, we decoupled the endorser. We created a new microservice that contained only the Tezos endorser client, and configured it to talk to the same Tezos edge nodes that we use to broadcast transactions to the rest of the Tezos network. These nodes do not run any validator processes, and there are 5 of them in both the US and Ireland to achieve better connectivity and availability. The endorser also needed to get signatures from the remote signer, just like the baker. If you recall the limitations of our mutex lock from earlier, we couldn’t have strong consistency across the Atlantic Ocean, as the distance was too great, so we knew we had to completely shut down the endorser in the US prior to starting it in Ireland. To be completely assured that we would not double sign or vote, we decided to stop the endorser in the US, wait for a single vote to be missed, and only then start the endorser in Ireland. That single missed vote, about 1 minute or one block, was the only downtime in the migration.
Once we had successfully migrated the endorser, we were still producing new blocks in the US while voting, or endorsing, from Ireland. We were then able to find a 2-hour window when no blocks needed to be produced, migrate the baker process during that window, and complete the migration to Ireland.
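The endorser cutover described above (stop in the US, wait for one missed vote, then start in Ireland) can be sketched as a short procedure. Everything here is hypothetical scaffolding: the `Endorser` class, the `blocks` iterable of new block heights, and the `endorsed_by_us` callback are stand-ins for real deployment and chain-monitoring tools.

```python
from typing import Callable, Iterable


class Endorser:
    """Hypothetical handle for an endorser deployment in one region."""

    def __init__(self, region: str, running: bool) -> None:
        self.region = region
        self.running = running

    def stop(self) -> None:
        self.running = False

    def start(self) -> None:
        self.running = True


def migrate_endorser(
    old: Endorser,
    new: Endorser,
    blocks: Iterable[int],
    endorsed_by_us: Callable[[int], bool],
) -> int:
    """Stop `old`, wait for one observed missed endorsement, then start `new`.

    Returns the height of the block whose endorsement was missed.
    """
    if new.running:
        raise RuntimeError("new endorser must not be running before cutover")
    old.stop()
    for height in blocks:                 # new blocks as they arrive
        if not endorsed_by_us(height):    # evidence the old endorser is silent
            new.start()
            return height
    raise RuntimeError("no missed endorsement observed; new endorser not started")
```

The deliberate missed vote trades a small amount of reward for certainty: the new endorser only starts after the chain itself shows that the old one has gone quiet, so the two are never signing at the same time.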
Operating the largest Tezos validator with a significant amount of deposits or bonds at stake is not something Coinbase Custody takes lightly. We’ve invested heavily in the best possible security to minimize the risk of loss. I hope this post has been informative, and helped to educate you on the challenges of operating a large validator. At Coinbase Custody, we not only prioritize the security and protection of client funds, but we also strive to secure and protect the networks that we participate in on behalf of our clients. Investing in best in class security around our proof of stake validators is part of what makes us the most trusted digital asset custodian.