Thinking Out Loud - Consensus Quorum Bootstrapping

Note: You may want to read the the Paxos paper, the Raft paper, the Chubby paper, the Spanner paper, and the F1 SQL layer paper before trying to read this -- it won't make sense unless you're familiar with the concept of multiple consistency domains.

The following is a simple algorithm for bootstrapping a Raft consensus quorum in a multiple consistency domain system (which utilizes a location quorum for discovering relationships between replicas and namespaces). Also, I mention a mistake I made in my initial implementation which led me to this design.

First, let's cover what the goals of bootstrapping a consensus quorum group actually means. For one thing, we need to satisfy (2f+1) for failures. So, for a minimal quorum, we want our bootstrapped group to handle a single failure. We don't need to go higher than a single failure in the bootstrap phase since we can add additional follower replicas later on with proper consistency across a term change. Because of this, the bootstrap process will always be based on creating a group with exactly three nodes (two followers and a single leader).

The first mistake.

Early on in the development an implementation of this, the assumption was made that I should first bootstrap a leader, then tell the leader what followers are attached to it. This ended up being difficult and unwieldy. It is much cleaner to utilize the consensus voting to bootstrap replicas in a follower state so they naturally become leaders through a proper candidacy. This avoids the difficult task of manually configuring state for a replica to directly come into existence as a leader. The current bootstrap process is the result of learning from the mistake of attempting to directly bootstrap a leader.

The current design.

First, we bootstrap an idle follower. The definition of "idle" in this case is the lack of (2f+1) satisfaction as well as an empty remote hook array (outgoing socket state). The only operation an idle follower will perform is to wait on join follower messages or shutdown messages.

The next step is to join an additional follower replica to the idle follower. When this happens, a join response message is sent back to the new idle follower with any replicas that the original idle follower knows about (zero in this case). The original follower also attempts to send a join message response to any known replica hooks (still zero in the case of the original follower). Now, we have two idle followers with knowledge of each other.

When a third follower joins the initial two idle followers, we can finally satisfy (2f+1::f==1). The third follower will send a join request command (control join). It can send this command to either of the two existing idle followers. When the first or second idle follower receives this message, it will send out a control join response to any hook replicas it already knows about (while avoiding a loop by only sending a message with three replicas one time). Once this happens, each follower will have three known hook replicas. Since this satisfies our requirements for a quorum, the idle followers will then independently know to leave the bootstrap state at this point and begin their election timers. Eventually, one follower will win a candidate term and begin a leader term through the regular consensus failover process.

A finishing touch.

However, we're not done yet. Once a leader is elected, it will check to see if it has knowledge of the namespace it should write into. If the namespace does not exist, the leader will not respond to data messages and will only respond to control bootstrap leader namespace messages. Once the leader is told what namespace it's using, it will wait for the location quorum registration message (data write) to successfully respond before processing client data commands. The exception to this process is the bootstrap of the location quorum leader itself which will simply use the provided bootstrap namespace when it satisfies the criteria of a location quorum namespace pattern.