Bookie Ledger Recovery

When a Bookie crashes, any ledgers with data on that Bookie become underreplicated. There are two options for bringing the ledgers back to full replication: Autorecovery and Manual Bookie Recovery.

Autorecovery

Autorecovery runs as a daemon alongside the Bookie daemon on each Bookie. It detects when a bookie in the cluster has become unavailable and rereplicates all the ledgers that were on that bookie, so that those ledgers are brought back to full replication. See the Admin Guide for instructions on how to start autorecovery.

Manual Bookie Recovery

If autorecovery is not enabled, the administrator can manually rereplicate the data from the failed bookie.

To run recovery with zk1.example.com as the ZooKeeper ensemble and 192.168.1.10 as the failed bookie, run the following:

bookkeeper-server/bin/bookkeeper org.apache.bookkeeper.tools.BookKeeperTools zk1.example.com:2181 192.168.1.10:3181

It is necessary to specify both the host and the port of the failed bookie, as this is how the bookie identifies itself to ZooKeeper. It is possible to specify a third argument, which is the bookie to replicate to. If this is omitted, as in our example, a random bookie is chosen for each ledger segment. A ledger segment is a continuous sequence of entries in a ledger that share the same ensemble of bookies.
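To make the notion of a ledger segment concrete, the following is a minimal Python sketch. It assumes that ledger metadata can be modelled as a mapping from the first entry id of each segment to that segment's ensemble. The function name `segments_containing` and the dictionary layout are illustrative only and are not part of the BookKeeper tools.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical model: first entry id of each segment -> ensemble of bookies.
Ensembles = Dict[int, List[str]]


def segments_containing(
    ensembles: Ensembles, last_entry: Optional[int], bookie: str
) -> List[Tuple[int, Optional[int]]]:
    """Return (first, last) entry ranges of the segments whose ensemble includes `bookie`."""
    starts = sorted(ensembles)
    result = []
    for i, start in enumerate(starts):
        # A segment ends just before the next one starts; the final segment
        # ends at the last entry of the ledger (None if still being written).
        end = starts[i + 1] - 1 if i + 1 < len(starts) else last_entry
        if bookie in ensembles[start]:
            result.append((start, end))
    return result


ledger = {
    0: ["192.168.1.10:3181", "192.168.1.11:3181", "192.168.1.12:3181"],
    100: ["192.168.1.11:3181", "192.168.1.12:3181", "192.168.1.13:3181"],
}
print(segments_containing(ledger, 250, "192.168.1.10:3181"))  # [(0, 99)]
```

In this example only the first segment (entries 0 to 99) would need rereplication after 192.168.1.10:3181 fails, because the ensemble changed at entry 100.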

AutoRecovery Internals

Autorecovery has two components:

Auditor, a singleton node which watches for bookie failure and creates rereplication tasks for the ledgers on failed bookies.
ReplicationWorker, which runs on each Bookie, takes rereplication tasks, and executes them.

Both components run as threads in the AutoRecoveryMain process, which runs on each Bookie in the cluster. All recovery nodes participate in leader election to decide which node becomes the auditor. Those that fail to become the auditor watch the elected auditor and run the election again if they see that it has failed.
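The election behaviour can be illustrated with a small in-memory simulation. This sketch assumes the common ZooKeeper recipe in which each candidate holds a sequentially numbered ephemeral vote and the lowest number wins; the text above does not say which recipe AutoRecoveryMain uses, and the `Election` class is hypothetical.

```python
class Election:
    """In-memory stand-in for an election based on ephemeral sequential votes."""

    def __init__(self):
        self._next_seq = 0
        self._candidates = {}  # sequence number -> node name

    def join(self, node):
        # Each recovery node registers a vote with the next sequence number.
        seq = self._next_seq
        self._next_seq += 1
        self._candidates[seq] = node
        return seq

    def fail(self, node):
        # A crashed node loses its session, so its ephemeral vote disappears.
        self._candidates = {s: n for s, n in self._candidates.items() if n != node}

    def auditor(self):
        # Assumption: the candidate with the lowest sequence number is the auditor.
        if not self._candidates:
            return None
        return self._candidates[min(self._candidates)]


election = Election()
for name in ["bookie-a", "bookie-b", "bookie-c"]:
    election.join(name)

first_auditor = election.auditor()
election.fail("bookie-a")  # the remaining nodes notice and re-run the election
second_auditor = election.auditor()
print(first_auditor, second_auditor)  # bookie-a bookie-b
```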

Auditor

The auditor watches the list of bookies registered with ZooKeeper in the cluster. A Bookie registers with ZooKeeper during startup. If the bookie crashes or is killed, the bookie's registration disappears. The auditor is notified of changes in the registered bookies list.

When the auditor sees that a bookie has disappeared from the list, it immediately scans the complete ledger list to find ledgers which have stored data on the failed bookie. Once it has a list of ledgers which need to be rereplicated, it publishes a rereplication task for each ledger under the /underreplicated/ znode in ZooKeeper.
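The auditor's scan can be sketched as follows. This is a simplified model, not the Auditor's code: ledger metadata is assumed to be a mapping from ledger id to its segments (first entry id to ensemble), and the child naming under /underreplicated/ is hypothetical.

```python
def find_underreplicated(ledgers, failed_bookie):
    """Return the ids of ledgers with at least one segment stored on the failed bookie."""
    return sorted(
        ledger_id
        for ledger_id, ensembles in ledgers.items()
        if any(failed_bookie in ensemble for ensemble in ensembles.values())
    )


def task_paths(ledger_ids, root="/underreplicated"):
    """Build one task path per ledger (the path layout is an assumption)."""
    return [f"{root}/{ledger_id}" for ledger_id in ledger_ids]


# ledger id -> {first entry id of segment -> ensemble}
ledgers = {
    1: {0: ["b1:3181", "b2:3181"]},
    2: {0: ["b2:3181", "b3:3181"]},
    3: {0: ["b2:3181", "b3:3181"], 50: ["b1:3181", "b3:3181"]},
}

affected = find_underreplicated(ledgers, "b1:3181")
print(task_paths(affected))  # ['/underreplicated/1', '/underreplicated/3']
```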

ReplicationWorker

Each replication worker watches for tasks being published in the /underreplicated/ znode. When a new task appears, it will try to get a lock on it. If it cannot acquire the lock, it tries the next entry. The locks are implemented using ZooKeeper ephemeral znodes.
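The locking behaviour can be sketched with an in-memory stand-in for ZooKeeper. The sketch relies on two properties of ephemeral znodes: creating a node that already exists fails, and a node vanishes when the session that created it ends. `FakeZooKeeper`, `claim_next_task`, and the /locks/ path are hypothetical names used only for illustration.

```python
class FakeZooKeeper:
    """Minimal in-memory model of ephemeral znodes."""

    def __init__(self):
        self.ephemeral = {}  # path -> owning session

    def create_ephemeral(self, path, session):
        # Creation fails if the node already exists, which is what makes it a lock.
        if path in self.ephemeral:
            return False
        self.ephemeral[path] = session
        return True

    def close_session(self, session):
        # Ephemeral nodes disappear with their session, so a crashed
        # worker cannot hold a lock forever.
        self.ephemeral = {p: s for p, s in self.ephemeral.items() if s != session}


def claim_next_task(zk, tasks, session):
    """Try each task in turn; return the first one this worker manages to lock."""
    for task in tasks:
        if zk.create_ephemeral(f"/locks/{task}", session):
            return task
    return None


zk = FakeZooKeeper()
tasks = ["ledger-1", "ledger-2"]
claims = [claim_next_task(zk, tasks, worker) for worker in ("w1", "w2", "w3")]
print(claims)  # ['ledger-1', 'ledger-2', None]

zk.close_session("w1")  # w1 crashes, its lock is released automatically
retry = claim_next_task(zk, tasks, "w3")
print(retry)  # ledger-1
```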

The replication worker scans through the rereplication task's ledger for segments of which its local bookie is not a member. When it finds segments matching these criteria, it replicates the entries of each such segment to the local bookie. If, after this process, the ledger is fully replicated, the ledger's entry under /underreplicated/ is deleted and the lock is released. If there is a problem replicating, or some segments in the ledger are still underreplicated (because the local bookie is already part of the ensemble for those segments), then the lock is simply released.
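The two possible outcomes of a task can be sketched as below. The segment representation (a dict with an `ensemble` list and an `underreplicated` flag) and the `replicate` callback are assumptions made for this illustration; the real worker reads this information from ledger metadata.

```python
def run_task(segments, local_bookie, replicate):
    """Process one rereplication task and report what happens to the task and lock.

    `replicate` is a hypothetical callback that copies a segment to the local
    bookie and returns True on success.
    """
    for segment in segments:
        if not segment["underreplicated"]:
            continue
        if local_bookie in segment["ensemble"]:
            # The local bookie cannot hold a second copy; leave it for another worker.
            continue
        if replicate(segment):
            segment["underreplicated"] = False

    if any(s["underreplicated"] for s in segments):
        return "release lock"
    return "delete task and release lock"


segments = [
    {"ensemble": ["b2", "b3"], "underreplicated": True},
    {"ensemble": ["b3", "b4"], "underreplicated": True},
]

# Worker on b2 can only fix the second segment, so the task stays published.
first = run_task(segments, "b2", replicate=lambda seg: True)
# Worker on b5 picks the task up later and finishes it.
second = run_task(segments, "b5", replicate=lambda seg: True)
print(first, "|", second)  # release lock | delete task and release lock
```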

If the replication worker finds a segment which needs rereplication but does not have a defined endpoint (i.e. the final segment of a ledger currently being written to), it waits for a grace period before attempting rereplication. If, after the grace period, the segment still does not have a defined endpoint, the ledger is fenced and rereplication then takes place. This avoids the case where a client is writing to a ledger and one of the bookies goes down, but the client has not written an entry to that bookie before rereplication takes place. The client could continue writing to the old segment even though the ensemble for the segment had changed, which could lead to data loss. Fencing prevents this scenario from happening. In the normal case, the client will try to write to the failed bookie within the grace period and will have started a new segment before rereplication starts. See the Admin Guide for how to configure this grace period.
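The grace-period decision can be sketched as a small function. The callbacks `has_endpoint` and `fence` are hypothetical stand-ins for checking the ledger metadata and fencing the ledger; the `sleep` parameter exists only so that the wait can be replaced in the example.

```python
import time


def await_endpoint(has_endpoint, fence, grace_period_s, sleep=time.sleep):
    """Decide how to handle a segment that may still be open for writing."""
    if has_endpoint():
        return "replicate"
    # Give the writing client time to notice the failure and start a new segment.
    sleep(grace_period_s)
    if not has_endpoint():
        # Still open: fence the ledger so the client cannot keep writing to it.
        fence()
        return "fence, then replicate"
    return "replicate"


# Normal case: the client closes the segment during the grace period.
state = {"closed": False}
fenced = []
normal = await_endpoint(
    has_endpoint=lambda: state["closed"],
    fence=lambda: fenced.append("ledger"),
    grace_period_s=30,
    sleep=lambda seconds: state.update(closed=True),  # simulated wait
)

# Stuck case: the segment never gets an endpoint, so the ledger is fenced.
stuck = await_endpoint(
    has_endpoint=lambda: False,
    fence=lambda: fenced.append("ledger"),
    grace_period_s=30,
    sleep=lambda seconds: None,  # simulated wait
)
print(normal, "|", stuck)  # replicate | fence, then replicate
```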

The Rereplication process

The ledger rereplication process is as follows.

The client goes through all ledger segments in the ledger, selecting those which contain the failed bookie;
A recovery process is initiated for each ledger segment in this list;
1. The client selects a bookie to which all entries in the ledger segment will be replicated. In the case of autorecovery, this will always be the local bookie;
2. The client reads entries that belong to the ledger segment from other bookies in the ensemble and writes them to the selected bookie;
3. Once all entries have been replicated, the ZooKeeper metadata for the segment is updated to reflect the new ensemble;
4. The segment is marked as fully replicated in the recovery tool;
Once all ledger segments are marked as fully replicated, the ledger is marked as fully replicated.
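The steps above can be sketched as a simulation over in-memory data. The segment layout (`ensemble` plus a list of entry ids) and the `stored` mapping from bookie to entries are assumptions for illustration, and the sketch raises `StopIteration` if no surviving bookie holds an entry, where a real tool would report an error.

```python
def rereplicate_ledger(segments, stored, failed, target):
    """Copy every segment that used `failed` onto `target` and update its ensemble.

    Returns True once no segment of the ledger references the failed bookie.
    """
    for segment in segments:
        ensemble = segment["ensemble"]
        if failed not in ensemble:
            continue
        survivors = [b for b in ensemble if b != failed]
        for entry_id in segment["entries"]:
            # Read the entry from any surviving bookie that holds it ...
            source = next(b for b in survivors if entry_id in stored.get(b, {}))
            # ... and write it to the selected bookie.
            stored.setdefault(target, {})[entry_id] = stored[source][entry_id]
        # Only after all entries are copied is the ensemble metadata updated.
        ensemble[ensemble.index(failed)] = target
    return all(failed not in s["ensemble"] for s in segments)


segments = [
    {"ensemble": ["b1", "b2", "b3"], "entries": [0, 1]},
    {"ensemble": ["b2", "b3", "b4"], "entries": [2]},
]
stored = {
    "b2": {0: "x", 1: "y", 2: "z"},
    "b3": {0: "x", 1: "y", 2: "z"},
    "b4": {2: "z"},
}

done = rereplicate_ledger(segments, stored, failed="b1", target="b5")
print(done, segments[0]["ensemble"], stored["b5"])
# True ['b5', 'b2', 'b3'] {0: 'x', 1: 'y'}
```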

The Manual Bookie Recovery process

The manual bookie recovery process is as follows.

The client reads the metadata of active ledgers from ZooKeeper;
From this, the ledgers which contain segments using the failed bookie in their ensemble are selected;
A recovery process is initiated for each ledger in this list;
1. The ledger rereplication process is run for each ledger;
Once all ledgers are marked as fully replicated, bookie recovery is finished.
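The overall manual recovery loop can be sketched as follows. The copying of entries is omitted here and only the selection of ledgers and target bookies is shown. Each ledger is modelled as a list of ensembles, one per segment, and the rule that the target must not already be in the segment's ensemble is an assumption of this sketch. `recover_bookie` is a hypothetical name, not the BookKeeperTools implementation.

```python
import random


def recover_bookie(ledgers, failed, available, target=None, rng=random):
    """Replace `failed` in every segment that uses it; return the recovered ledger ids."""
    recovered = []
    for ledger_id in sorted(ledgers):
        touched = False
        for ensemble in ledgers[ledger_id]:
            if failed not in ensemble:
                continue
            if target is not None:
                chosen = target  # explicit third argument on the command line
            else:
                # Otherwise a random bookie is chosen for each ledger segment.
                candidates = [b for b in available if b not in ensemble]
                chosen = rng.choice(candidates)
            ensemble[ensemble.index(failed)] = chosen
            touched = True
        if touched:
            recovered.append(ledger_id)
    return recovered


# ledger id -> list of ensembles, one per segment
ledgers = {
    1: [["b1", "b2"], ["b2", "b3"]],
    2: [["b2", "b3"]],
}

recovered = recover_bookie(
    ledgers, failed="b1", available=["b2", "b3", "b4"], rng=random.Random(0)
)
print(recovered)  # [1]
```

Passing `target="b4"` instead of relying on the random choice corresponds to supplying the optional third argument to the recovery tool.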