When a Bookie crashes, any ledgers with data on that Bookie become underreplicated. There are two options for bringing the ledgers back to full replication, Autorecovery and Manual Bookie Recovery.
Autorecovery runs as a daemon alongside the Bookie daemon on each Bookie. Autorecovery detects when a bookie in the cluster has become unavailable, and rereplicates all the ledgers which were on that bookie, so that those ledgers are brough back to full replication. See the Admin Guide for instructions on how to start autorecovery.
If autorecovery is not enabled, it is possible for the adminisatrator to manually rereplicate the data from the failed bookie.
To run recovery, with zk1.example.com as the zookeeper ensemble, and 192.168.1.10 as the failed bookie, do the following:
bookkeeper-server/bin/bookkeeper org.apache.bookkeeper.tools.BookKeeperTools zk1.example.com:2181 192.168.1.10:3181
It is necessary to specify the host and port portion of failed bookie, as this is how it identifies itself to zookeeper. It is possible to specify a third argument, which is the bookie to replicate to. If this is omitted, as in our example, a random bookie is chosen for each ledger segment. A ledger segment is a continuous sequence of entries in a bookie, which share the same ensemble.
Auto-Recovery has two components:
Both the components run as threads in the the AutoRecoveryMain process. The AutoRecoveryMain process runs on each Bookie in the cluster. All recovery nodes will participate in leader election to decide which node becomes the auditor. Those which fail to become the auditor will watch the elected auditor, and will run election again if they see that it has failed.
The auditor watches the the list of bookies registered with ZooKeeper in the cluster. A Bookie registers with ZooKeeper during startup. If the bookie crashes or is killed, the bookie's registration disappears. The auditor is notified of changes in the registered bookies list.
When the auditor sees that a bookie has disappeared from the list, it immediately scans the complete ledger list to find ledgers which have stored data on the failed bookie. Once it has a list of ledgers which need to be rereplicated, it will publish a rereplication task for each ledger under the /underreplicated/ znode in ZooKeeeper.
Each replication worker watches for tasks being published in the /underreplicated/ znode. When a new task appears, it will try to get a lock on it. If it cannot acquire the lock, it tries the next entry. The locks are implemented using ZooKeeper ephemeral znodes.
The replication worker will scan through the rereplication task's ledger for segments of which its local bookie is not a member. When it finds segments matching this criteria it will replicate the entries of that segment to the local bookie. If, after this process, the ledger is fully replicated, the ledgers entry under /underreplicated/ is deleted, and the lock is released. If there is a problem replicating, or there are still segments in the ledger which are still underreplicated (due to the local bookie already being part of the ensemble for the segment), then the lock is simply released.
If the replication worker finds a segment which needs rereplication, but does not have a defined endpoint (i.e. the final segment of a ledger currently being written to), it will wait for a grace period before attempting rereplication. If the segment needing rereplciation still does not have a defined endpoint, the ledger is fenced and rereplication then takes place. This avoids the case where a client is writing to a ledger, and one of the bookies goes down, but the client has not written an entry to that bookie before rereplication takes place. The client could continue writing to the old segment, even though the ensemble for the segment had changed. This could lead to data loss. Fencing prevents this scenario from happening. In the normal case, the client will try to write to the failed bookie within the grace period, and will have started a new segment before rereplication starts. See the Admin Guide for how to configure this grace period.
The ledger rereplication process is as follows.
The manual bookie recovery process is as follows.