Interface EnsemblePlacementPolicy
-
- All Known Subinterfaces:
ITopologyAwareEnsemblePlacementPolicy<T>
- All Known Implementing Classes:
DefaultEnsemblePlacementPolicy
,LocalBookieEnsemblePlacementPolicy
,RackawareEnsemblePlacementPolicy
,RackawareEnsemblePlacementPolicyImpl
,RegionAwareEnsemblePlacementPolicy
,TopologyAwareEnsemblePlacementPolicy
,ZoneawareEnsemblePlacementPolicy
,ZoneawareEnsemblePlacementPolicyImpl
@Public @Evolving public interface EnsemblePlacementPolicy
EnsemblePlacementPolicy
encapsulates the algorithm that bookkeeper client uses to select a number of bookies from the cluster as an ensemble for storing entries.The algorithm is typically implemented based on the data input as well as the network topology properties.
How does it work?
This interface basically covers three parts:
- Initialization and uninitialization
- How to choose bookies to place data
- How to choose bookies to do speculative reads
Initialization and uninitialization
The ensemble placement policy is constructed by jvm reflection during constructing bookkeeper client. After the
EnsemblePlacementPolicy
is constructed, bookkeeper client will callinitialize(ClientConfiguration, Optional, HashedWheelTimer, FeatureProvider, StatsLogger, BookieAddressResolver)
to initialize the placement policy.The
initialize(ClientConfiguration, Optional, HashedWheelTimer, FeatureProvider, StatsLogger, BookieAddressResolver)
method takes a few resources from bookkeeper for instantiating itself. These resources include:- `ClientConfiguration` : The client configuration that used for constructing the bookkeeper client. The implementation of the placement policy could obtain its settings from this configuration.
- `DNSToSwitchMapping`: The DNS resolver for the ensemble policy to build the network topology of the bookies cluster. It is optional.
- `HashedWheelTimer`: A hashed wheel timer that could be used for timing related work. For example, a stabilize network topology could use it to delay network topology changes to reduce impacts of flapping bookie registrations due to zk session expires.
- `FeatureProvider`: A
FeatureProvider
that the policy could use for enabling or disabling its offered features. For example, aRegionAwareEnsemblePlacementPolicy
could offer features to disable placing data to a specific region at runtime. - `StatsLogger`: A
StatsLogger
for exposing stats.
The ensemble placement policy is a single instance per bookkeeper client. The instance will be
uninitalize()
when closing the bookkeeper client. The implementation of a placement policy should be responsible for releasing all the resources that allocated duringinitialize(ClientConfiguration, Optional, HashedWheelTimer, FeatureProvider, StatsLogger, BookieAddressResolver)
.How to choose bookies to place data
The bookkeeper client discovers list of bookies from zookeeper via
BookieWatcher
- whenever there are bookie changes, the ensemble placement policy will be notified with new list of bookies viaonClusterChanged(Set, Set)
. The implementation of the ensemble placement policy will react on those changes to build new network topology. Subsequent operations likenewEnsemble(int, int, int, Map, Set)
orreplaceBookie(int, int, int, java.util.Map, java.util.List, BookieId, java.util.Set)
hence can operate on the new network topology.Both
RackawareEnsemblePlacementPolicy
andRegionAwareEnsemblePlacementPolicy
areTopologyAwareEnsemblePlacementPolicy
s. They build aNetworkTopology
on bookie changes, use it for ensemble placement and ensurerack/region
coverage for write quorums.Network Topology
The network topology is presenting a cluster of bookies in a tree hierarchical structure. For example, a bookie cluster may be consists of many data centers (aka regions) filled with racks of machines. In this tree structure, leaves represent bookies and inner nodes represent switches/routes that manage traffic in/out of regions or racks.
For example, there are 3 bookies in region `A`. They are `bk1`, `bk2` and `bk3`. And their network locations are
/region-a/rack-1/bk1
,/region-a/rack-1/bk2
and/region-a/rack-2/bk3
. So the network topology will look like below:root | region-a / \ rack-1 rack-2 / \ \ bk1 bk2 bk3
Another example, there are 4 bookies spanning in two regions `A` and `B`. They are `bk1`, `bk2`, `bk3` and `bk4`. And their network locations are
/region-a/rack-1/bk1
,/region-a/rack-1/bk2
,/region-b/rack-2/bk3
and/region-b/rack-2/bk4
. The network topology will look like below:root / \ region-a region-b | | rack-1 rack-2 / \ / \ bk1 bk2 bk3 bk4
The network location of each bookie is resolved by a
DNSToSwitchMapping
. TheDNSToSwitchMapping
resolves a list of DNS-names or IP-addresses into a list of network locations. The network location that is returned must be a network path of the form `/region/rack`, where `/` is the root, and `region` is the region id representing the data center where `rack` is located. The network topology of the bookie cluster would determine the number ofRackAware and RegionAware
RackawareEnsemblePlacementPolicy
basically just chooses bookies from different racks in the built network topology. It guarantees that a write quorum will cover at least two racks. It expects the network locations resolved byDNSToSwitchMapping
have at least 2 levels. For example, network location paths like/dc1/rack0
and/dc1/row1/rack0
are okay, but/rack0
is not acceptable.RegionAwareEnsemblePlacementPolicy
is a hierarchical placement policy, which it chooses equal-sized bookies from regions, and within each region it usesRackawareEnsemblePlacementPolicy
to choose bookies from racks. For example, if there is 3 regions -region-a
,region-b
andregion-c
, an application want to allocate a15-bookies
ensemble. First, it would figure out there are 3 regions and it should allocate 5 bookies from each region. Second, for each region, it would useRackawareEnsemblePlacementPolicy
to choose 5 bookies.Since
RegionAwareEnsemblePlacementPolicy
is based onRackawareEnsemblePlacementPolicy
, it expects the network locations resolved byDNSToSwitchMapping
have at least 3 levels.How to choose bookies to do speculative reads?
reorderReadSequence(List, BookiesHealthInfo, WriteSet)
andreorderReadLACSequence(List, BookiesHealthInfo, WriteSet)
are two methods exposed by the placement policy, to help client determine a better read sequence according to the network topology and the bookie failure history.For example, in
RackawareEnsemblePlacementPolicy
, the reads will be attempted in following sequence:- bookies are writable and didn't experience failures before
- bookies are writable and experienced failures before
- bookies are readonly
- bookies already disappeared from network topology
In
RegionAwareEnsemblePlacementPolicy
, the reads will be tried in similar following sequence as `RackAware` placement policy. There is a slight different on trying writable bookies: after trying every 2 bookies from local region, it would try a bookie from remote region. Hence it would achieve low latency even there is network issues within local region.How to configure the placement policy?
Currently there are 3 implementations available by default. They are:
You can configure the ensemble policy by specifying the placement policy class in
ClientConfiguration.setEnsemblePlacementPolicy(Class)
.DefaultEnsemblePlacementPolicy
randomly pickups bookies from the cluster, while bothRackawareEnsemblePlacementPolicy
andRegionAwareEnsemblePlacementPolicy
choose bookies based on network locations. So you might also consider configuring a properDNSToSwitchMapping
inBookKeeper.Builder
to resolve the correct network locations for your cluster.
-
-
Nested Class Summary
Nested Classes Modifier and Type Interface Description static class
EnsemblePlacementPolicy.PlacementPolicyAdherence
enum for PlacementPolicyAdherence.static class
EnsemblePlacementPolicy.PlacementResult<T>
Result of a placement calculation against a placement policy.
-
Method Summary
All Methods Instance Methods Abstract Methods Default Methods Modifier and Type Method Description default boolean
areAckedBookiesAdheringToPlacementPolicy(java.util.Set<BookieId> ackedBookies, int writeQuorumSize, int ackQuorumSize)
Returns true if the bookies that have acknowledged a write adhere to the minimum fault domains as defined in the placement policy in use.default int
getStickyReadBookieIndex(LedgerMetadata metadata, java.util.Optional<java.lang.Integer> currentStickyBookieIndex)
Select one bookie to the "sticky" bookie where all reads for a particular ledger will be directed to.EnsemblePlacementPolicy
initialize(ClientConfiguration conf, java.util.Optional<DNSToSwitchMapping> optionalDnsResolver, io.netty.util.HashedWheelTimer hashedWheelTimer, FeatureProvider featureProvider, StatsLogger statsLogger, BookieAddressResolver bookieAddressResolver)
Initialize the policy.default EnsemblePlacementPolicy.PlacementPolicyAdherence
isEnsembleAdheringToPlacementPolicy(java.util.List<BookieId> ensembleList, int writeQuorumSize, int ackQuorumSize)
returns AdherenceLevel if the Ensemble is strictly/softly/fails adhering to placement policy, like in the case of RackawareEnsemblePlacementPolicy, bookies in the writeset are from 'minNumRacksPerWriteQuorum' number of racks.EnsemblePlacementPolicy.PlacementResult<java.util.List<BookieId>>
newEnsemble(int ensembleSize, int writeQuorumSize, int ackQuorumSize, java.util.Map<java.lang.String,byte[]> customMetadata, java.util.Set<BookieId> excludeBookies)
Choose numBookies bookies for ensemble.java.util.Set<BookieId>
onClusterChanged(java.util.Set<BookieId> writableBookies, java.util.Set<BookieId> readOnlyBookies)
A consistent view of the cluster (what bookies are available as writable, what bookies are available as readonly) is updated when any changes happen in the cluster.void
registerSlowBookie(BookieId bookieSocketAddress, long entryId)
Register a bookie as slow so that it is tried after available and read-only bookies.DistributionSchedule.WriteSet
reorderReadLACSequence(java.util.List<BookieId> ensemble, BookiesHealthInfo bookiesHealthInfo, DistributionSchedule.WriteSet writeSet)
Reorder the read last add confirmed sequence of a given write quorum writeSet.DistributionSchedule.WriteSet
reorderReadSequence(java.util.List<BookieId> ensemble, BookiesHealthInfo bookiesHealthInfo, DistributionSchedule.WriteSet writeSet)
Reorder the read sequence of a given write quorum writeSet.EnsemblePlacementPolicy.PlacementResult<BookieId>
replaceBookie(int ensembleSize, int writeQuorumSize, int ackQuorumSize, java.util.Map<java.lang.String,byte[]> customMetadata, java.util.List<BookieId> currentEnsemble, BookieId bookieToReplace, java.util.Set<BookieId> excludeBookies)
Choose a new bookie to replace bookieToReplace.default EnsemblePlacementPolicy.PlacementResult<java.util.List<BookieId>>
replaceToAdherePlacementPolicy(int ensembleSize, int writeQuorumSize, int ackQuorumSize, java.util.Set<BookieId> excludeBookies, java.util.List<BookieId> currentEnsemble)
Returns placement result.void
uninitalize()
Uninitialize the policy.default void
updateBookieInfo(java.util.Map<BookieId,BookieInfoReader.BookieInfo> bookieInfoMap)
Send the bookie info details.
-
-
-
Method Detail
-
initialize
EnsemblePlacementPolicy initialize(ClientConfiguration conf, java.util.Optional<DNSToSwitchMapping> optionalDnsResolver, io.netty.util.HashedWheelTimer hashedWheelTimer, FeatureProvider featureProvider, StatsLogger statsLogger, BookieAddressResolver bookieAddressResolver)
Initialize the policy.- Parameters:
conf
- client configurationoptionalDnsResolver
- dns resolverhashedWheelTimer
- timerfeatureProvider
- feature providerstatsLogger
- stats logger- Since:
- 4.5
-
uninitalize
void uninitalize()
Uninitialize the policy.
-
onClusterChanged
java.util.Set<BookieId> onClusterChanged(java.util.Set<BookieId> writableBookies, java.util.Set<BookieId> readOnlyBookies)
A consistent view of the cluster (what bookies are available as writable, what bookies are available as readonly) is updated when any changes happen in the cluster.The implementation should take actions when the cluster view is changed. So subsequent
newEnsemble(int, int, int, Map, Set)
andreplaceBookie(int, int, int, java.util.Map, java.util.List, BookieId, java.util.Set)
can choose proper bookies.- Parameters:
writableBookies
- All the bookies in the cluster available for write/read.readOnlyBookies
- All the bookies in the cluster available for readonly.- Returns:
- the dead bookies during this cluster change.
-
newEnsemble
EnsemblePlacementPolicy.PlacementResult<java.util.List<BookieId>> newEnsemble(int ensembleSize, int writeQuorumSize, int ackQuorumSize, java.util.Map<java.lang.String,byte[]> customMetadata, java.util.Set<BookieId> excludeBookies) throws BKException.BKNotEnoughBookiesException
Choose numBookies bookies for ensemble. If the count is more than the number of available nodes,BKException.BKNotEnoughBookiesException
is thrown.The implementation should respect to the replace settings. The size of the returned bookie list should be equal to the provide
ensembleSize
.customMetadata
is the same user defined data that user provides whenBookKeeper.createLedger(int, int, int, BookKeeper.DigestType, byte[], Map)
.If 'enforceMinNumRacksPerWriteQuorum' config is enabled then the bookies belonging to default faultzone (rack) will be excluded while selecting bookies.
- Parameters:
ensembleSize
- Ensemble SizewriteQuorumSize
- Write Quorum SizeackQuorumSize
- the value of ackQuorumSize (added since 4.5)customMetadata
- the value of customMetadata. it is the same user defined metadata that user provides inBookKeeper.createLedger(int, int, int, BookKeeper.DigestType, byte[])
excludeBookies
- Bookies that should not be considered as targets.- Returns:
- a placement result containing list of bookie addresses for the ensemble.
- Throws:
BKException.BKNotEnoughBookiesException
- if not enough bookies available.
-
replaceBookie
EnsemblePlacementPolicy.PlacementResult<BookieId> replaceBookie(int ensembleSize, int writeQuorumSize, int ackQuorumSize, java.util.Map<java.lang.String,byte[]> customMetadata, java.util.List<BookieId> currentEnsemble, BookieId bookieToReplace, java.util.Set<BookieId> excludeBookies) throws BKException.BKNotEnoughBookiesException
Choose a new bookie to replace bookieToReplace. If no bookie available in the cluster,BKException.BKNotEnoughBookiesException
is thrown.If 'enforceMinNumRacksPerWriteQuorum' config is enabled then the bookies belonging to default faultzone (rack) will be excluded while selecting bookies.
- Parameters:
ensembleSize
- the value of ensembleSizewriteQuorumSize
- the value of writeQuorumSizeackQuorumSize
- the value of ackQuorumSize (added since 4.5)customMetadata
- the value of customMetadata. it is the same user defined metadata that user provides inBookKeeper.createLedger(int, int, int, BookKeeper.DigestType, byte[])
currentEnsemble
- the value of currentEnsemblebookieToReplace
- bookie to replaceexcludeBookies
- bookies that should not be considered as candidate.- Returns:
- a placement result containing the new bookie address.
- Throws:
BKException.BKNotEnoughBookiesException
-
registerSlowBookie
void registerSlowBookie(BookieId bookieSocketAddress, long entryId)
Register a bookie as slow so that it is tried after available and read-only bookies.- Parameters:
bookieSocketAddress
- Address of bookie hostentryId
- Entry ID that caused a speculative timeout on the bookie.
-
reorderReadSequence
DistributionSchedule.WriteSet reorderReadSequence(java.util.List<BookieId> ensemble, BookiesHealthInfo bookiesHealthInfo, DistributionSchedule.WriteSet writeSet)
Reorder the read sequence of a given write quorum writeSet.- Parameters:
ensemble
- Ensemble to read entries.bookiesHealthInfo
- Health info for bookieswriteSet
- Write quorum to read entries. This will be modified, rather than allocating a new WriteSet.- Returns:
- The read sequence. This will be the same object as the passed in writeSet.
- Since:
- 4.5
-
reorderReadLACSequence
DistributionSchedule.WriteSet reorderReadLACSequence(java.util.List<BookieId> ensemble, BookiesHealthInfo bookiesHealthInfo, DistributionSchedule.WriteSet writeSet)
Reorder the read last add confirmed sequence of a given write quorum writeSet.- Parameters:
ensemble
- Ensemble to read entries.bookiesHealthInfo
- Health info for bookieswriteSet
- Write quorum to read entries. This will be modified, rather than allocating a new WriteSet.- Returns:
- The read sequence. This will be the same object as the passed in writeSet.
- Since:
- 4.5
-
updateBookieInfo
default void updateBookieInfo(java.util.Map<BookieId,BookieInfoReader.BookieInfo> bookieInfoMap)
Send the bookie info details.- Parameters:
bookieInfoMap
- A map that has the bookie to BookieInfo- Since:
- 4.5
-
getStickyReadBookieIndex
default int getStickyReadBookieIndex(LedgerMetadata metadata, java.util.Optional<java.lang.Integer> currentStickyBookieIndex)
Select one bookie to the "sticky" bookie where all reads for a particular ledger will be directed to.The default implementation will pick a bookie randomly from the ensemble. Other placement policies will be able to do better decisions based on additional information (eg: rack or region awareness).
- Parameters:
metadata
- theLedgerMetadata
objectcurrentStickyBookieIndex
- if we are changing the sticky bookie after a read failure, the current sticky bookie is passed in so that we will avoid choosing it again- Returns:
- the index, within the ensemble of the bookie chosen as the sticky bookie
- Since:
- 4.9
-
isEnsembleAdheringToPlacementPolicy
default EnsemblePlacementPolicy.PlacementPolicyAdherence isEnsembleAdheringToPlacementPolicy(java.util.List<BookieId> ensembleList, int writeQuorumSize, int ackQuorumSize)
returns AdherenceLevel if the Ensemble is strictly/softly/fails adhering to placement policy, like in the case of RackawareEnsemblePlacementPolicy, bookies in the writeset are from 'minNumRacksPerWriteQuorum' number of racks. And in the case of RegionawareEnsemblePlacementPolicy, check for minimumRegionsForDurability, reppRegionsToWrite, rack distribution within a region and other parameters of RegionAwareEnsemblePlacementPolicy. In ZoneAwareEnsemblePlacementPolicy if bookies in the writeset are from 'desiredNumOfZones' then it is considered as MEETS_STRICT if they are from 'minNumOfZones' then it is considered as MEETS_SOFT otherwise considered as FAIL.- Parameters:
ensembleList
- list of BookieId of bookies in the ensemblewriteQuorumSize
- writeQuorumSize of the ensembleackQuorumSize
- ackQuorumSize of the ensemble- Returns:
-
areAckedBookiesAdheringToPlacementPolicy
default boolean areAckedBookiesAdheringToPlacementPolicy(java.util.Set<BookieId> ackedBookies, int writeQuorumSize, int ackQuorumSize)
Returns true if the bookies that have acknowledged a write adhere to the minimum fault domains as defined in the placement policy in use. Ex: In the case of RackawareEnsemblePlacementPolicy, bookies belong to at least 'minNumRacksPerWriteQuorum' number of racks.- Parameters:
ackedBookies
- list of BookieId of bookies that have acknowledged a write.writeQuorumSize
- writeQuorumSize of the ensembleackQuorumSize
- ackQuorumSize of the ensemble- Returns:
-
replaceToAdherePlacementPolicy
default EnsemblePlacementPolicy.PlacementResult<java.util.List<BookieId>> replaceToAdherePlacementPolicy(int ensembleSize, int writeQuorumSize, int ackQuorumSize, java.util.Set<BookieId> excludeBookies, java.util.List<BookieId> currentEnsemble)
Returns placement result. If the currentEnsemble is not adhering placement policy, returns new ensemble that adheres placement policy. It should be implemented so as to minify the number of bookies replaced.- Parameters:
ensembleSize
- ensemble sizewriteQuorumSize
- writeQuorumSize of the ensembleackQuorumSize
- ackQuorumSize of the ensembleexcludeBookies
- bookies that should not be considered as targetscurrentEnsemble
- current ensemble- Returns:
- a placement result
-
-