Class BloomFilterBuilder

java.lang.Object
org.apache.datasketches.filters.bloomfilter.BloomFilterBuilder

public final class BloomFilterBuilder extends Object
This class provides methods to help estimate the correct parameters when creating a Bloom filter, and methods to create the filter using those values.

The underlying math is described in the Wikipedia article on Bloom filters.

  • Constructor Summary

    Constructors
    Constructor
    Description
    No argument constructor.
  • Method Summary

    Modifier and Type
    Method
    Description
    createByAccuracy(long maxDistinctItems, double targetFalsePositiveProb)
    Creates a new BloomFilter with an optimal number of bits and hash functions for the given inputs, using a random base seed for the hash function.
    createByAccuracy(long maxDistinctItems, double targetFalsePositiveProb, long seed)
    Creates a new BloomFilter with an optimal number of bits and hash functions for the given inputs, using the provided base seed for the hash function.
    createBySize(long numBits, int numHashes)
    Creates a BloomFilter with given number of bits and number of hash functions, using a random base seed for the hash function.
    createBySize(long numBits, int numHashes, long seed)
    Creates a BloomFilter with given number of bits and number of hash functions, using the provided base seed for the hash function.
    static long
    Returns the minimum MemorySegment size, in bytes, needed for a serialized BloomFilter with the given number of bits.
    static long
    getSerializedFilterSizeByAccuracy(long maxDistinctItems, double targetFalsePositiveProb)
    Returns the minimum MemorySegment size, in bytes, needed for a serialized BloomFilter with an optimal number of bits and hash functions for the given inputs.
    initializeByAccuracy(long maxDistinctItems, double targetFalsePositiveProb, long seed, MemorySegment dstSeg)
    Creates a new BloomFilter with an optimal number of bits and hash functions for the given inputs, using the provided base seed for the hash function and writing into the provided MemorySegment.
    initializeByAccuracy(long maxDistinctItems, double targetFalsePositiveProb, MemorySegment dstSeg)
    Creates a new BloomFilter with an optimal number of bits and hash functions for the given inputs, using a random base seed for the hash function and writing into the provided MemorySegment.
    initializeBySize(long numBits, int numHashes, long seed, MemorySegment dstSeg)
    Initializes a BloomFilter with given number of bits and number of hash functions, using the provided base seed for the hash function and writing into the provided MemorySegment.
    initializeBySize(long numBits, int numHashes, MemorySegment dstSeg)
    Initializes a BloomFilter with given number of bits and number of hash functions, using a random base seed for the hash function and writing into the provided MemorySegment.
    static long
    suggestNumFilterBits(long maxDistinctItems, double targetFalsePositiveProb)
    Returns the optimal number of bits to use in a Bloom Filter given a target number of distinct items and a target false positive probability.
    static short
    suggestNumHashes(double targetFalsePositiveProb)
    Returns the optimal number of hash functions to achieve a target false positive probability.
    static short
    suggestNumHashes(long maxDistinctItems, long numFilterBits)
    Returns the optimal number of hash functions to given target numbers of distinct items and the BloomFilter size in bits.

    Methods inherited from class Object

    equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
  • Constructor Details

    • BloomFilterBuilder

      public BloomFilterBuilder()
      No argument constructor.
  • Method Details

    • suggestNumHashes

      public static short suggestNumHashes(long maxDistinctItems, long numFilterBits)
      Returns the optimal number of hash functions to given target numbers of distinct items and the BloomFilter size in bits. This function will provide a result even if the input values exceed the capacity of a single BloomFilter.
      Parameters:
      maxDistinctItems - The maximum expected number of distinct items to add to the filter
      numFilterBits - The intended size of the Bloom Filter in bits
      Returns:
      The suggested number of hash functions to use with the filter
    • suggestNumHashes

      public static short suggestNumHashes(double targetFalsePositiveProb)
      Returns the optimal number of hash functions to achieve a target false positive probability.
      Parameters:
      targetFalsePositiveProb - A desired false positive probability per item
      Returns:
      The suggested number of hash functions to use with the filter.
    • suggestNumFilterBits

      public static long suggestNumFilterBits(long maxDistinctItems, double targetFalsePositiveProb)
      Returns the optimal number of bits to use in a Bloom Filter given a target number of distinct items and a target false positive probability.
      Parameters:
      maxDistinctItems - The maximum expected number of distinct items to add to the filter
      targetFalsePositiveProb - A desired false positive probability per item
      Returns:
      The suggested number of bits to use with the filter
    • getSerializedFilterSizeByAccuracy

      public static long getSerializedFilterSizeByAccuracy(long maxDistinctItems, double targetFalsePositiveProb)
      Returns the minimum MemorySegment size, in bytes, needed for a serialized BloomFilter with an optimal number of bits and hash functions for the given inputs. This is also the minimum size of a MemorySegment for in-place filter initialization.
      Parameters:
      maxDistinctItems - The maximum expected number of distinct items to add to the filter
      targetFalsePositiveProb - A desired false positive probability per item
      Returns:
      The size, in bytes, required to hold the specified BloomFilter when serialized
    • getSerializedFilterSize

      public static long getSerializedFilterSize(long numBits)
      Returns the minimum MemorySegment size, in bytes, needed for a serialized BloomFilter with the given number of bits. This is also the minimum size of a MemorySegment for in-place filter initialization.
      Parameters:
      numBits - The number of bits in the target BloomFilter's bit array.
      Returns:
      The size, in bytes, required to hold the specified BloomFilter when serialized
    • createByAccuracy

      public static BloomFilter createByAccuracy(long maxDistinctItems, double targetFalsePositiveProb)
      Creates a new BloomFilter with an optimal number of bits and hash functions for the given inputs, using a random base seed for the hash function.
      Parameters:
      maxDistinctItems - The maximum expected number of distinct items to add to the filter
      targetFalsePositiveProb - A desired false positive probability per item
      Returns:
      A new BloomFilter configured for the given input parameters
    • createByAccuracy

      public static BloomFilter createByAccuracy(long maxDistinctItems, double targetFalsePositiveProb, long seed)
      Creates a new BloomFilter with an optimal number of bits and hash functions for the given inputs, using the provided base seed for the hash function.
      Parameters:
      maxDistinctItems - The maximum expected number of distinct items to add to the filter
      targetFalsePositiveProb - A desired false positive probability per item
      seed - A base hash seed
      Returns:
      A new BloomFilter configured for the given input parameters
    • createBySize

      public static BloomFilter createBySize(long numBits, int numHashes)
      Creates a BloomFilter with given number of bits and number of hash functions, using a random base seed for the hash function.
      Parameters:
      numBits - The size of the BloomFilter, in bits
      numHashes - The number of hash functions to apply to items
      Returns:
      A new BloomFilter configured for the given input parameters
    • createBySize

      public static BloomFilter createBySize(long numBits, int numHashes, long seed)
      Creates a BloomFilter with given number of bits and number of hash functions, using the provided base seed for the hash function.
      Parameters:
      numBits - The size of the BloomFilter, in bits
      numHashes - The number of hash functions to apply to items
      seed - A base hash seed
      Returns:
      A new BloomFilter configured for the given input parameters
    • initializeByAccuracy

      public static BloomFilter initializeByAccuracy(long maxDistinctItems, double targetFalsePositiveProb, MemorySegment dstSeg)
      Creates a new BloomFilter with an optimal number of bits and hash functions for the given inputs, using a random base seed for the hash function and writing into the provided MemorySegment.
      Parameters:
      maxDistinctItems - The maximum expected number of distinct items to add to the filter
      targetFalsePositiveProb - A desired false positive probability per item
      dstSeg - A MemorySegment to hold the initialized filter
      Returns:
      A new BloomFilter configured for the given input parameters
    • initializeByAccuracy

      public static BloomFilter initializeByAccuracy(long maxDistinctItems, double targetFalsePositiveProb, long seed, MemorySegment dstSeg)
      Creates a new BloomFilter with an optimal number of bits and hash functions for the given inputs, using the provided base seed for the hash function and writing into the provided MemorySegment.
      Parameters:
      maxDistinctItems - The maximum expected number of distinct items to add to the filter
      targetFalsePositiveProb - A desired false positive probability per item
      seed - A base hash seed
      dstSeg - A MemorySegment to hold the initialized filter
      Returns:
      A new BloomFilter configured for the given input parameters
    • initializeBySize

      public static BloomFilter initializeBySize(long numBits, int numHashes, MemorySegment dstSeg)
      Initializes a BloomFilter with given number of bits and number of hash functions, using a random base seed for the hash function and writing into the provided MemorySegment.
      Parameters:
      numBits - The size of the BloomFilter, in bits
      numHashes - The number of hash functions to apply to items
      dstSeg - A MemorySegment to hold the initialized filter
      Returns:
      A new BloomFilter configured for the given input parameters
    • initializeBySize

      public static BloomFilter initializeBySize(long numBits, int numHashes, long seed, MemorySegment dstSeg)
      Initializes a BloomFilter with given number of bits and number of hash functions, using the provided base seed for the hash function and writing into the provided MemorySegment.
      Parameters:
      numBits - The size of the BloomFilter, in bits
      numHashes - The number of hash functions to apply to items
      seed - A base hash seed
      dstSeg - A MemorySegment to hold the initialized filter
      Returns:
      A new BloomFilter configured for the given input parameters