Implementing Bloom Filters in Python for Efficient Set Membership Testing • Llego.dev

A Bloom filter is a probabilistic data structure designed for efficiently testing set membership. Bloom filters have wide applications in databases, network routing, and cryptography. In this comprehensive guide, we will learn how Bloom filters work and implement one in Python.

Overview

A Bloom filter is a space-efficient set data structure that enables quickly querying whether an element is a member of a set. It may yield false positives but never false negatives. Bloom filters have superior space and time complexities compared to other set data structures.

The key idea behind a Bloom filter is to allocate a bit array of size m and k independent hash functions mapping elements to random array positions. Initially, all bits are 0. To add an element, feed it to the k hashes and set those positions to 1. To query, check if all k bit positions for that element are 1. If any is 0, the element is definitely not in. If all are 1, it most likely is, but there is a small probability of false positives.

Bloom filters occupy constant space regardless of the elements added since the filter size is fixed beforehand. The false positive rate can be tuned by adjusting m and k. Larger m lowers the false positive probability for a given number of elements n. More hashes k also improves the accuracy.

In Python, we will implement a BloomFilter class supporting add(), contains(), and basic operations. Hashing will rely on the mmh3 murmurhash library. We will also analyze the false positive probability and see examples. This guide assumes basic knowledge of Python and hash tables. The intended audience is software engineers and developers interested in data structures and algorithm implementations in Python.

Applications of Bloom Filters

Bloom filters excel in applications where space is critical and a small false positive probability is acceptable. Some common uses:

Databases: Quickly check if a record exists or not before doing expensive disk lookups.
Networking: Perform IP address whitelisting on routers using compact Bloom filters.
Security: Detect malicious URLs, IP addresses, or spam emails via blacklist Bloom filters.
Caching: Determine if a web page is in cache or not before querying the backend.
Genomics: Test for membership of a genome sequence in a reference dataset of sequences.

The key advantage is the minimal space overhead compared to storing elements directly. This allows fitting large whitelist or blacklist filters even on resource-constrained devices. When false positives are tolerable, Bloom filters provide immense space savings.

How Bloom Filters Work

The core mechanism behind Bloom filters relies on hash functions mapping elements to random bit array positions uniformly. Specifically, a Bloom filter requires:

A bit array of size m initialized to 0.
k independent hash functions, each outputting values between 0 and m-1.

To add an element, feed it to the k hashes to get k array positions. Set those bits to 1.

To check if an item is in the set, again hash it k times to get k array positions. If any of those bits are 0, the item is definitely not in. If all are 1, it most likely is, but there is a small false positive chance also due to other set members hashing to the same bits.

This approach of using multiple hashes drastically lowers the false positive probability compared to a single hash. The more hashes, the lower the probability since an element must hash to all 1 bits to be considered in the set. We will mathematically derive this probability next.

Here is a simple example to illustrate the workflow:

from mmh3 import hash as mmh3_hash

m = 16   # Size of bit array
k = 4    # Number of hash functions

bit_array = [0] * m
hashes = [lambda x, i=i: mmh3_hash(x, i) % m for i in range(k)]

# Add element 'foo'
for i in range(k):
    index = hashes[i]('foo')
    bit_array[index] = 1

# Check if 'bar' exists
if all(bit_array[hashes[i]('bar')] == 1 for i in range(k)):
   print('bar most likely exists')
else:
   print('bar does not exist')

This implements the core functionality in a few lines. Note the correction in the hashes list comprehension to properly bind the value of i. Next, let’s analyze the probability math and tune the parameters.

Probability of False Positives

The chance of a false positive for a query depends on:

m - Size of the bit array
n - Number of elements inserted
k - Number of hash functions

After inserting n elements, the probability that a specific bit is still 0 is (1 - 1/m) to the power kn since each of the n elements has a 1/m chance of flipping that bit to 1.

Therefore, the probability of a false positive is the chance that all k array positions for an item are 1:

False positive probability = (1 - (1 - 1/m)^(kn))^k

This can be minimized by:

Increasing m - More bits lower chance of collisions
Using more hashes k - All positions must be 1 to be a false positive

Some typical values to achieve a low false positive rate:

1% false positive rate: m ≈ 10n and k ≈ 7
0.1% false positive rate: m ≈ 20n and k ≈ 10

In practice, m is often chosen first based on the space available. k is then adjusted to reduce false positives based on the expected n.

Next, we will implement the BloomFilter class and demonstrate these principles.

Implementing a Bloom Filter in Python

We will now implement a BloomFilter class in Python containing:

__init__(): Constructor to initialize the bit array, hash functions, and attributes
add(): Add an element to the filter
contains(): Check if the filter contains an element
bit_array_size(): Return size m of the bit array
number_of_hash_functions(): Get the k hash functions count
elements_added(): Return the number of elements added to the filter
false_positive_probability(): Calculate current false positive probability

Imports and Initialization

We import mmh3 for the hash functions and math for rounding m. The constructor handles:

Allocating the bit array of size m
Generating k hash functions using mmh3
Storing m, k, and number of elements n as attributes

from mmh3 import hash as mmh3_hash
import math

class BloomFilter:

    def __init__(self, expected_items, false_positive_rate):
        """
        Initializes the Bloom filter.

        Args:
            expected_items (int): The expected number of items to be stored in the filter.
            false_positive_rate (float): The desired false positive rate (between 0 and 1).
        """
        self.n = expected_items
        self.p = false_positive_rate
        self.m = int(-(self.n * math.log(self.p)) / (math.log(2)**2))  # Calculate bit array size
        self.k = int((self.m / self.n) * math.log(2)) # Calculate number of hash functions

        self.bit_array = [0] * self.m
        self._hashes = [lambda x, i=i: mmh3_hash(x, i) % self.m for i in range(self.k)]
        self._elements_added = 0  # Initialize count of elements

    def add(self, element):
        """Adds an element to the Bloom filter."""
        for i in range(self.k):
            index = self._hashes[i](element)
            self.bit_array[index] = 1
        self._elements_added += 1

    def contains(self, element):
        """Checks if an element is likely in the Bloom filter."""
        return all(self.bit_array[h(element)] == 1 for h in self._hashes)

    def bit_array_size(self):
        """Returns the size of the bit array (m)."""
        return self.m

    def number_of_hash_functions(self):
        """Returns the number of hash functions (k)."""
        return self.k

    def elements_added(self):
        """Returns the number of elements added to the filter."""
        return self._elements_added

    def false_positive_probability(self):
        """Calculates the theoretical false positive probability."""
        return (1 - math.exp(-self.k * self._elements_added / self.m)) ** self.k

Here, the constructor is updated to calculate the optimal m and k based on the expected number of items and the desired false positive rate. The multiplication by 1.2 is no longer necessary as the size is calculated directly. We also use more descriptive variable names and follow PEP 8 conventions.

Add Element Method

The add() method hashes the input element k times and sets those indices to 1:

    def add(self, element):
        """Adds an element to the Bloom filter."""
        for i in range(self.k):
            index = self._hashes[i](element)
            self.bit_array[index] = 1
        self._elements_added += 1

This inserts the element into the filter as described earlier. The element count is incremented.

Check for Membership

To check if an item exists, we hash it k times and ensure all array positions are 1:

    def contains(self, element):
        """Checks if an element is likely in the Bloom filter."""
        return all(self.bit_array[h(element)] == 1 for h in self._hashes)

This returns True if likely present and False if definitely not.

Helper Functions

Some other useful methods are:

    def bit_array_size(self):
        """Returns the size of the bit array (m)."""
        return self.m

    def number_of_hash_functions(self):
        """Returns the number of hash functions (k)."""
        return self.k

    def elements_added(self):
        """Returns the number of elements added to the filter."""
        return self._elements_added

    def false_positive_probability(self):
        """Calculates the theoretical false positive probability."""
        return (1 - math.exp(-self.k * self._elements_added / self.m)) ** self.k

bit_array_size() returns m
number_of_hash_functions() returns k
elements_added() returns the number of elements _elements_added
false_positive_probability() calculates the current theoretical false positive rate using a more precise formula.

This completes the BloomFilter implementation. Let’s see an example of adding elements and testing membership.

Bloom Filter Usage Example

Here is an example to create a Bloom filter expecting 500 items with a 1% false positive rate. We add some names and test if they exist.

from mmh3 import hash as mmh3_hash
import math

class BloomFilter:
    def __init__(self, expected_items, false_positive_rate):
        self.n = expected_items
        self.p = false_positive_rate
        self.m = int(-(self.n * math.log(self.p)) / (math.log(2)**2))
        self.k = int((self.m / self.n) * math.log(2))
        self.bit_array = [0] * self.m
        self._hashes = [lambda x, i=i: mmh3_hash(x, i) % self.m for i in range(self.k)]
        self._elements_added = 0

    def add(self, element):
        for i in range(self.k):
            index = self._hashes[i](element)
            self.bit_array[index] = 1
        self._elements_added += 1

    def contains(self, element):
        return all(self.bit_array[h(element)] == 1 for h in self._hashes)

    def bit_array_size(self):
        return self.m

    def number_of_hash_functions(self):
        return self.k

    def elements_added(self):
        return self._elements_added

    def false_positive_probability(self):
        return (1 - math.exp(-self.k * self._elements_added / self.m)) ** self.k

filter = BloomFilter(500, 0.01)

filter.add("John")
filter.add("Mark")
filter.add("Raj")

print(filter.contains("John"))
print(filter.contains("Albert"))

The calculated false positive probability should be close to 1%. Let’s verify it by testing 1000 random names:

from random import randint

fp_count = 0
test_count = 1000
for _ in range(test_count):
    name = str(randint(0, 100000))
    if filter.contains(name):
        fp_count += 1

print(f"Observed false positive rate: {fp_count / test_count:.4f}")

We can also precisely calculate the theoretical false positive rate:

print(f"Theoretical false positive probability: {filter.false_positive_probability():.4f}")

Tuning expected_items and false_positive_rate gives full control over the accuracy vs space tradeoff. This demonstrates the core functionality and probability analysis of Bloom filters.

Applying Bloom Filters in Databases

A major application of Bloom filters is efficiently checking if database records exist before querying the disk. This reduces expensive disk lookups for non-existent elements.

For instance, consider a database storing user profiles. Instead of directly reading from disk, we first check if the user ID exists in a Bloom filter:

from mmh3 import hash as mmh3_hash
import math

class BloomFilter:
    def __init__(self, expected_items, false_positive_rate):
        self.n = expected_items
        self.p = false_positive_rate
        self.m = int(-(self.n * math.log(self.p)) / (math.log(2)**2))
        self.k = int((self.m / self.n) * math.log(2))
        self.bit_array = [0] * self.m
        self._hashes = [lambda x, i=i: mmh3_hash(x, i) % self.m for i in range(self.k)]
        self._elements_added = 0

    def add(self, element):
        for i in range(self.k):
            index = self._hashes[i](element)
            self.bit_array[index] = 1
        self._elements_added += 1

    def contains(self, element):
        return all(self.bit_array[h(element)] == 1 for h in self._hashes)

    def bit_array_size(self):
        return self.m

    def number_of_hash_functions(self):
        return self.k

    def elements_added(self):
        return self._elements_added

    def false_positive_probability(self):
        return (1 - math.exp(-self.k * self._elements_added / self.m)) ** self.k

# Assume 'database' is a set or dictionary containing existing user IDs
database = {"user123", "user456", "user789"}
bloomfilter = BloomFilter(len(database), 0.01)

# Add all existing IDs during initialization
for user_id in database:
    bloomfilter.add(user_id)

def get_user(uid):
   if bloomfilter.contains(uid):
      # ID likely exists, do disk fetch (simulated here)
      if uid in database:
          print(f"User ID '{uid}' found in database.")
          return True
      else:
          print(f"False positive! User ID '{uid}' not in database.")
          return None
   else:
      # Definitely does not exist
      print(f"User ID '{uid}' definitely not in database.")
      return None

get_user("user123")
get_user("nonexistent_user")

This prevents unnecessary disk fetches for invalid IDs. The slight false positive chance is acceptable since the disk lookup will ultimately resolve it.

Bloom filters improve caching performance and reduce latency by avoiding disk seeks for missing elements. The space overhead is also relatively small compared to storing actual profile data.

Conclusion

Bloom filters are a probabilistic data structure providing space-efficient set membership testing suitable for large data sets. We implemented a BloomFilter class in Python using murmur hashing and probability analysis.

Key takeaways include:

Bloom filters use multiple hash functions mapping elements to a fixed bit array.
Membership is checked by verifying all mapped positions are 1.
False positives are possible but not false negatives.
The false positive rate can be tuned via expected items and desired false positive rate.
Applications include databases, networking, security, and genomic analysis.
Python implementation only requires bit array manipulation and hashing.

I hope this guide provided a comprehensive overview of Bloom filter theory, Python implementation, probability analysis, and practical use cases. Bloom filters are a versatile technique for optimizing set membership tests across many domains.