Advanced Data Structures (Bloom Filter, HyperLogLog, Count-Min Sketch)
কেন এই 'Weird' Data Structures দরকার?
কল্পনা করুন ১ বিলিয়ন URL-এর একটা set রাখতে হবে। Standard HashSet use করলেন ৫০+ GB memory লাগবে। একটা server-এ এটা রাখা impossible। কিন্তু যদি ৯৯%+ accuracy তে কাজ চলে — তাহলে মাত্র ১ MB-তে সব কিছু রাখা সম্ভব।
💾 Memory Problem
1 Billion URLs × 50 bytes = 50 GB। কোনো single server-এ এটা রাখা possible না। Distributed হলেও network overhead massive।
HashSet: 50 GB⚡ Speed Problem
HashSet-এ lookup করতে হলে hash calculate করতে হয়, memory access করতে হয়। Billion-item set-এ cache miss হলে 100ns+ latency।
Cache miss rate: 90%+📈 Scale Problem
Web crawlers প্রতি second লাখো URL check করে। Exact membership test এই throughput-এ scalable না।
Throughput: 1M checks/sec needed⚖️ The Trade-off
Approximate answer accept করলেন — ৯৯%+ accurate হলে production-এ sufficient। False positive rate controllable।
Bloom Filter: 1 MB (50,000x savings!)Memory Comparison — 1 Billion URLs
📌 Core Insight
এই structures exact answer দেয় না — approximate answer দেয়। কিন্তু সেই approximate answer production এ sufficient। Google Chrome, Cassandra, Redis, LinkedIn — সবাই এই structures production-এ use করে। "আমি ৯৯% confident এই URL আগে দেখিনি" — এটাই web crawling-এর জন্য যথেষ্ট।
Bloom Filter — 'আছে কিনা' দ্রুত বলুন
Bloom Filter একটা probabilistic data structure যা বলতে পারে: "এই element definitely নেই" অথবা "এই element probably আছে"। কখনো false negative দেয় না।
Bloom Filter — Bit Array + Hash Functions
class BloomFilter {
constructor(size = 1000, hashCount = 3) {
this.size = size;
this.hashCount = hashCount;
this.bitArray = new Uint8Array(size); // All zeros initially
}
// Multiple hash functions using different seeds
_hash(item, seed) {
let hash = seed;
for (let i = 0; i < item.length; i++) {
hash = (hash * 31 + item.charCodeAt(i)) % this.size;
}
return Math.abs(hash);
}
// Add an element (set bits)
add(item) {
for (let i = 0; i < this.hashCount; i++) {
const pos = this._hash(item, i * 0xdeadbeef);
this.bitArray[pos] = 1;
}
}
// Check if element might exist
mightContain(item) {
for (let i = 0; i < this.hashCount; i++) {
const pos = this._hash(item, i * 0xdeadbeef);
if (this.bitArray[pos] === 0) {
return false; // Definitely NOT in set (100% certain)
}
}
return true; // Probably in set (might be false positive)
}
// Memory usage in bytes
memoryBytes() {
return this.bitArray.byteLength;
}
}
// Real-world use: Username availability check
const usernameFilter = new BloomFilter(100_000, 5);
// Pre-populate with existing usernames
['alice', 'bob', 'charlie', 'devripon', 'system_admin'].forEach(u =>
usernameFilter.add(u)
);
// Check new username — avoid DB query if definitely not exists
function checkUsername(username) {
if (!usernameFilter.mightContain(username)) {
// Definitely available — no DB query needed!
return { available: true, checkedDB: false };
}
// Might exist — check DB to confirm (avoids false negative)
const existsInDB = queryDatabase(username); // actual DB check
return { available: !existsInDB, checkedDB: true };
}
// Example:
console.log(checkUsername('xyzabc123')); // { available: true, checkedDB: false }
console.log(checkUsername('alice')); // { available: false, checkedDB: true }
// Stats: 1 billion URLs → HashSet ~50 GB vs Bloom Filter ~1 MB (50,000x savings)
console.log(`Memory used: ${usernameFilter.memoryBytes()} bytes`);| Property | Value | Notes |
|---|---|---|
| False negative rate | 0% (never) | If Bloom says "no" → definitely not in set |
| False positive rate | ~1% (configurable) | Bloom says "yes" but not actually in set |
| Memory (1B items) | ~1 MB | vs HashSet ~50 GB → 50,000x savings |
| Time complexity | O(k) — k = hash functions | Constant time regardless of set size |
| Supports delete? | No (basic) | Use Counting Bloom Filter for deletion |
| Use case | Membership test | Google Chrome malicious URL check |
⚠️ False Positive কিন্তু False Negative কখনো না
Bloom Filter এর false negative হয় না — যদি বলে "নেই" তাহলে সত্যিই নেই। কিন্তু false positive হতে পারে — বলতে পারে "আছে" কিন্তু আসলে নেই। এই property-র কারণে web crawling, cache, username check-এ perfect fit।
False positive rate কমাতে: bit array size বাড়াও অথবা hash function count বাড়াও।
HyperLogLog — Unique Count করুন, Memory তে নয়
YouTube-এ একটা video-তে ১ বিলিয়ন ইউনিক views count করতে হবে। Exact counter রাখলে ১ বিলিয়ন entries। HyperLogLog মাত্র ১২ KB memory তে এটা করতে পারে — ৯৯%+ accuracy সহ।
HyperLogLog — How It Works
Hash করুন
প্রতিটা element hash করুন → uniform binary string পান। "user_123" → "0001101100110..."
Leading Zeros Count করুন
Hash-এর শুরুতে কতটা 0 আছে সেটা track করুন। "0001101..." → 3 leading zeros।
Maximum Track করুন
সর্বোচ্চ leading zeros count রাখুন। Max leading zeros = k হলে → set-এ ≈ 2^k unique elements।
Cardinality Estimate করুন
Multiple sub-streams average করুন। Harmonic mean use করে accuracy improve করুন। Error rate: ~0.81%।
# Redis HyperLogLog — Track unique website visitors
# Add visitors (PFADD — Probabilistic Frequency ADD)
PFADD daily_visitors:2026-05-08 "user:1001"
PFADD daily_visitors:2026-05-08 "user:1002"
PFADD daily_visitors:2026-05-08 "user:1001" # duplicate — ignored
PFADD daily_visitors:2026-05-08 "user:1003"
# Count unique visitors (PFCOUNT — Probabilistic Frequency COUNT)
PFCOUNT daily_visitors:2026-05-08
# Returns: 3 (not 4, because user:1001 was added twice)
# Merge multiple HyperLogLogs (e.g., weekly unique visitors)
PFMERGE weekly_visitors:2026-W19 daily_visitors:2026-05-04 daily_visitors:2026-05-05 daily_visitors:2026-05-06 daily_visitors:2026-05-07 daily_visitors:2026-05-08
PFCOUNT weekly_visitors:2026-W19
# Returns: ~unique visitors for the whole week (12 KB memory!)
# ──────────────────────────────────────────────────────────
# Node.js example — YouTube view counter
# ──────────────────────────────────────────────────────────const Redis = require('ioredis');
const redis = new Redis();
// Track unique video view
async function trackVideoView(videoId, userId) {
const key = `video:views:${videoId}`;
// HyperLogLog — O(1) memory regardless of views count
await redis.pfadd(key, userId);
// Get approximate unique view count
const uniqueViews = await redis.pfcount(key);
return uniqueViews;
}
// Get daily active users (DAU)
async function getDailyActiveUsers(date) {
const key = `dau:${date}`;
return redis.pfcount(key);
}
// Get weekly active users by merging daily HLLs
async function getWeeklyActiveUsers(weekDates) {
const destKey = `wau:week`;
const dailyKeys = weekDates.map(d => `dau:${d}`);
await redis.pfmerge(destKey, ...dailyKeys);
return redis.pfcount(destKey);
}
// Usage:
// trackVideoView('v_abc123', 'user_456') → returns unique view count
// Memory: ~12 KB regardless of 1M or 1B unique viewers!
// Real-world stats:
// YouTube: uses HyperLogLog for unique views
// LinkedIn: unique visitor counts per post
// Redis itself: PFADD / PFCOUNT built-in💡 Redis তে Built-in HyperLogLog Support
Redis নিজেই HyperLogLog support করে — PFADD দিয়ে element add করুন, PFCOUNT দিয়ে unique count পান, PFMERGE দিয়ে multiple counters merge করুন। কোনো external library দরকার নেই। Daily active users, unique video views, unique page visitors — সব কিছু Redis দিয়েই করা যায়।
Count-Min Sketch — Frequency Estimate করুন
Twitter-এ trending topics বের করতে হলে প্রতিটা word-এর count রাখতে হবে। কিন্তু ১ কোটি unique word-এর exact counter রাখা অসম্ভব। Count-Min Sketch approximate frequency count করে — fixed memory-তে।
Count-Min Sketch — 2D Counter Array
| Row | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|---|---|---|---|---|---|---|---|---|
| h1(x) | 0 | 1 | 3 | 1 | 0 | 4 | 0 | 2 |
| h2(x) | 2 | 3 | 0 | 1 | 2 | 1 | 4 | 0 |
| h3(x) | 1 | 0 | 2 | 3 | 4 | 0 | 1 | 4 |
Query result = minimum of all hash positions। Overestimate হতে পারে, underestimate হয় না। Conservative update error কমায়।
import hashlib
class CountMinSketch:
def __init__(self, width=1000, depth=5):
"""
width: columns (counter array size)
depth: rows (number of hash functions)
Higher width/depth = more accuracy but more memory
"""
self.width = width
self.depth = depth
self.table = [[0] * width for _ in range(depth)]
def _hash(self, item, seed):
"""Hash item with seed to get column index"""
h = hashlib.md5(f"{seed}:{item}".encode()).hexdigest()
return int(h, 16) % self.width
def increment(self, item, count=1):
"""Add item occurrence"""
for row in range(self.depth):
col = self._hash(item, row)
self.table[row][col] += count
def estimate(self, item):
"""Get estimated frequency — returns minimum across all rows"""
return min(
self.table[row][self._hash(item, row)]
for row in range(self.depth)
)
def memory_bytes(self):
return self.width * self.depth * 4 # 4 bytes per int
# Twitter Trending Topics Example
cms = CountMinSketch(width=10_000, depth=7)
# Simulate tweet stream
tweets = [
"Bangladesh", "Cricket", "Bangladesh", "ICC",
"Bangladesh", "World Cup", "Cricket", "Bangladesh",
"Dhaka", "Bangladesh", "Cricket"
]
for word in tweets:
cms.increment(word)
# Query frequencies
words = ["Bangladesh", "Cricket", "World Cup", "Dhaka", "Football"]
for word in words:
freq = cms.estimate(word)
print(f"{word}: ~{freq} occurrences")
# Bangladesh: ~5
# Cricket: ~3
# World Cup: ~1
# Dhaka: ~1
# Football: ~0 ← not in stream
# Heavy hitter detection (trending)
def get_trending(cms, candidates, threshold=2):
return [w for w in candidates if cms.estimate(w) >= threshold]
print(get_trending(cms, words))
# Output: ['Bangladesh', 'Cricket']
print(f"Memory: {cms.memory_bytes()} bytes vs exact dict: huge")📌 Conservative Update কীভাবে Error কমায়?
Standard update: সব positions increment করুন। Conservative update: শুধু minimum value-র positions increment করুন। এতে overcounting কমে এবং estimate আরো accurate হয়। Twitter-এর trending algorithm, network traffic analysis, এবং database query optimization-এ এই technique use হয়।
তিনটা Structure এর Comparison
| Structure | Memory | Accuracy | Use Case | Supports Delete? | Real Tool |
|---|---|---|---|---|---|
| Bloom Filter | ~1 MB / 1B items | 99%+ (0 false negative) | Membership test | No | Redis Bloom (BF.ADD) |
| HyperLogLog | 12 KB (fixed!) | ~99.19% | Unique count (cardinality) | No | Redis PFCOUNT |
| Count-Min Sketch | Configurable | ~99% | Frequency estimation | Sort of (decay) | Redis CMS.QUERY |
💡 কোনটা কখন Use করবেন?
Bloom Filter: "এটা set-এ আছে কিনা?" — web crawl dedup, username availability, cache miss reduction।
HyperLogLog: "কতটা unique element?" — unique visitors, unique views, DAU/MAU counting।
Count-Min Sketch: "এটা কতবার দেখা গেছে?" — trending topics, heavy hitters, network traffic top-K।
Redis এ Production Use
Redis Stack এই তিনটা probabilistic structure built-in support করে। Production-এ কোনো custom implementation দরকার নেই।
# ═══════════════════════════════════════════════════════
# 1. BLOOM FILTER — BF commands (RedisBloom module)
# ═══════════════════════════════════════════════════════
# Create bloom filter (error rate 0.01 = 1%, capacity 1M items)
BF.RESERVE crawled_urls 0.01 1000000
# Add URLs
BF.ADD crawled_urls "https://google.com"
BF.ADD crawled_urls "https://facebook.com"
# Check if URL was crawled before
BF.EXISTS crawled_urls "https://google.com" # → 1 (probably yes)
BF.EXISTS crawled_urls "https://unknown.com" # → 0 (definitely no)
# Batch add/check
BF.MADD crawled_urls "https://a.com" "https://b.com" "https://c.com"
BF.MEXISTS crawled_urls "https://a.com" "https://xyz.com"
# → [1, 0]
# ═══════════════════════════════════════════════════════
# 2. HYPERLOGLOG — PF commands (built into Redis core)
# ═══════════════════════════════════════════════════════
PFADD unique_visitors:2026-05-08 "user:1001" "user:1002" "user:1003"
PFCOUNT unique_visitors:2026-05-08 # → ~3
# Merge for weekly count
PFMERGE weekly_visitors weekly_visitors:mon weekly_visitors:tue
# ═══════════════════════════════════════════════════════
# 3. COUNT-MIN SKETCH — CMS commands (RedisBloom module)
# ═══════════════════════════════════════════════════════
# Initialize CMS (width=2000, depth=7)
CMS.INITBYDIM trending_words 2000 7
# Increment word counts from tweet stream
CMS.INCRBY trending_words "Bangladesh" 1 "Cricket" 1 "Bangladesh" 1
# Query frequencies
CMS.QUERY trending_words "Bangladesh" "Cricket" "Football"
# → [2, 1, 0]const Redis = require('ioredis');
// Redis Stack connection (includes RedisBloom module)
const redis = new Redis({
host: 'localhost',
port: 6379,
});
// ─── Bloom Filter ─────────────────────────────────────────
async function setupCrawler() {
// Create bloom filter: 1% error rate, 10M capacity
await redis.call('BF.RESERVE', 'crawled', '0.01', '10000000');
async function shouldCrawl(url) {
const exists = await redis.call('BF.EXISTS', 'crawled', url);
if (exists === 0) {
await redis.call('BF.ADD', 'crawled', url);
return true; // Definitely new URL — crawl it
}
return false; // Probably already crawled — skip
}
return { shouldCrawl };
}
// ─── HyperLogLog ──────────────────────────────────────────
async function trackPageView(pageId, userId) {
const key = `page:unique:${pageId}`;
await redis.pfadd(key, userId);
const uniqueViews = await redis.pfcount(key);
return uniqueViews;
}
// ─── Count-Min Sketch ─────────────────────────────────────
async function trackHashtag(hashtag) {
await redis.call('CMS.INCRBY', 'trending', hashtag, '1');
}
async function getHashtagFrequency(hashtag) {
const result = await redis.call('CMS.QUERY', 'trending', hashtag);
return result[0]; // Estimated count
}
// Usage:
// trackPageView('video_abc', 'user_123') → returns unique view count
// trackHashtag('#Bangladesh') → increments
// getHashtagFrequency('#Bangladesh') → returns ~frequency🔴 RedisBloom Module Required
BF.* এবং CMS.* commands-এর জন্য RedisBloom module (Redis Stack-এর part) load করতে হবে। PFADD / PFCOUNT Redis core-এ আছে — extra module দরকার নেই। Production-এ Redis Stack Docker image use করলেন সব automatically available।
Real World: Big Tech কোথায় Use করে?
| Company | Use Case | Structure | Benefit |
|---|---|---|---|
| Web crawl URL deduplication | Bloom Filter | Skip already-crawled URLs without DB query | |
| Unique visitors per post/profile | HyperLogLog | 12 KB per counter vs GB of exact data | |
| Twitter / X | Trending topics (hashtag frequency) | Count-Min Sketch | Real-time top-K without storing every tweet |
| Cassandra | SSTable key existence check | Bloom Filter | Avoid disk read if key definitely not in SSTable |
| Chrome Browser | Malicious URL check (Safe Browsing) | Bloom Filter | Client-side check without sending URL to server |
| Akamai CDN | Cache hot objects identification | Count-Min Sketch | Detect frequently requested content for caching |
| Redis | HyperLogLog (PFADD/PFCOUNT) | HyperLogLog | Built-in DAU / unique views at massive scale |
| Amazon | Recently viewed items deduplication | Bloom Filter | Avoid showing already-seen recommendations |
📌 Cassandra এবং Bloom Filter
Cassandra প্রতিটা SSTable-এর জন্য একটা Bloom Filter maintain করে। Read request আসলে আগে Bloom Filter check করে — "এই key কি এই SSTable-এ আছে?" যদি "definitely not" হয় তাহলে disk read skip করে। এতে unnecessary disk I/O ৬০-৮০% কমে যায়।
Interview Tips — Probabilistic Structures
System design interview-এ এই structures mention করলেন senior-level knowledge দেখা যায়। কিন্তু জানতে হবে কখন mention করতে হবে এবং trade-off কীভাবে explain করতে হবে।
Common Interview Questions
Q: How would you design a web crawler that doesn't revisit URLs?
A: Bloom Filter দিয়ে already-visited URLs track করুন। HashSet-এর ৫০,০০০ গুণ কম memory-তে।
Q: How does YouTube count unique views efficiently?
A: HyperLogLog use করুন। ১ বিলিয়ন unique viewers-ও মাত্র ১২ KB memory-তে count করা যায়।
Q: How would you find trending topics on Twitter in real-time?
A: Count-Min Sketch দিয়ে hashtag frequency track করুন। Fixed memory-তে top-K find করুন।
Q: How does Cassandra optimize read performance?
A: SSTable-এ Bloom Filter use করে unnecessary disk read এড়ায়। ৬০-৮০% I/O reduction।
Q: Design a username availability check at scale
A: Bloom Filter দিয়ে first check করুন। Definitely not taken → no DB query। Probably taken → DB confirm।
💡 Magic Phrase for Interviews
Interview-এ সবসময় বলুন: "If we can tolerate a small error rate, probabilistic data structures give us huge memory and speed gains." তারপর specific structure mention করুন। এই phrase interviewer-কে বুঝিয়ে দেয় যে আপনি trade-off consciously accept করছো — এটাই senior-level thinking।
SUMMARY — আজকে যা শিখলাম
| Structure | Memory | Accuracy | False Neg? | Use Case | Real Tool |
|---|---|---|---|---|---|
| Bloom Filter | ~1 MB / 1B items | 99%+ | Never | Membership test (is URL crawled?) | Redis BF.ADD |
| HyperLogLog | 12 KB (always) | ~99.19% | N/A | Unique count (DAU, unique views) | Redis PFCOUNT |
| Count-Min Sketch | Configurable | ~99% | N/A | Frequency estimate (trending topics) | Redis CMS.QUERY |
| HashSet (exact) | 50 GB / 1B items | 100% | Never | When exactness is required | Java HashSet |
🎉 System Design Mastery সম্পূর্ণ! সমগ্র Course Finished
System Design Mastery সম্পূর্ণ! ৫টা Phase, ২৯টা Topic — আপনি এখন একজন System Design Expert।