--Real-World Systems — বাস্তব সিস্টেম ডিজাইন

Design Search Engine (Typeahead)

Duration৬০-৯০ মিনিট

LevelAdvanced

FocusSystem Design Case

001Why This System

Search কেন Hard?

Google প্রতিদিন 8.5 billion searches process করে। Typeahead-এ প্রতিটা keypress-এ suggestions দেখাতে হয় — 100ms-এর মধ্যে। এই system-এ দুটো আলাদা challenge: Full-text search এবং real-time autocomplete।

SEARCH ENGINE — High Level Overview

📌 Two Different Problems

Search: "wireless headphones" type করলেন relevant results দেখাও (relevance, ranking)।

Typeahead/Autocomplete: "wire" type করতেই "wireless headphones", "wireless mouse" suggestions দেখাও (speed, prefix matching)।

দুটো problem আলাদা — আলাদা data structure এবং approach দরকার।

002Requirements

Features কী কী?

✅ Functional Requirements

→Full-text search (keyword)
→Typeahead/autocomplete
→Fuzzy matching (typo tolerance)
→Faceted search (filter by category)
→Relevance ranking
→Search suggestions (trending)
→Personalized suggestions

⚡ Non-Functional Requirements

→Typeahead < 100ms latency
→Search results < 500ms
→Billions of documents indexed
→High availability 99.99%
→New content indexed within minutes
→Support 100K+ queries/sec

Back-of-Envelope — Google-এর Scale

8.5BDaily Searches

99KSearches/sec

100B+Indexed Web Pages

15%New queries daily

~20Characters avg query

~5Keystrokes per typeahead

🔢 Typeahead Requests

User "apple" type করলেন: a → ap → app → appl → apple = 5 requests। 99K searches/sec × 5 = ~500K typeahead requests/sec। Typeahead traffic search-এর চেয়ে অনেক বেশি! তাই extreme optimization দরকার।

003Typeahead / Autocomplete

Trie Data Structure

Typeahead-এর জন্য Trie (Prefix Tree) সবচেয়ে natural data structure। প্রতিটা node একটা character represent করে।

TRIE VISUALIZATION — Prefix Tree

trie.py — Typeahead Implementation

class TrieNode:
    def __init__(self):
        self.children = {}
        self.is_end = False
        self.frequency = 0  # কতবার search হয়েছে

class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, word: str, freq: int = 1):
        node = self.root
        for char in word.lower():
            if char not in node.children:
                node.children[char] = TrieNode()
            node = node.children[char]
        node.is_end = True
        node.frequency = freq

    def autocomplete(self, prefix: str, limit: int = 5) -> list:
        # Prefix node খুঁজে বের করুন
        node = self.root
        for char in prefix.lower():
            if char not in node.children:
                return []  # কোনো match নেই
            node = node.children[char]

        # Prefix node থেকে সব words collect করুন
        results = []
        self._dfs(node, prefix, results)

        # Frequency দিয়ে sort করুন (popular first)
        results.sort(key=lambda x: x[1], reverse=True)
        return [word for word, _ in results[:limit]]

    def _dfs(self, node, current, results):
        if node.is_end:
            results.append((current, node.frequency))
        for char, child in node.children.items():
            self._dfs(child, current + char, results)

# Usage example:
trie = Trie()
trie.insert("apple", freq=1000000)
trie.insert("application", freq=500000)
trie.insert("apply", freq=200000)

print(trie.autocomplete("app"))
# → ["apple", "application", "apply"]  (by frequency)

⚠️ Trie at Scale Problem

Billions of search queries-এর Trie memory-তে রাখা possible না। Solution: Top-K suggestions per node store করুন। প্রতিটা prefix node-এ top 5-10 most popular completions pre-computed রাখুন। DFS করতে হবে না — O(1) lookup।

004Web Crawling Architecture

Web Crawler — Internet কীভাবে Index হয়?

Search engine index তৈরি করতে প্রথমে web থেকে documents collect করতে হয়। এই কাজ করে Web Crawler (বা Spider)। Google প্রতিদিন billions of pages crawl করে।

STEP 01[object Object]

Seed URLs — শুরুর URLs

কিছু popular domains দিয়ে শুরু করুন (wikipedia, news sites)। এগুলো queue-এ add করুন।

STEP 02[object Object]

URL Frontier (Priority Queue)

Crawl করার pending URLs এর queue। Priority দিয়ে important pages আগে crawl করুন। Politeness policy — same domain বারবার hit না করা।

STEP 03[object Object]

HTML Fetcher & Parser

HTTP request করে HTML নামাও। Links extract করুন। Robots.txt respect করুন। Content Document Processor-এ পাঠাও।

STEP 04[object Object]

Duplicate Detection

একই content বা URL আবার crawl করা waste। URL fingerprint (hash) store করুন। Already seen? Skip করুন।

STEP 05[object Object]

Document Store + Indexing Pipeline

Crawled content S3/HDFS-এ store করুন। Indexing pipeline-এ পাঠাও। New links queue-এ add করুন। Repeat ✓

💡 Distributed Crawling

Single crawler billions of pages handle করতে পারবেন না। Distributed crawler cluster use করুন। Consistent hashing দিয়ে URLs different crawler nodes-এ assign করুন। Same domain same node-এ যায় — politeness enforce সহজ।

005Deep Dive — Full-Text Search

Inverted Index — Search-এর Core

Search engine-এর সবচেয়ে important data structure হলো Inverted Index। এটা basically একটা dictionary যেখানে word → document list map করা থাকে।

📌 Inverted Index কীভাবে কাজ করে

Forward index: Document → Words। Inverted index: Word → Documents। "apple" search করলেন directly "apple" key look up করুন → সব relevant documents পানয়া যায়। Sequential scan দরকার নেই।

inverted_index_example.py

# 3টা document আছে:
# Doc 1: "apple iphone review"
# Doc 2: "apple macbook laptop"
# Doc 3: "samsung galaxy review"

# Inverted Index (simplified):
inverted_index = {
    "apple":   [1, 2],       # Doc 1 এবং 2 এ আছে
    "iphone":  [1],          # শুধু Doc 1 এ
    "review":  [1, 3],       # Doc 1 এবং 3 এ
    "macbook": [2],          # শুধু Doc 2 এ
    "samsung": [3],          # শুধু Doc 3 এ
    "galaxy":  [3],          # শুধু Doc 3 এ
}

# Search "apple review" করলেন:
# apple → [1, 2]
# review → [1, 3]
# Intersection → [1] ← Most relevant!

def search(query: str) -> list:
    words = query.lower().split()
    # প্রতিটা word-এর document list নাও
    doc_lists = [set(inverted_index.get(w, [])) for w in words]
    # Intersection করুন (সব word আছে এমন docs)
    result = doc_lists[0].intersection(*doc_lists[1:])
    return sorted(result)

📊 TF-IDF — Relevance Ranking

সব documents-এ word থাকলেই হবে না — কোনটা সবচেয়ে relevant সেটা rank করতে হবে। TF-IDF (Term Frequency × Inverse Document Frequency) এটা calculate করে।

Metric	Formula	মানে কী
TF (Term Frequency)	`word count / total words`	Document-এ word কতবার আছে
IDF (Inverse Doc Freq)	`log(total docs / docs with word)`	Word কতটা rare/unique
TF-IDF Score	`TF × IDF`	Higher = more relevant

💡 Simple Example

"the" শব্দটা সব document-এ থাকে → IDF low → Score low। "Elasticsearch" rare word, শুধু কিছু docs-এ → IDF high। এই doc-এ "Elasticsearch" অনেকবার → TF high → Score very high। তাই "Elasticsearch tutorial" search করলেন এই document আগে আসবেন।

006Search Ranking

Search Ranking Algorithms

TF-IDF দিয়ে basic relevance বোঝা যায়, কিন্তু modern search engines আরও অনেক signal use করে। Google-এর ranking 200+ factors consider করে।

Ranking Signal	কী দেখে	Weight
TF-IDF	Word frequency & rarity	Medium
PageRank	কতটা important sites link করেছেনে	High
Click-through Rate	Users কতবার result click করে	High
Freshness	Content কতটা নতুন	Medium
User Location	Local results prioritize	Context-based
ML Models (BERT)	Query intent বোঝা	Very High

📌 PageRank — Link Graph Analysis

Larry Page এবং Sergey Brin-এর original algorithm। Web pages একটা directed graph। যে page-এ বেশি important sites link করেছেনে, সেটার PageRank বেশি। Iterative calculation — সব pages-এর score calculate হওয়ার পর converge করে। আজও Google-এর core ranking signal।

💡 Modern: Learning to Rank (LTR)

Traditional signals এর বাইরে আজকাল Machine Learning use করা হয়। BERT দিয়ে query intent বোঝো। User behavior (clicks, dwell time) থেকে learn করুন। LambdaMART, RankNet — ML-based ranking models। Google BERT 2019 থেকে use করছে।

007Database Choice

কোন Data কোথায়?

Data	Storage	Why?
Search index (inverted)	Elasticsearch / Lucene	Purpose-built for full-text search
Typeahead Trie	Redis (Sorted Sets) / In-memory	Sub-millisecond prefix lookup
Popular queries	Redis Sorted Set (ZADD score)	Rank by search frequency
Search analytics	Kafka → Hadoop/Spark	Stream processing for trending
Document store	Elasticsearch + S3	Original content + index
Query cache	Redis (TTL 1 hr)	Popular searches cached

💡 Redis Sorted Set for Typeahead

Redis ZADD "queries" <frequency> "apple"। ZRANGEBYSCORE দিয়ে top queries পান। Prefix matching-এর জন্য ZRANGEBYLEX command। Real-time frequency update করা যায়। Production-grade typeahead শুধু Redis দিয়েই possible।

008Scaling Decisions

Search Scale করার উপায়

Strategy

Index Sharding: Billions of documents single index-এ রাখা যাবেন না। Document ID বা alphabetically shard করুন। Query সব shards-এ যায়, results merge হয়।

Strategy

Query Cache: Popular queries (যেমন "iphone 15") result cache করুন Redis-এ। 80% queries repeat হয়। Cache hit হলে index touch করতে হয় না।

Strategy

Typeahead CDN: Popular prefixes ("app", "goo", "you") static JSON file হিসেবে CDN-এ serve করুন। Server hit হয় না।

Trade-off

Index Freshness vs Speed: Real-time indexing করলেন latency বাড়ে। Batch indexing করলেন নতুন content দেখাতে দেরি হয়। Google 15-minute crawl lag normal।

Trade-off

Fuzzy Search Cost: Typo tolerance (fuzzy matching) expensive। Edit distance computation করতে হয়। Solution: Pre-compute common typos, phonetic matching।

Tech Stack

Core Search

ElasticsearchApache Lucene (underlying)Apache SolrJava / Go

Typeahead

Redis Sorted SetsIn-memory TrieCDN (static prefix cache)

Data Pipeline

Apache KafkaApache Spark (batch)Flink (stream processing)Hadoop HDFS

009Interview Tips

Interview-এ কী জিজ্ঞেস করা হয়?

🎯 Q1: Search এবং Typeahead-এর পার্থক্য কী?

Search: Full-text query নিয়ে relevant documents return করে। Inverted index use করে। 500ms পর্যন্ত acceptable।
Typeahead: Prefix দিয়ে suggestions দেয়। Trie / Redis Sorted Set use করে। 100ms limit। দুটো আলাদা system — আলাদা optimize করতে হয়।

🎯 Q2: Elasticsearch vs Regular Database কেন?

Regular DB (Postgres LIKE %word%): Full table scan, slow, no ranking। Billions of rows-এ impractical।
Elasticsearch (Inverted Index): O(1) lookup per word, TF-IDF scoring built-in, distributed sharding, fuzzy search support। Purpose-built tool always wins।

🎯 Q3: Typeahead কীভাবে scale করবেন?

1) In-memory Trie — single server। 2) Redis Sorted Set — distributed, real-time frequency update। 3) Top-K per prefix node — DFS avoid করুন। 4) Popular prefixes CDN-এ cache। 5) Debounce — user থামলে তবেই request (300ms wait)।

🎯 Q4: Typo Tolerance কীভাবে implement করবেন?

Levenshtein distance: দুটো string-এর মধ্যে minimum edit operations। "appel" → "apple" = 1 edit। Elasticsearch fuzziness parameter দিয়ে configure করা যায়। Distance ≤ 2 হলে match। Cost বেশি বলে common typos pre-compute করে cache করুন।

010Lesson Summary

SUMMARY — আজকে যা শিখলাম

Challenge	Solution	Technology
Full-text search	Inverted index	Elasticsearch/Lucene
Autocomplete	Trie + frequency	Redis Sorted Set
Relevance ranking	TF-IDF scoring	Elasticsearch scoring
Scale index	Sharding	Elasticsearch shards
Trending queries	Stream processing	Kafka + Flink
Typo tolerance	Edit distance	Elasticsearch fuzziness

011Knowledge Check

012Assignments

013Practical Lab

Previous Topic

Design Amazon/E-commerce

Next Topic

Design Search Engine (Typeahead)

Search কেন Hard?

📌 Two Different Problems

Features কী কী?

🔢 Typeahead Requests

Trie Data Structure

⚠️ Trie at Scale Problem

Web Crawler — Internet কীভাবে Index হয়?

Seed URLs — শুরুর URLs

URL Frontier (Priority Queue)

HTML Fetcher & Parser

Duplicate Detection

Document Store + Indexing Pipeline

💡 Distributed Crawling

Inverted Index — Search-এর Core

📌 Inverted Index কীভাবে কাজ করে

📊 TF-IDF — Relevance Ranking

💡 Simple Example

Search Ranking Algorithms

📌 PageRank — Link Graph Analysis

💡 Modern: Learning to Rank (LTR)

কোন Data কোথায়?

💡 Redis Sorted Set for Typeahead

Search Scale করার উপায়

Tech Stack

Interview-এ কী জিজ্ঞেস করা হয়?

🎯 Q1: Search এবং Typeahead-এর পার্থক্য কী?

🎯 Q2: Elasticsearch vs Regular Database কেন?

🎯 Q3: Typeahead কীভাবে scale করবেন?

🎯 Q4: Typo Tolerance কীভাবে implement করবেন?

SUMMARY — আজকে যা শিখলাম

Design Amazon/E-commerce

Design Uber/Ride Sharing