System Design Mastery
Real-World Systems (real-world system design)

Design WhatsApp/Chat System

Duration: 60-90 minutes
Level: Intermediate
Focus: System Design Case
001 Why This System

Unique Challenges of a Chat System

WhatsApp carries about 100 billion messages every day. The hardest problem in this system is real-time delivery: the receiver should get a message as soon as it is sent, and a receiver who is offline should still get it later.

WHATSAPP — Message Flow Overview

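[Diagram: User A (online) holds a WebSocket (WS) connection to Chat Server 1 (Erlang/Go). Messages flow through a message queue (Kafka / RabbitMQ) to Chat Server 2 (Erlang/Go), which holds User B's WS connection and pushes the message to User B (online). Messages are persisted in the message DB (Cassandra). Presence (online status) lives in Redis, and connection mapping in Zookeeper/Redis. User C is offline, so the message is stored in the DB and fetched when User C comes online.]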

📌 Core Challenge

Push vs Pull: The traditional HTTP request-response model is a poor fit for real-time messaging. The receiver cannot keep asking "is there a new message?" all the time. The server itself has to push. That is why WebSocket is needed.

002 Requirements

What are the features?

Functional Requirements

  • 1-to-1 messaging
  • Group messaging (max 1000 members)
  • Online/offline status
  • Message delivery status (✓ ✓✓ 🔵)
  • Media sharing (image, video, file)
  • End-to-end encryption
  • Last seen timestamp
  • Message history

Non-Functional Requirements

  • 2 billion+ users
  • Message delivery < 500ms
  • No message loss ever
  • Offline message delivery (when the user comes back online)
  • High availability 99.99%
  • End-to-end encrypted
003 Back-of-Envelope Estimation

WhatsApp Scale

Total Users: 2B
Messages/Day: 100B
Messages/sec: 1.15M
Daily Active Users: 500M
WebSocket Connections: ~100M
Avg Message Size: ~1KB

Storage Estimation

Daily storage: 100B messages × 1KB = 100TB/day

5-year storage: 100TB × 365 × 5 = ~182 PB

Write throughput: 1.15M msg/sec × 1KB = 1.15 GB/sec
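
The arithmetic above can be reproduced in a few lines. This is a minimal sketch that takes the lesson's own figures as inputs (100B messages/day, about 1 KB per message, 5 years) and assumes decimal units (1 TB = 10^12 bytes); the function and key names are illustrative only.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def estimate(messages_per_day: int, avg_message_bytes: int, years: int) -> dict:
    """Back-of-envelope storage and throughput figures (decimal units)."""
    daily_bytes = messages_per_day * avg_message_bytes
    messages_per_sec = messages_per_day / SECONDS_PER_DAY
    return {
        "messages_per_sec": messages_per_sec,
        "daily_tb": daily_bytes / 1e12,
        "total_pb": daily_bytes * 365 * years / 1e15,
        "write_gb_per_sec": messages_per_sec * avg_message_bytes / 1e9,
    }


result = estimate(messages_per_day=100_000_000_000, avg_message_bytes=1_000, years=5)
for name, value in result.items():
    print(f"{name}: {value:,.2f}")
```

The exact rate works out to roughly 1.157M messages/sec, which the lesson rounds to 1.15M. These numbers ignore replication, indexes, and media, so real storage would be higher.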

🔢 Interesting Fact

In 2014 WhatsApp served 450M users with only about 32 engineers (roughly 50 employees in total). It used the Erlang language, which is very well suited to handling huge numbers of concurrent connections. Facebook then acquired it for $19 billion.

004 High Level Architecture

Chat System Architecture

The core of WhatsApp's architecture is the Chat Servers, which maintain the users' WebSocket connections. Users are connected to different servers, so Kafka is used for cross-server routing.

WHATSAPP — Full System Architecture

Client Layer

Mobile App (iOS/Android)
Desktop App
Web Client

WebSocket persistent connection

Server Layer

Load Balancer (HAProxy)
Chat Servers (Erlang)
API Gateway
Kafka (Message Queue)
Zookeeper (Service Discovery)

Storage Layer

Cassandra (Messages)
MySQL (User Profiles)
Redis (Presence + Routing)
S3 + CDN (Media)
Elasticsearch (Search)
Step 1: User A connects via WebSocket

When the app is opened, a persistent WebSocket connection to a Chat Server is established. The user_id → server_id mapping is saved in Redis.

Step 2: The message is sent

User A sends a message. Chat Server 1 receives it and persists it to Cassandra (to ensure durability).

Step 3: Cross-server routing (if needed)

Chat Server 1 checks Redis to find out which server User B is connected to. If it is the same server, it pushes the message directly. If it is a different server, it routes the message through Kafka.

Step 4: Deliver to User B

Chat Server 2 consumes the message from Kafka and pushes it over User B's WebSocket. A double tick (✓✓) is then sent back to the sender.

Step 5: Offline handling

If User C is offline, the message stays stored in Cassandra. When User C comes online, the pending messages are fetched and pushed.
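
The offline path in Step 5 can be sketched as a per-user pending queue that is flushed on reconnect. This is a toy, in-memory stand-in for the Cassandra-backed store described above; `PendingStore`, `deliver`, and `on_reconnect` are hypothetical names, and unlike the real flow (which persists every message first) the sketch only stores messages that could not be pushed.

```python
from collections import defaultdict, deque


class PendingStore:
    """In-memory stand-in for persisted, not-yet-delivered messages."""

    def __init__(self):
        self._pending = defaultdict(deque)  # receiver_id -> queued messages

    def save(self, receiver_id, message):
        self._pending[receiver_id].append(message)

    def drain(self, receiver_id):
        """Return queued messages in send order and clear the queue."""
        return list(self._pending.pop(receiver_id, deque()))


def deliver(store, online_users, receiver_id, message, push):
    """Push right away if the receiver is online, otherwise keep it pending."""
    if receiver_id in online_users:
        push(receiver_id, message)
    else:
        store.save(receiver_id, message)


def on_reconnect(store, online_users, user_id, push):
    """Mark the user online and flush everything that arrived while offline."""
    online_users.add(user_id)
    for message in store.drain(user_id):
        push(user_id, message)


delivered = []


def push(user_id, message):
    delivered.append((user_id, message["text"]))


store, online = PendingStore(), {"user_b"}
deliver(store, online, "user_b", {"text": "hi B"}, push)          # pushed now
deliver(store, online, "user_c", {"text": "hi C"}, push)          # queued
deliver(store, online, "user_c", {"text": "still there?"}, push)  # queued
on_reconnect(store, online, "user_c", push)                       # flushed in order
print(delivered)
```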

005 Deep Dive

WebSocket Connection Management

The HTTP request-response model is not suitable for real-time chat. WebSocket keeps a persistent, bidirectional connection open, so the server can push to the client at any time.

Protocol | Connection | Direction | Latency | Chat Use?
HTTP Polling | New each time | Client → Server only | High (500ms+) | Terrible
Long Polling | Held open | Server can respond | Medium | Okay
Server-Sent Events | Persistent | Server → Client only | Low | One-way only
WebSocket | Persistent | Both directions | Very Low (< 50ms) | ✓ Perfect
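chat_server.py
import asyncio
import json
import time
import uuid

import redis.asyncio as aioredis  # redis-py >= 4.2
import websockets  # server start-up (websockets.serve) is not shown here

# Assumed to be provided elsewhere in the project (not shown in this lesson):
#   authenticate(websocket)        -> returns the authenticated user_id
#   db.save_message(...)           -> persists a message (e.g. in Cassandra)
#   kafka.publish(topic, payload)  -> async producer wrapper
SERVER_ID = "server-1"  # example ID of this chat server
redis = aioredis.Redis(decode_responses=True)  # return str instead of bytes

# Active connections on this server: user_id → websocket
connections = {}


def generate_unique_id():
    # Time-based UUID, in line with the TIMEUUID column used for messages
    return str(uuid.uuid1())


async def handle_connection(websocket, path=None):
    # `path` is only passed by older versions of the websockets library
    user_id = await authenticate(websocket)

    # Register the connection
    connections[user_id] = websocket
    await redis.set(f"user:{user_id}:server", SERVER_ID)
    await redis.set(f"user:{user_id}:online", "true")

    try:
        async for raw_msg in websocket:
            msg = json.loads(raw_msg)
            await send_message(
                sender_id=user_id,
                receiver_id=msg['to'],
                content=msg['content']
            )
    finally:
        # Clean up on disconnect (pop avoids a KeyError if already removed)
        connections.pop(user_id, None)
        await redis.delete(f"user:{user_id}:server")  # drop the stale routing entry
        await redis.set(f"user:{user_id}:online", "false")
        await redis.set(f"user:{user_id}:last_seen", int(time.time()))


async def send_message(sender_id, receiver_id, content):
    msg_id = generate_unique_id()

    # 1. Persist to the DB first, so the message survives a crash
    await db.save_message(msg_id, sender_id, receiver_id, content)

    # 2. Which server holds the receiver? Check Redis
    receiver_server = await redis.get(f"user:{receiver_id}:server")

    if receiver_server is None:
        # Receiver is offline: the message stays in the DB until reconnect
        return

    if receiver_server == SERVER_ID:
        # Receiver is on this server → direct push
        ws = connections.get(receiver_id)
        if ws:
            await ws.send(json.dumps({"msg_id": msg_id, "content": content}))
    else:
        # Different server → route through Kafka
        # (the consumer loads the content from the DB by msg_id)
        await kafka.publish(f"chat:{receiver_server}", {
            "receiver_id": receiver_id, "msg_id": msg_id
        })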

💡 Heartbeat Mechanism

A ping/pong heartbeat is needed to keep the WebSocket connection alive. The client sends a ping every 30 seconds and the server replies with a pong. Each heartbeat refreshes the user's online key in Redis. If no response arrives for 60 seconds, the connection is considered dead and the user is marked "offline".
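
The dead-connection check described above can be sketched as a small monitor that records the last ping per user and reaps anyone silent for more than 60 seconds. This is an illustrative, in-memory sketch: `HeartbeatMonitor` and its methods are hypothetical names, and the injectable `clock` exists only so the behaviour can be demonstrated without waiting.

```python
import time

DEAD_AFTER = 60  # seconds of silence before a connection counts as dead


class HeartbeatMonitor:
    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._last_ping = {}  # user_id -> time of the last ping

    def on_ping(self, user_id):
        """Record the ping and answer with a pong."""
        self._last_ping[user_id] = self._clock()
        return "pong"

    def reap_dead(self):
        """Remove and return users silent for longer than DEAD_AFTER."""
        now = self._clock()
        dead = [u for u, t in self._last_ping.items() if now - t > DEAD_AFTER]
        for user_id in dead:
            del self._last_ping[user_id]
        return dead


# Demo with a fake clock: alice keeps pinging, bob goes silent.
fake_now = [0.0]
monitor = HeartbeatMonitor(clock=lambda: fake_now[0])
monitor.on_ping("alice")
monitor.on_ping("bob")
fake_now[0] = 45.0
monitor.on_ping("alice")
fake_now[0] = 70.0
dead_users = monitor.reap_dead()
print(dead_users)  # ['bob']
```

In a real server the users returned by `reap_dead` would have their sockets closed and their presence marked offline.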

006 Message Storage & Delivery

Message Storage and Delivery Receipts

Message Delivery Status — ✓ ✓✓ 🔵

Status | Meaning | When?
✓ (single tick) | Server received | When the message is saved to the DB
✓✓ (double tick) | Delivered to device | When it reaches the receiver's phone
🔵 (blue tick) | Read | When the receiver opens the message
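
The three states in the table form a one-way progression. Below is a minimal sketch of that idea, under the assumption (not stated in the lesson) that receipts can arrive late or out of order, so a status should only ever move forward; `Status` and `advance` are hypothetical names.

```python
from enum import IntEnum


class Status(IntEnum):
    SENT = 1       # ✓  the server stored the message
    DELIVERED = 2  # ✓✓ the message reached the receiver's device
    READ = 3       # 🔵 the receiver opened the message


def advance(current: Status, event: Status) -> Status:
    """Apply a receipt event; the status never moves backwards."""
    return max(current, event)


status = Status.SENT
status = advance(status, Status.READ)       # read receipt arrives first
status = advance(status, Status.DELIVERED)  # late delivery receipt is ignored
print(status.name)  # READ
```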

Cassandra Message Schema

Messages are time-series data. In Cassandra, use partition key = chat_id and clustering key = message_id DESC, so the latest messages come first and pagination is easy.

cassandra_schema.cql
-- 1-to-1 Messages Table
CREATE TABLE messages (
    chat_id      UUID,
    message_id   TIMEUUID,     -- Time-ordered UUID
    sender_id    UUID,
    receiver_id  UUID,
    content      TEXT,
    msg_type     TEXT,         -- 'text', 'image', 'video'
    status       TEXT,         -- 'sent', 'delivered', 'read'
    created_at   TIMESTAMP,
    PRIMARY KEY (chat_id, message_id)
) WITH CLUSTERING ORDER BY (message_id DESC);
-- Partition by chat_id = the same conversation lives in one partition
-- message_id DESC = newest messages first, so pagination is efficient

-- Group Messages (1 copy per message, not per user)
CREATE TABLE group_messages (
    group_id     UUID,
    message_id   TIMEUUID,
    sender_id    UUID,
    content      TEXT,
    sent_at      TIMESTAMP,
    PRIMARY KEY (group_id, message_id)
) WITH CLUSTERING ORDER BY (message_id DESC);

-- Per-user delivery tracking (NOT message copy)
CREATE TABLE message_receipts (
    message_id   UUID,
    user_id      UUID,
    delivered_at TIMESTAMP,
    read_at      TIMESTAMP,
    PRIMARY KEY (message_id, user_id)
);
-- 1000 member group = 1 msg + 999 receipt rows (not 1000 msg copies)

💡 Message Schema in Cassandra

Partition key = (chat_id), clustering key = (message_id DESC). This way the latest messages come first, pagination is easy, and time-ordered messages are stored efficiently.
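
To see why this key layout makes pagination cheap, here is a toy in-memory model of a single chat partition. It is only a sketch: integer IDs stand in for TIMEUUID values, and `ChatPartition` with its `before` cursor is a hypothetical name. The `page` method plays the role of a query of the shape `WHERE chat_id = ? AND message_id < ? LIMIT n`.

```python
from bisect import bisect_left, insort
from typing import Optional


class ChatPartition:
    """All messages of one chat_id, kept sorted by message_id."""

    def __init__(self):
        self._ids = []   # message_ids in ascending order
        self._rows = {}  # message_id -> content

    def insert(self, message_id: int, content: str) -> None:
        insort(self._ids, message_id)
        self._rows[message_id] = content

    def page(self, limit: int, before: Optional[int] = None) -> list:
        """Return up to `limit` messages older than `before`, newest first."""
        end = len(self._ids) if before is None else bisect_left(self._ids, before)
        newest_first = self._ids[:end][::-1][:limit]
        return [(i, self._rows[i]) for i in newest_first]


chat = ChatPartition()
for i in range(1, 6):
    chat.insert(i, f"m{i}")

first_page = chat.page(limit=2)                              # newest two
second_page = chat.page(limit=2, before=first_page[-1][0])   # next two older
print(first_page, second_page)
```

The client only has to remember the oldest `message_id` it has seen and pass it as the cursor for the next page.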

007 Group Chat & Pub/Sub

Group Chat Architecture and Pub/Sub

⚠️ Group Chat Fanout Problem

In a 1000-member group, one message has to be delivered to 999 users. This is the fan-out problem. Solution: keep a single copy of the message in the message DB and track delivery and read status per user. Do not copy the message for every member; keep a pointer to it (a receipt row keyed by the message ID) instead.

schema.sql — Group Chat
-- Group messages: 1 row per message (not per user)
CREATE TABLE group_messages (
    msg_id       UUID PRIMARY KEY,
    group_id     UUID,
    sender_id    UUID,
    content      TEXT,
    sent_at      TIMESTAMP
);

-- Per-user delivery tracking (not message copy)
CREATE TABLE message_receipts (
    msg_id       UUID,
    user_id      UUID,
    delivered_at TIMESTAMP,
    read_at      TIMESTAMP,
    PRIMARY KEY (msg_id, user_id)
);
-- 1000 member group = 1 msg + 999 receipt rows (not 1000 msg copies)
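
The "one message, many receipts" layout above can be checked with a small in-memory sketch. Plain dictionaries stand in for the two tables, and `store_group_message` and `mark_read` are hypothetical helper names; the timestamp string is an arbitrary example value.

```python
def store_group_message(messages, receipts, msg_id, sender_id, members, content):
    """Store ONE message row plus one receipt row per recipient."""
    messages[msg_id] = {"sender_id": sender_id, "content": content}
    for user_id in members:
        if user_id != sender_id:  # the sender needs no receipt
            receipts[(msg_id, user_id)] = {"delivered_at": None, "read_at": None}


def mark_read(receipts, msg_id, user_id, when):
    """Update only that user's receipt row; the message row is untouched."""
    receipts[(msg_id, user_id)]["read_at"] = when


messages, receipts = {}, {}
members = [f"user_{i}" for i in range(1000)]
store_group_message(messages, receipts, "msg_1", "user_0", members, "hello group")
mark_read(receipts, "msg_1", "user_7", when="2024-01-01T10:00:00Z")
print(len(messages), len(receipts))  # 1 999
```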

Kafka Pub/Sub — Cross-server Message Routing

Each Chat Server has its own Kafka topic. To send from User A (Server 1) to User B (Server 5), Server 1 publishes to the Kafka topic "chat:server-5". Server 5 consumes that topic and pushes the message to User B.

KAFKA — Cross-Server Routing Flow

Chat Server 1 (User A connected) → Kafka Topic "chat:server-5" → Chat Server 5 (User B connected) → User B (Gets message!)
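
The topic-per-server routing can be simulated without a real Kafka cluster. In this sketch `InMemoryBroker` is a toy stand-in for Kafka (one FIFO queue per topic), a shared dictionary stands in for the Redis user → server mapping, and all class and method names are hypothetical.

```python
from collections import defaultdict, deque


class InMemoryBroker:
    """Toy stand-in for Kafka: one FIFO queue per topic."""

    def __init__(self):
        self._topics = defaultdict(deque)

    def publish(self, topic, payload):
        self._topics[topic].append(payload)

    def poll(self, topic):
        queue = self._topics[topic]
        while queue:
            yield queue.popleft()


class ChatServer:
    def __init__(self, server_id, broker, routing):
        self.server_id = server_id
        self.broker = broker
        self.routing = routing            # user_id -> server_id (Redis in the lesson)
        self.inboxes = defaultdict(list)  # messages pushed to local users

    def connect(self, user_id):
        self.routing[user_id] = self.server_id

    def send(self, receiver_id, text):
        target = self.routing.get(receiver_id)
        if target is None:
            return False  # receiver offline: handled by the persisted store
        if target == self.server_id:
            self.inboxes[receiver_id].append(text)  # same server: direct push
        else:
            self.broker.publish(f"chat:{target}", {"receiver_id": receiver_id, "text": text})
        return True

    def consume(self):
        """Drain this server's own topic and push to local users."""
        for payload in self.broker.poll(f"chat:{self.server_id}"):
            self.inboxes[payload["receiver_id"]].append(payload["text"])


broker, routing = InMemoryBroker(), {}
server_1 = ChatServer("server-1", broker, routing)
server_5 = ChatServer("server-5", broker, routing)
server_1.connect("user_a")
server_5.connect("user_b")

server_1.send("user_b", "hello from A")  # published to "chat:server-5"
server_5.consume()                       # Server 5 pushes to User B
print(server_5.inboxes["user_b"])
```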

008 Advanced Features

Push Notifications, Presence, Encryption and Interview Tips

Push Notifications for Offline Users

When a user is offline there is no WebSocket connection. After the message is stored in the DB, a push notification is sent to the device through FCM (Firebase Cloud Messaging) or APNs (Apple Push Notification service).

Online Presence System

  • When a user comes online, set user:ID:online = true in Redis
  • TTL: 60 seconds, refreshed by each heartbeat
  • After a disconnect the TTL expires, so the user automatically becomes "offline"
  • The last seen timestamp is stored in Redis
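
The TTL behaviour in the list above can be sketched with a small in-memory store. It only imitates what an expiring Redis key would do; `PresenceStore` and its methods are hypothetical names, and the injectable `clock` is there purely so expiry can be shown without waiting 60 seconds.

```python
import time


class PresenceStore:
    """Online flag with a TTL, plus a last-seen timestamp per user."""

    def __init__(self, ttl=60, clock=time.monotonic):
        self._ttl = ttl
        self._clock = clock
        self._expires_at = {}  # user_id -> time at which the online key expires
        self._last_seen = {}   # user_id -> time of the last heartbeat

    def heartbeat(self, user_id):
        """Refresh the TTL, as each ping would refresh the Redis key."""
        now = self._clock()
        self._expires_at[user_id] = now + self._ttl
        self._last_seen[user_id] = now

    def is_online(self, user_id):
        return self._clock() < self._expires_at.get(user_id, float("-inf"))

    def last_seen(self, user_id):
        return self._last_seen.get(user_id)


fake_now = [0.0]
presence = PresenceStore(ttl=60, clock=lambda: fake_now[0])
presence.heartbeat("alice")

fake_now[0] = 59.0
online_at_59 = presence.is_online("alice")  # still within the TTL
fake_now[0] = 61.0
online_at_61 = presence.is_online("alice")  # TTL expired, no explicit logout needed
print(online_at_59, online_at_61, presence.last_seen("alice"))
```

The design point is that no explicit "set offline" write is needed when a client vanishes without closing its connection; expiry handles it.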

End-to-End Encryption

  • WhatsApp uses the Signal Protocol
  • The server sees only ciphertext
  • Only the sender and the receiver can decrypt
  • Trade-off: the server cannot inspect message content for spam detection

Scaling Strategies

Strategy

Horizontal Chat Servers: Many chat servers are needed to hold millions of WebSocket connections. The load balancer routes each user to a consistent server (sticky sessions).

Strategy

Service Discovery via Zookeeper: Every chat server registers itself in Zookeeper. Which server a user is connected to is cached in Redis.

Strategy

Kafka for Cross-server Routing: User A (Server 1) → User B (Server 5). One Kafka topic per server. The message is published to Server 5's Kafka topic; Server 5 consumes it and pushes it to User B.

Trade-off

End-to-End Encryption: E2E encryption means the server cannot decrypt messages, which makes spam detection and moderation hard. It is a privacy vs safety trade-off.
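
For the sticky-session strategy listed above, one possible way to map a user to a consistent chat server is rendezvous (highest-random-weight) hashing. The lesson does not say which scheme the load balancer uses, so this is a swapped-in illustration rather than the described system; `pick_server` and the server names are hypothetical.

```python
import hashlib


def pick_server(user_id: str, servers: list) -> str:
    """Rendezvous hashing: the same user always maps to the same server."""

    def weight(server: str) -> int:
        digest = hashlib.sha256(f"{server}:{user_id}".encode()).digest()
        return int.from_bytes(digest[:8], "big")

    return max(servers, key=weight)


servers = ["chat-1", "chat-2", "chat-3", "chat-4"]
chosen = pick_server("user_42", servers)

# Removing some OTHER server does not move this user.
other = next(s for s in servers if s != chosen)
remaining = [s for s in servers if s != other]
print(chosen == pick_server("user_42", remaining))  # True
```

The property shown at the end is what makes this attractive for WebSocket fleets: when one server leaves, only the users that were on it have to reconnect elsewhere.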

Full Tech Stack

Backend

Erlang (Chat Servers)
Go (API Services)
WebSocket (Real-time)
Nginx + HAProxy

Data

Cassandra (Messages)
MySQL (Users)
Redis (Presence + Routing)
S3 (Media)

Infrastructure

Apache Kafka
Zookeeper (Service Discovery)
Signal Protocol (E2E Encryption)
Kubernetes

Database Choice: Which Database for What?

Data | Database | Why?
Messages | Cassandra (or HBase/Scylla) | Append-only, time-series, massive scale
User profiles | MySQL | Structured, ACID
Online status | Redis (TTL) | In-memory, fast; the TTL expires stale online status
User-server mapping | Redis | Which server holds the user's connection
Media files | S3 + CDN | Object storage
Message search | Elasticsearch | Full-text search in chat history

🎯 Interview Tips — WhatsApp Design

1) Say this first: "Real-time delivery needs WebSocket; HTTP polling will not work well."

2) Explain cross-server routing: Redis (user → server mapping) + Kafka (message routing).

3) Mention the group chat fanout problem: 1 message copy + a receipts table.

4) Offline delivery: persist in Cassandra, then push when the user comes online.

5) Mentioning Erlang earns bonus points: it handles massive numbers of concurrent connections.

009 Lesson Summary

SUMMARY: What we learned today

Challenge | Solution | Technology
Real-time delivery | Persistent WebSocket | Erlang/Go
Cross-server routing | Pub/sub messaging | Kafka
Offline delivery | Persist + deliver on reconnect | Cassandra
Online presence | Redis TTL + heartbeat | Redis
Group chat storage | 1 message + receipt table | Cassandra
Privacy | End-to-end encryption | Signal Protocol
HTTP vs WebSocket | WebSocket = persistent, bidirectional | < 50ms latency