Design WhatsApp/Chat System
Unique Challenges of a Chat System
WhatsApp handles roughly 100 billion messages per day. The hardest problem in this system is real-time delivery: a message should reach the receiver as soon as it is sent, and if the receiver is offline, it should be delivered later when they come back online.
WHATSAPP — Message Flow Overview
📌 Core Challenge
Push vs Pull: The traditional HTTP request-response model does not work well for real-time messaging. The receiver cannot keep asking "is there a new message?" all the time. The server itself has to push. That is why WebSocket is needed.
What Are the Features?
Functional Requirements
- →1-to-1 messaging
- →Group messaging (max 1000 members)
- →Online/offline status
- →Message delivery status (✓ ✓✓ 🔵)
- →Media sharing (image, video, file)
- →End-to-end encryption
- →Last seen timestamp
- →Message history
Non-Functional Requirements
- →2 billion+ users
- →Message delivery < 500ms
- →No message loss ever
- →Offline message delivery (delivered when the user comes back online)
- →High availability 99.99%
- →End-to-end encrypted
WhatsApp Scale
Storage Estimation
Daily storage: 100B messages × 1KB = 100TB/day
5-year storage: 100TB × 365 × 5 = ~182 PB
Write throughput: 1.15M msg/sec × 1KB = 1.15 GB/sec
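The estimates above can be reproduced with a few lines of arithmetic. This is a minimal sketch that assumes an average message size of 1 KB in decimal units (1 KB = 1,000 bytes); the function name `estimate` is illustrative only.

```python
# Back-of-the-envelope capacity estimate using the figures quoted above.
MESSAGES_PER_DAY = 100_000_000_000   # 100 billion messages per day
AVG_MESSAGE_BYTES = 1_000            # ~1 KB per message (assumed average)
SECONDS_PER_DAY = 24 * 60 * 60


def estimate(messages_per_day: int, avg_bytes: int, years: int = 5) -> dict:
    """Return daily storage, multi-year storage, and write throughput."""
    daily_bytes = messages_per_day * avg_bytes
    msgs_per_sec = messages_per_day / SECONDS_PER_DAY
    return {
        "daily_tb": daily_bytes / 1e12,
        "total_pb": daily_bytes * 365 * years / 1e15,
        "msgs_per_sec": msgs_per_sec,
        "write_gb_per_sec": msgs_per_sec * avg_bytes / 1e9,
    }


result = estimate(MESSAGES_PER_DAY, AVG_MESSAGE_BYTES)
```

The result gives 100 TB per day and 182.5 PB over five years, with an average of about 1.16 million messages per second. Peak traffic would be higher than this average, so real capacity planning would add headroom.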
🔢 Interesting Fact
In 2014, WhatsApp served 450M users with only about 50 engineers. It used Erlang, a language known for handling huge numbers of concurrent connections. Facebook then acquired it for $19 billion.
Chat System Architecture
The core of this architecture is a set of Chat Servers that maintain users' WebSocket connections. Users are connected to different servers, so Kafka is used for cross-server routing.
WHATSAPP: Full System Architecture (diagram: Client Layer, persistent WebSocket connection, Server Layer, Storage Layer)
Step 1: User A connects via WebSocket
When the app is opened, a persistent WebSocket connection to a Chat Server is established. The user_id → server_id mapping is saved in Redis.
Step 2: The message is sent
User A sends a message. Chat Server 1 receives it and persists it to Cassandra to ensure durability.
Step 3: Cross-server routing (if needed)
The server checks Redis to find which server User B is on. If it is the same server, it pushes directly. If it is a different server, it routes the message through Kafka.
Step 4: Deliver to User B
Chat Server 2 consumes the message from Kafka and pushes it over User B's WebSocket. The double tick (✓✓) is sent back.
Step 5: Offline handling
If User C is offline, the message stays stored in Cassandra. When the user comes online, the pending messages are fetched and pushed.
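The offline handling in Step 5 can be sketched as follows. This is an in-memory illustration, not the actual implementation: `PendingStore` stands in for the durable store (Cassandra in the text), and `push` is a hypothetical callback that writes to the user's WebSocket. A message is removed only after `push` returns, so a failed push leaves it queued.

```python
from collections import deque


class PendingStore:
    """In-memory stand-in for the durable store of undelivered messages."""

    def __init__(self):
        self._pending = {}   # user_id -> deque of undelivered messages

    def save_undelivered(self, user_id, message):
        self._pending.setdefault(user_id, deque()).append(message)

    def peek(self, user_id):
        """Return the oldest pending message without removing it."""
        queue = self._pending.get(user_id)
        return queue[0] if queue else None

    def ack(self, user_id):
        """Remove the oldest pending message after a successful push."""
        queue = self._pending.get(user_id)
        if queue:
            queue.popleft()


def on_reconnect(store, user_id, push):
    """Push every pending message, oldest first, once the user is back online."""
    delivered = 0
    while (message := store.peek(user_id)) is not None:
        push(user_id, message)   # if this raises, the message stays queued
        store.ack(user_id)
        delivered += 1
    return delivered
```

Acknowledging after the push means a crash between the two steps can deliver a message twice, so clients would need to deduplicate by message ID. That is usually preferred over the alternative of losing a message.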
WebSocket Connection Management
The HTTP request-response model is not suitable for real-time chat. A WebSocket keeps a persistent bidirectional connection open, so the server can push to the client at any time.
| Protocol | Connection | Direction | Latency | Chat Use? |
|---|---|---|---|---|
| HTTP Polling | New each time | Client → Server only | High (500ms+) | Terrible |
| Long Polling | Held open | Server can respond | Medium | Okay |
| Server-Sent Events | Persistent | Server → Client only | Low | One-way only |
| WebSocket | Persistent | Both directions | Very Low (< 50ms) | ✓ Perfect |
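import asyncio
import json
import time
import uuid

import websockets

# NOTE: `redis`, `kafka`, `db`, `authenticate` and SERVER_ID are assumed to be
# defined elsewhere (async clients, an auth helper, and this server's ID).
# The Redis client is assumed to return str values (e.g. decode_responses=True).

# Active connections on this server: user_id -> websocket
connections = {}


async def handle_connection(websocket, path):
    user_id = await authenticate(websocket)
    # Register the connection
    connections[user_id] = websocket
    await redis.set(f"user:{user_id}:server", SERVER_ID)
    await redis.set(f"user:{user_id}:online", "true")
    try:
        async for raw_msg in websocket:
            msg = json.loads(raw_msg)
            await send_message(
                sender_id=user_id,
                receiver_id=msg['to'],
                content=msg['content']
            )
    finally:
        # Clean up on disconnect; pop() avoids a KeyError if already removed
        connections.pop(user_id, None)
        await redis.delete(f"user:{user_id}:server")
        await redis.set(f"user:{user_id}:online", "false")
        await redis.set(f"user:{user_id}:last_seen", int(time.time()))


async def send_message(sender_id, receiver_id, content):
    # Time-based (version 1) UUID, matching a TIMEUUID column
    msg_id = str(uuid.uuid1())
    # 1. Persist to the DB first
    await db.save_message(msg_id, sender_id, receiver_id, content)
    # 2. Which server holds the receiver? Check Redis
    receiver_server = await redis.get(f"user:{receiver_id}:server")
    if receiver_server is None:
        # Receiver is offline: the message is persisted and delivered on reconnect
        return
    if receiver_server == SERVER_ID:
        # Same server → direct push
        ws = connections.get(receiver_id)
        if ws:
            await ws.send(json.dumps({"msg_id": msg_id, "content": content}))
    else:
        # Different server → route through Kafka.
        # Kafka topic names allow only letters, digits, '.', '_' and '-', so ':' is avoided.
        await kafka.publish(f"chat.{receiver_server}", {
            "receiver_id": receiver_id, "msg_id": msg_id, "content": content
        })
💡 Heartbeat Mechanism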
A ping/pong heartbeat is needed to keep the WebSocket connection alive. The client sends a ping every 30 seconds and the server replies with a pong. Each ping refreshes the user's online key in Redis. If no response arrives for 60 seconds, the connection is considered dead and the user is marked "offline".
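A minimal sketch of this heartbeat logic is shown below. It keeps the last ping time in a local dictionary instead of Redis, and the class name `HeartbeatTracker` and the injectable `clock` parameter are assumptions made for illustration and testing.

```python
import time

PING_INTERVAL_SEC = 30   # how often the client is expected to ping
TIMEOUT_SEC = 60         # silence after which the connection is treated as dead


class HeartbeatTracker:
    """Tracks the last ping per user and reports who has timed out."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._last_seen = {}   # user_id -> time of the last ping

    def on_ping(self, user_id):
        """Record the ping and return the reply the server would send."""
        self._last_seen[user_id] = self._clock()
        return "pong"

    def is_online(self, user_id):
        last = self._last_seen.get(user_id)
        return last is not None and self._clock() - last < TIMEOUT_SEC

    def reap_dead(self):
        """Remove and return users whose last ping is at least TIMEOUT_SEC old."""
        now = self._clock()
        dead = [u for u, t in self._last_seen.items() if now - t >= TIMEOUT_SEC]
        for user_id in dead:
            del self._last_seen[user_id]
        return dead
```

With Redis, the same effect is usually achieved by setting the online key with a 60-second expiry on every ping, so no explicit reaping loop is needed.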
Message Storage এবং Delivery Receipts
Message Delivery Status — ✓ ✓✓ 🔵
| Status | Meaning | When? |
|---|---|---|
| ✓ (single tick) | Server received | When the message is saved to the DB |
| ✓✓ (double tick) | Delivered to device | When it reaches the receiver's phone |
| 🔵 (blue tick) | Read | When the receiver opens the message |
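The three states in the table form a one-way progression. The sketch below models them as an ordered enum so that a late or duplicated receipt can never move a message backwards; the names `Status` and `advance` are illustrative, not part of any real API.

```python
from enum import IntEnum


class Status(IntEnum):
    SENT = 1        # single tick: the server stored the message
    DELIVERED = 2   # double tick: the message reached the receiver's device
    READ = 3        # blue tick: the receiver opened the message


def advance(current: Status, event: Status) -> Status:
    """Apply a receipt event and return the new status, never moving backwards."""
    return max(current, event)
```

For example, if a "delivered" receipt arrives after a "read" receipt because of network reordering, `advance(Status.READ, Status.DELIVERED)` keeps the status at `READ`.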
Cassandra Message Schema
Messages are time-series data. In Cassandra, the partition key is chat_id and the clustering key is message_id DESC, so the latest messages come first and pagination is easy.
-- 1-to-1 Messages Table
CREATE TABLE messages (
    chat_id     UUID,
    message_id  TIMEUUID,   -- Time-ordered UUID
    sender_id   UUID,
    receiver_id UUID,
    content     TEXT,
    msg_type    TEXT,       -- 'text', 'image', 'video'
    status      TEXT,       -- 'sent', 'delivered', 'read'
    created_at  TIMESTAMP,
    PRIMARY KEY (chat_id, message_id)
) WITH CLUSTERING ORDER BY (message_id DESC);
-- Partition by chat_id = the whole conversation lives in one partition
-- Latest messages first (DESC) = efficient pagination

-- Group Messages (1 copy per message, not per user)
CREATE TABLE group_messages (
    group_id   UUID,
    message_id TIMEUUID,
    sender_id  UUID,
    content    TEXT,
    sent_at    TIMESTAMP,
    PRIMARY KEY (group_id, message_id)
) WITH CLUSTERING ORDER BY (message_id DESC);

-- Per-user delivery tracking (NOT a message copy)
CREATE TABLE message_receipts (
    message_id   TIMEUUID,  -- same type as message_id in the message tables
    user_id      UUID,
    delivered_at TIMESTAMP,
    read_at      TIMESTAMP,
    PRIMARY KEY (message_id, user_id)
);
-- 1000 member group = 1 msg + 999 receipt rows (not 1000 msg copies)
💡 Message Schema in Cassandra
Partition key = (chat_id), clustering key = (message_id DESC). This puts the latest messages first, makes pagination simple, and stores time-ordered messages efficiently.
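To see why this ordering makes pagination simple, the sketch below models one chat partition as a list sorted newest-first and pages through it with a cursor. It is an in-memory illustration only: integers stand in for TIMEUUID values, and `fetch_page` is a hypothetical helper. Against Cassandra, the same idea would be a range query on the clustering column, roughly `WHERE chat_id = ? AND message_id < ? LIMIT ?`.

```python
def fetch_page(partition, limit, before=None):
    """Return up to `limit` messages older than `before`, plus the next cursor.

    partition: list of (message_id, content) tuples sorted newest-first.
    before:    message_id of the last row already shown, or None for the first page.
    """
    rows = [row for row in partition if before is None or row[0] < before]
    page = rows[:limit]
    # A cursor is returned only when older rows remain after this page.
    next_cursor = page[-1][0] if len(rows) > limit else None
    return page, next_cursor


chat = [(5, "e"), (4, "d"), (3, "c"), (2, "b"), (1, "a")]
first_page, cursor = fetch_page(chat, limit=2)
second_page, cursor2 = fetch_page(chat, limit=2, before=cursor)
```

Because the cursor is a message ID rather than an offset, new messages arriving at the top of the chat do not shift the pages the user has already scrolled through.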
Group Chat Architecture এবং Pub/Sub
⚠️ Group Chat Fanout Problem
In a 1000 member group, 1 message has to be delivered to 999 users. This is the fan-out problem. Solution: keep 1 copy of the message in the DB and track read status per user. Do not copy the message; keep a pointer to it.
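The "one copy plus per-user receipts" idea can be sketched in memory as follows. `GroupStore` and its dictionaries are stand-ins for the message and receipt tables, and all names are illustrative.

```python
from dataclasses import dataclass, field


@dataclass
class GroupStore:
    # msg_id -> (group_id, sender_id, content): exactly one row per message
    messages: dict = field(default_factory=dict)
    # (msg_id, user_id) -> delivery flags: one small row per recipient
    receipts: dict = field(default_factory=dict)

    def send_to_group(self, msg_id, group_id, sender_id, content, members):
        """Store the message once and create a receipt for every other member."""
        self.messages[msg_id] = (group_id, sender_id, content)
        for user_id in members:
            if user_id != sender_id:
                self.receipts[(msg_id, user_id)] = {"delivered": False, "read": False}

    def mark_read(self, msg_id, user_id):
        """Reading implies delivery, so both flags are set."""
        receipt = self.receipts[(msg_id, user_id)]
        receipt["delivered"] = True
        receipt["read"] = True


store = GroupStore()
members = [f"u{i}" for i in range(1000)]
store.send_to_group("m1", "g1", "u0", "hello", members)
```

After this call the store holds 1 message row and 999 receipt rows. The message body, which may be large, is written once, while each receipt carries only a pair of flags.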
-- Group messages: 1 row per message (not per user)
CREATE TABLE group_messages (
    msg_id    UUID PRIMARY KEY,
    group_id  UUID,
    sender_id UUID,
    content   TEXT,
    sent_at   TIMESTAMP
);

-- Per-user delivery tracking (not message copy)
CREATE TABLE message_receipts (
    msg_id       UUID,
    user_id      UUID,
    delivered_at TIMESTAMP,
    read_at      TIMESTAMP,
    PRIMARY KEY (msg_id, user_id)
);
-- 1000 member group = 1 msg + 999 receipt rows (not 1000 msg copies)
Kafka Pub/Sub: Cross-server Message Routing
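Each Chat Server has its own Kafka topic. To send from User A (Server 1) to User B (Server 5), Server 1 publishes to Server 5's topic, for example "chat.server-5". Server 5 consumes that topic and pushes the message to User B. Kafka topic names may contain only letters, digits, '.', '_' and '-', so a name such as "chat:server-5" would be rejected.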
KAFKA: Cross-Server Routing Flow (diagram: Chat Server 1, User A connected → Kafka topic "chat.server-5" → Chat Server 5, User B connected → User B gets the message)
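The routing flow above can be simulated end to end without a real Kafka cluster. In this sketch, `InMemoryBroker` stands in for Kafka, a plain dictionary stands in for the Redis user-to-server mapping, and a list per user stands in for the WebSocket. The topic naming scheme `chat.<server_id>` is an assumption.

```python
from collections import defaultdict, deque


class InMemoryBroker:
    """Stand-in for Kafka: one FIFO queue per topic."""

    def __init__(self):
        self.topics = defaultdict(deque)

    def publish(self, topic, event):
        self.topics[topic].append(event)

    def poll(self, topic):
        queue = self.topics[topic]
        return queue.popleft() if queue else None


class ChatServer:
    def __init__(self, server_id, broker, registry):
        self.server_id = server_id
        self.broker = broker
        self.registry = registry   # user_id -> server_id (Redis in the text)
        self.inboxes = {}          # user_id -> received messages (stand-in for WebSockets)

    def connect(self, user_id):
        self.inboxes[user_id] = []
        self.registry[user_id] = self.server_id

    def send(self, receiver_id, content):
        """Deliver locally, route through the broker, or report the user offline."""
        target = self.registry.get(receiver_id)
        if target is None:
            return "offline"
        if target == self.server_id:
            self.inboxes[receiver_id].append(content)
            return "direct"
        self.broker.publish(f"chat.{target}", {"to": receiver_id, "content": content})
        return "routed"

    def consume(self):
        """Drain this server's own topic and push each event to the local user."""
        topic = f"chat.{self.server_id}"
        while (event := self.broker.poll(topic)) is not None:
            inbox = self.inboxes.get(event["to"])
            if inbox is not None:
                inbox.append(event["content"])


broker, registry = InMemoryBroker(), {}
server1 = ChatServer("server-1", broker, registry)
server5 = ChatServer("server-5", broker, registry)
server1.connect("A")
server5.connect("B")
outcome = server1.send("B", "hi")
server5.consume()
```

Here `outcome` is `"routed"`, and the message appears in User B's inbox only after Server 5 consumes its topic. One topic per server keeps each server's consumer simple, at the cost of creating and removing topics as servers come and go.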
Push Notifications, Presence, Encryption এবং Interview Tips
Push Notifications: For Offline Users
When a user is offline there is no WebSocket. After the message is stored in the DB, a push notification is sent to the device through FCM (Firebase Cloud Messaging) or APNs (Apple Push Notification Service).
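The choice between a WebSocket push and a push notification can be sketched as a small decision function. Everything here is hypothetical: the function names, the platform-to-provider mapping, and the payload shape are illustrative and do not reflect the real FCM or APNs request formats.

```python
def choose_channel(is_connected, push_token):
    """Pick how to notify the receiver after the message has been persisted."""
    if is_connected:
        return "websocket"
    if push_token:
        return "push_notification"
    return "store_only"   # nothing to do until the user reconnects


def build_notification(sender_name, platform):
    """Build a provider-agnostic payload that carries no message text."""
    provider = {"android": "FCM", "ios": "APNs"}[platform]
    return {"provider": provider, "title": sender_name, "body": "New message"}
```

The payload deliberately contains a generic body instead of the message content, since with end-to-end encryption the server only holds ciphertext.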
Online Presence System
- →When a user comes online, set user:ID:online = true in Redis
- →TTL: 60 seconds, refreshed by the heartbeat
- →On disconnect the TTL expires → automatically "offline"
- →Last seen timestamp is stored in Redis
End-to-End Encryption
- →WhatsApp uses the Signal Protocol
- →The server sees only ciphertext
- →Only the sender and the receiver can decrypt
- →Trade-off: the server cannot do content-based spam detection
Scaling Strategies
Horizontal Chat Servers: Many chat servers are needed to hold millions of WebSocket connections. The load balancer routes each user to a consistent server (sticky sessions).
Service Discovery via ZooKeeper: Every chat server registers itself in ZooKeeper. Which server a user is connected to is cached in Redis.
Kafka for Cross-server Routing: User A (Server 1) → User B (Server 5). One Kafka topic per server. The message is published to Server 5's Kafka topic, and Server 5 consumes it and pushes it to User B.
End-to-End Encryption: E2E encryption means the server cannot decrypt messages. Spam detection and moderation become harder. It is a privacy vs safety trade-off.
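One possible way to route a user to a consistent server is consistent hashing. The text only says "sticky sessions", so this technique is an assumption rather than the documented mechanism. The sketch below is a minimal hash ring with virtual nodes; `HashRing` and its parameters are illustrative.

```python
import bisect
import hashlib


class HashRing:
    """Consistent hash ring: each server owns many points on a circle of hashes."""

    def __init__(self, servers, vnodes=100):
        self._ring = []   # sorted list of (hash, server)
        for server in servers:
            for i in range(vnodes):
                self._ring.append((self._hash(f"{server}#{i}"), server))
        self._ring.sort()
        self._keys = [h for h, _ in self._ring]

    @staticmethod
    def _hash(key):
        # First 8 bytes of SHA-256 as an integer: stable across processes
        return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")

    def server_for(self, user_id):
        """Return the server owning the first ring point after the user's hash."""
        idx = bisect.bisect_right(self._keys, self._hash(user_id)) % len(self._ring)
        return self._ring[idx][1]


ring = HashRing(["s1", "s2", "s3"])
bigger_ring = HashRing(["s1", "s2", "s3", "s4"])
```

The same user always maps to the same server, and when a server is added, a user either stays where they were or moves to the new server. That limits how many WebSocket connections have to be re-established during scaling.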
Full Tech Stack (diagram: Backend, Data, Infrastructure)
Database Choice: Which Database for What?
| Data | Database | Why? |
|---|---|---|
| Messages | Cassandra (HBase/Scylla) | Append-only, time-series, massive scale |
| User profiles | MySQL | Structured, ACID |
| Online status | Redis (TTL) | In-memory, fast, TTL expires stale online flags |
| User-server mapping | Redis | Which server holds user's connection |
| Media files | S3 + CDN | Object storage |
| Message search | Elasticsearch | Full-text search in chat history |
🎯 Interview Tips — WhatsApp Design
1) Say this first: "WebSocket is needed for real-time delivery; HTTP polling will not work."
2) Explain cross-server routing: Redis (user → server mapping) + Kafka (message routing).
3) Mention the group chat fan-out problem: 1 message copy + a receipts table.
4) Offline delivery: persist in Cassandra + push when the user comes online.
5) Mentioning Erlang earns bonus points: it handles massive numbers of concurrent connections.
SUMMARY: What We Learned Today
| Challenge | Solution | Technology |
|---|---|---|
| Real-time delivery | Persistent WebSocket | Erlang/Go |
| Cross-server routing | Pub/sub messaging | Kafka |
| Offline delivery | Persist + deliver on reconnect | Cassandra |
| Online presence | Redis TTL + heartbeat | Redis |
| Group chat storage | 1 message + receipt table | Cassandra |
| Privacy | End-to-end encryption | Signal Protocol |
| HTTP vs WebSocket | WebSocket = persistent, bidirectional | < 50ms latency |