Cloud Architecture & IaC (AWS, Terraform, Serverless)
Why Did the Cloud Change Everything?
In the early 2000s, launching a startup meant buying physical servers, renting data center space, and configuring the network first, spending hundreds of thousands of taka before a single line of software was written. The cloud changed that reality completely.
❌ Before Cloud: On-Premise
- 3-6 months of lead time to buy hardware
- Had to over-provision for peak load
- Hardware failure = downtime
- Capacity planning mistakes were costly
- Global expansion = a new data center
- Needed an ops team on call around the clock
- Huge CapEx (Capital Expenditure)
✅ After Cloud: AWS / GCP / Azure
- Spin up servers in minutes
- Auto-scale as traffic grows
- The provider handles hardware failures
- Pay only for what you use (OpEx)
- 20+ global regions available
- Managed services (RDS, SQS, Lambda)
- Focus on product, not infrastructure
📌 The 3 Cloud Service Models: IaaS, PaaS, SaaS
IaaS (Infrastructure as a Service): Rents out raw infrastructure. You manage the OS, runtime, and application yourself. Examples: AWS EC2, GCP Compute Engine, Azure VM. Most control, most responsibility.
PaaS (Platform as a Service): Provides a platform; you only deploy code. The OS, runtime, and scaling are managed for you. Examples: AWS Elastic Beanstalk, Heroku, Google App Engine. Higher developer productivity, but less control.
SaaS (Software as a Service): Complete software delivered as a subscription. There is no infrastructure to manage. Examples: Gmail, Slack, Salesforce, GitHub. End-user software that you simply use.
| Model | You manage | Provider manages | Example |
|---|---|---|---|
| IaaS | OS, Runtime, App, Data | Virtualization, Network, Storage, Hardware | EC2, Azure VM |
| PaaS | App, Data | OS, Runtime, Middleware, Infra | Heroku, Beanstalk |
| SaaS | Nothing (just use it) | Everything | Gmail, Slack |
AWS Core Services — System Design View
AWS offers 200+ services. For system design, knowing the core services is enough. Each service solves a specific problem.
AWS Architecture — Core Services Overview
| Category | AWS Service | Use Case |
|---|---|---|
| Compute | EC2 / Lambda / ECS / EKS | Server, serverless function, container, Kubernetes |
| Storage | S3 / EBS / EFS | Object storage, block disk (EC2), shared file system |
| Database | RDS / DynamoDB / ElastiCache | Managed SQL, NoSQL key-value, in-memory cache |
| Network | VPC / CloudFront / Route53 | Isolated network, CDN edge caching, DNS routing |
| Queue/Event | SQS / SNS / EventBridge | Message queue, pub/sub notification, event bus |
| Security | IAM / KMS / WAF | Identity management, encryption keys, firewall |
Serverless Architecture — Lambda & FaaS
Serverless does not mean there are no servers; it means you do not manage them. AWS Lambda is the most popular FaaS (Function as a Service) offering. You write only the function, and AWS handles execution, scaling, and patching.
- ⚡ Event-driven: Triggered by any event, such as an HTTP request, S3 upload, SQS message, or scheduled cron.
- 💰 Pay-per-use: You are charged only while the function runs. There is no cost while idle. The free tier includes 1M invocations/month.
- 📈 Auto-scaling: Concurrent invocations scale automatically. 1 request or 1M requests, same code.
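The pay-per-use model can be made concrete with a rough cost estimate. The sketch below is illustrative only: the price constants and the 400,000 GB-second free compute allowance are assumptions based on commonly published x86 rates and vary by region and over time, and `estimate_lambda_cost` is a hypothetical helper, not an AWS API.

```python
# Assumed prices (USD); check the current AWS pricing page for your region.
PRICE_PER_MILLION_REQUESTS = 0.20
PRICE_PER_GB_SECOND = 0.0000166667
FREE_REQUESTS = 1_000_000      # monthly free tier, requests
FREE_GB_SECONDS = 400_000      # monthly free tier, compute (assumed)


def estimate_lambda_cost(invocations, avg_duration_ms, memory_mb):
    """Estimate a monthly Lambda bill from request count, duration, and memory."""
    # Compute usage is billed in GB-seconds: duration x configured memory.
    gb_seconds = invocations * (avg_duration_ms / 1000) * (memory_mb / 1024)

    billable_requests = max(0, invocations - FREE_REQUESTS)
    billable_gb_seconds = max(0.0, gb_seconds - FREE_GB_SECONDS)

    request_cost = billable_requests / 1_000_000 * PRICE_PER_MILLION_REQUESTS
    compute_cost = billable_gb_seconds * PRICE_PER_GB_SECOND
    return round(request_cost + compute_cost, 2)


if __name__ == "__main__":
    # 5M invocations/month, 200 ms average, 512 MB memory
    print(estimate_lambda_cost(5_000_000, 200, 512))
```

An idle function (zero invocations) costs nothing under this model, which is the key difference from an always-on EC2 instance.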
import json
from datetime import datetime, timezone

import boto3

# DynamoDB table handle (authenticated automatically via the Lambda's IAM role).
# Created once at import time so warm invocations reuse the connection.
dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('users')

JSON_HEADERS = {'Content-Type': 'application/json'}


# AWS Lambda function, triggered from API Gateway
def lambda_handler(event, context):
    """
    event: API Gateway request (HTTP method, path, body, headers)
    context: Lambda runtime info (timeout, memory, request_id)
    """
    http_method = event.get('httpMethod', 'GET')
    path = event.get('path', '/')
    body = json.loads(event.get('body') or '{}')

    if http_method == 'POST' and path == '/users':
        # Create a user
        user_id = body.get('user_id')
        name = body.get('name')
        if not user_id or not name:
            return {
                'statusCode': 400,
                'headers': JSON_HEADERS,
                'body': json.dumps({'error': 'user_id and name are required'})
            }
        table.put_item(Item={
            'user_id': user_id,
            'name': name,
            'created_at': datetime.now(timezone.utc).isoformat()
        })
        return {
            'statusCode': 201,
            'headers': JSON_HEADERS,
            'body': json.dumps({'message': 'User created', 'user_id': user_id})
        }

    elif http_method == 'GET' and path.startswith('/users/'):
        # Fetch a user
        user_id = path.split('/')[-1]
        response = table.get_item(Key={'user_id': user_id})
        user = response.get('Item')
        if not user:
            return {'statusCode': 404, 'headers': JSON_HEADERS,
                    'body': json.dumps({'error': 'Not found'})}
        return {
            'statusCode': 200,
            'headers': JSON_HEADERS,
            # default=str: DynamoDB returns numbers as Decimal, which json cannot serialize
            'body': json.dumps(user, default=str)
        }

    return {'statusCode': 400, 'headers': JSON_HEADERS,
            'body': json.dumps({'error': 'Bad request'})}

# Lambda advantages:
# - No servers to manage
# - Auto-scales from 0 to thousands of concurrent executions (subject to account limits)
# - Pay only for execution time (billed in 1 ms increments)
# - Max execution: 15 minutes, max memory: 10 GB
Client: HTTP Request
A POST /users request arrives from the user's app. DNS (Route53) resolves the API Gateway endpoint.
API Gateway: Request Validation
API Gateway receives the request and performs auth checks (JWT/API key), rate limiting, and request validation. If the request is valid, it invokes Lambda.
AWS Lambda: Cold/Warm Start
Lambda either spins up a new execution environment (a cold start, typically ~100-500ms and sometimes longer) or reuses an existing warm one. The function then executes.
DynamoDB: Data Persist
Lambda writes the data to DynamoDB. The IAM role handles authentication automatically. It is a serverless DB, so there is no connection pool to manage.
Response: API Gateway → Client
Lambda returns its response (statusCode, body) to API Gateway, which formats it as an HTTP response and sends it to the client.
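The request flow above can be dry-run locally without any AWS account. The sketch below is a simplified illustration, not the handler shown earlier: `InMemoryTable` is a hypothetical stand-in for a DynamoDB table, `handle` is an assumed routing function, and the event dictionaries only mimic the fields of an API Gateway proxy event that the flow uses.

```python
import json


class InMemoryTable:
    """Hypothetical stand-in for a DynamoDB table, for local testing only."""

    def __init__(self):
        self._items = {}

    def put_item(self, Item):
        self._items[Item["user_id"]] = Item

    def get_item(self, Key):
        item = self._items.get(Key["user_id"])
        return {"Item": item} if item else {}


def handle(event, table):
    """Route an API-Gateway-style event and return a Lambda-style response."""
    method = event.get("httpMethod", "GET")
    path = event.get("path", "/")
    body = json.loads(event.get("body") or "{}")

    if method == "POST" and path == "/users":
        table.put_item(Item={"user_id": body["user_id"], "name": body["name"]})
        return {"statusCode": 201, "body": json.dumps({"user_id": body["user_id"]})}

    if method == "GET" and path.startswith("/users/"):
        user_id = path.rsplit("/", 1)[-1]
        found = table.get_item(Key={"user_id": user_id}).get("Item")
        if found is None:
            return {"statusCode": 404, "body": json.dumps({"error": "Not found"})}
        return {"statusCode": 200, "body": json.dumps(found)}

    return {"statusCode": 400, "body": json.dumps({"error": "Bad request"})}


# Simulate: client -> API Gateway -> Lambda -> table -> response
table = InMemoryTable()
created = handle({"httpMethod": "POST", "path": "/users",
                  "body": json.dumps({"user_id": "u1", "name": "Ada"})}, table)
fetched = handle({"httpMethod": "GET", "path": "/users/u1"}, table)
missing = handle({"httpMethod": "GET", "path": "/users/nope"}, table)
```

Passing the table in as a parameter is a design choice made here for testability; a deployed handler would obtain it from boto3 instead.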
| Feature | Serverless (Lambda) | Containers (ECS) | VMs (EC2) |
|---|---|---|---|
| Cost Model | Pay per request | Pay per container runtime | Pay per hour (even idle) |
| Cold Start | 100ms - 2s (issue) | Seconds (pre-warmed) | Minutes (rare) |
| Scaling | Instant auto-scale | Minutes (ECS task) | Minutes (new EC2) |
| Ops Overhead | Near zero | Medium (Dockerfile) | High (OS patches) |
| Max Runtime | 15 minutes | Unlimited | Unlimited |
| Best For | Event processing, APIs | Long-running services | Legacy apps, full control |
Infrastructure as Code — Terraform
Building infrastructure with manual clicks in the AWS console gives you no reproducibility, no version control, and makes team collaboration hard. Defining infrastructure as code with Terraform solves all of these problems.
❌ Manual Setup: Problems
- Staging ≠ Production (drift)
- No record of who changed what
- Disaster recovery is hard
- New region setup = manual redo
- Security misconfiguration is easy
✅ Terraform IaC: Benefits
- Code = infrastructure (git versioned)
- terraform plan = preview changes
- Reproducible environments
- Multi-region = variable swap
- PR review = infra review
# Provider configuration
provider "aws" {
  region = var.aws_region # e.g. "ap-southeast-1" for Singapore
}

# Data sources referenced below. The variables (var.*) and the security groups
# aws_security_group.app / aws_security_group.db are assumed to be defined elsewhere.
data "aws_availability_zones" "available" {
  state = "available"
}

data "aws_ami" "amazon_linux" {
  most_recent = true
  owners      = ["amazon"]

  filter {
    name   = "name"
    values = ["al2023-ami-2023.*-x86_64"] # assumed name pattern (Amazon Linux 2023)
  }
}

# ============================================================
# VPC: Isolated Network
# ============================================================
resource "aws_vpc" "main" {
  cidr_block           = "10.0.0.0/16"
  enable_dns_hostnames = true
  enable_dns_support   = true

  tags = {
    Name        = "${var.project_name}-vpc"
    Environment = var.environment
  }
}

# Public subnet (EC2, Load Balancer)
resource "aws_subnet" "public" {
  count                   = 2
  vpc_id                  = aws_vpc.main.id
  cidr_block              = "10.0.${count.index}.0/24"
  availability_zone       = data.aws_availability_zones.available.names[count.index]
  map_public_ip_on_launch = true

  tags = { Name = "${var.project_name}-public-${count.index}" }
}

# Private subnet (RDS, ElastiCache)
resource "aws_subnet" "private" {
  count             = 2
  vpc_id            = aws_vpc.main.id
  cidr_block        = "10.0.${count.index + 10}.0/24"
  availability_zone = data.aws_availability_zones.available.names[count.index]

  tags = { Name = "${var.project_name}-private-${count.index}" }
}

# ============================================================
# EC2: Application Server (Auto Scaling Group)
# ============================================================
resource "aws_launch_template" "app" {
  name_prefix            = "${var.project_name}-"
  image_id               = data.aws_ami.amazon_linux.id
  instance_type          = var.instance_type # e.g. "t3.medium"
  vpc_security_group_ids = [aws_security_group.app.id]

  user_data = base64encode(<<-EOF
    #!/bin/bash
    yum update -y
    yum install -y docker
    systemctl start docker
    docker run -d -p 80:8080 ${var.app_image}
    EOF
  )

  tag_specifications {
    resource_type = "instance"
    tags          = { Name = "${var.project_name}-app" }
  }
}

resource "aws_autoscaling_group" "app" {
  name                = "${var.project_name}-asg"
  desired_capacity    = var.desired_capacity
  min_size            = var.min_capacity
  max_size            = var.max_capacity
  vpc_zone_identifier = aws_subnet.public[*].id

  launch_template {
    id      = aws_launch_template.app.id
    version = "$Latest"
  }
}

# ============================================================
# RDS: Managed PostgreSQL
# ============================================================
resource "aws_db_subnet_group" "main" {
  name       = "${var.project_name}-db-subnet"
  subnet_ids = aws_subnet.private[*].id
}

resource "aws_db_instance" "postgres" {
  identifier        = "${var.project_name}-db"
  engine            = "postgres"
  engine_version    = "15.3"
  instance_class    = "db.t3.medium"
  allocated_storage = 100
  storage_encrypted = true

  db_name  = var.db_name
  username = var.db_username
  password = var.db_password # Use AWS Secrets Manager in production!

  db_subnet_group_name   = aws_db_subnet_group.main.name
  vpc_security_group_ids = [aws_security_group.db.id]

  multi_az            = true # High availability across 2 AZs
  skip_final_snapshot = false
  # Required when skip_final_snapshot = false, otherwise destroy fails
  final_snapshot_identifier = "${var.project_name}-db-final"
  deletion_protection       = true

  tags = { Name = "${var.project_name}-postgres" }
}

# Output: RDS endpoint (use it in the app config)
output "db_endpoint" {
  value     = aws_db_instance.postgres.endpoint
  sensitive = true
}
💡 Terraform State Management: S3 Backend
Terraform maintains a state file (terraform.tfstate), a snapshot of the current infrastructure. Keeping it on a local machine makes team collaboration difficult.
Production solution: use an S3 backend. State is stored in S3, and a DynamoDB table provides locking so that two people cannot run apply at the same time:
terraform {
  backend "s3" {
    bucket         = "my-terraform-state"
    key            = "prod/terraform.tfstate"
    region         = "ap-southeast-1"
    encrypt        = true
    dynamodb_table = "terraform-lock"
  }
}
Multi-Region Architecture
With a single region, a region outage takes everything down. A multi-region architecture achieves higher availability and lower latency globally.
| Feature | Active-Active | Active-Passive |
|---|---|---|
| Traffic Handling | All regions serve live traffic | Primary region serves traffic; secondary is on standby |
| Failover Time | Near-instant (no failover needed) | Minutes (DNS TTL + failover) |
| Cost | Higher (two full setups) | Lower (secondary idle/minimal) |
| Consistency | Complex: conflict resolution required | Simpler: one write region |
| Use Case | Global apps, low latency worldwide | DR (Disaster Recovery), compliance |
[Diagram: Multi-Region Architecture, Active-Active]
⚠️ Multi-Region Challenges
Data Sovereignty: Some jurisdictions restrict where user data may be stored or transferred (for example, GDPR in the EU and India's DPDP Act). Verify data residency compliance before committing to a multi-region design.
Consistency Challenge: Writing the same data in two regions can cause conflicts. You have to choose a strategy: last-write-wins, CRDTs, or a single primary write region. Under the CAP theorem, network partitions between regions cannot be ruled out, so the design must decide between consistency and availability when one occurs.
Latency Trade-off: With asynchronous cross-region replication, the RPO (Recovery Point Objective) is typically a few seconds. With synchronous replication, write latency increases.
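To make the CRDT option less abstract, here is a minimal sketch of a grow-only counter (G-Counter), one of the simplest CRDTs. It is an illustration under stated assumptions, not a production replication scheme: the `GCounter` class and the region names are hypothetical, and real systems also need transport, persistence, and support for decrements or richer data types.

```python
from dataclasses import dataclass, field


@dataclass
class GCounter:
    """Grow-only counter: each region increments only its own slot."""
    region: str
    counts: dict = field(default_factory=dict)

    def increment(self, amount=1):
        self.counts[self.region] = self.counts.get(self.region, 0) + amount

    def merge(self, other):
        # Element-wise max is commutative, associative, and idempotent,
        # so replicas converge regardless of merge order or repetition.
        for region, count in other.counts.items():
            self.counts[region] = max(self.counts.get(region, 0), count)

    @property
    def value(self):
        return sum(self.counts.values())


# Two regions accept writes independently, then exchange state.
sg = GCounter("ap-southeast-1")
eu = GCounter("eu-central-1")
sg.increment(3)
eu.increment(2)
sg.merge(eu)
eu.merge(sg)
```

After the exchange both replicas report the same total without any coordination on the write path. Last-write-wins, by contrast, would have kept only one region's update.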
Cloud Cost Optimization
The biggest mistake in the cloud is failing to optimize cost. New startups are often shocked by their AWS bill. Choosing the right instance type and pricing model can cut costs by roughly 50-70%.
| Pricing Model | Cost | Commitment | Best For |
|---|---|---|---|
| On-Demand | Highest (baseline) | No commitment | Unpredictable, short-term workloads |
| Reserved | ~40-60% cheaper | 1 or 3 year commitment | Steady-state production workloads |
| Savings Plans | ~40-66% cheaper | 1 or 3 year (flexible) | Lambda, Fargate, EC2 (flexible) |
| Spot Instances | ~70-90% cheaper | Interruptible (2-min notice) | Batch jobs, ML training, fault-tolerant |
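The table can be turned into a quick back-of-the-envelope comparison. In the sketch below the discount figures are assumed points within the ranges above, the hourly rate in the example is hypothetical, and `monthly_cost` is an illustrative helper; real discounts depend on instance family, term, and payment option.

```python
HOURS_PER_MONTH = 730  # average hours in a month

# Assumed discounts relative to On-Demand, picked from the ranges in the table.
DISCOUNTS = {
    "on_demand": 0.0,
    "reserved": 0.50,
    "savings_plan": 0.50,
    "spot": 0.80,
}


def monthly_cost(hourly_rate, instances, model):
    """Estimate the monthly cost of a fleet under a given pricing model."""
    discount = DISCOUNTS[model]
    return round(hourly_rate * (1 - discount) * HOURS_PER_MONTH * instances, 2)


if __name__ == "__main__":
    # Hypothetical fleet: 10 instances at $0.10/hour On-Demand
    for model in DISCOUNTS:
        print(f"{model:>12}: ${monthly_cost(0.10, 10, model):,.2f}/month")
```

A common approach is to cover the steady baseline with Reserved capacity or Savings Plans and use On-Demand or Spot for the variable part.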
Auto Scaling Strategy — Cost vs Performance
Target Tracking
Keep CPU at a 60% target. Instances are added when traffic rises and removed when it falls. The simplest approach.
Scheduled Scaling
10 instances during office hours, 2 instances at night. Effective for predictable patterns.
Step Scaling
CPU 70% = +2 instances, CPU 90% = +5 instances. A graduated response.
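The arithmetic behind target tracking and step scaling can be sketched in a few lines. This is a simplified model, not AWS's actual algorithm: the proportional formula and the function names are assumptions for illustration, and the step thresholds are the ones from the text above.

```python
import math


def target_tracking_desired(current_instances, current_cpu, target_cpu=60.0,
                            min_size=1, max_size=50):
    """Proportional estimate: capacity needed to bring average CPU back to target."""
    raw = math.ceil(current_instances * current_cpu / target_cpu)
    # Clamp to the group's configured bounds.
    return max(min_size, min(max_size, raw))


def step_scaling_adjustment(cpu):
    """Graduated response: higher CPU adds more instances at once."""
    if cpu >= 90:
        return 5
    if cpu >= 70:
        return 2
    return 0


if __name__ == "__main__":
    print(target_tracking_desired(4, 90))   # 4 instances at 90% CPU
    print(step_scaling_adjustment(75))
```

With 4 instances at 90% CPU and a 60% target, the proportional estimate asks for 6 instances, since 4 x 90 / 60 = 6.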
# Target Tracking Auto Scaling Policy
resource "aws_autoscaling_policy" "cpu_target" {
  name                   = "${var.project_name}-cpu-target"
  autoscaling_group_name = aws_autoscaling_group.app.name
  policy_type            = "TargetTrackingScaling"

  target_tracking_configuration {
    predefined_metric_specification {
      predefined_metric_type = "ASGAverageCPUUtilization"
    }
    # CPU 60% target: above it scales out, below it scales in
    target_value = 60.0
    # No cooldowns are configured here. Target tracking scales out quickly on a
    # traffic spike and scales in more gradually, so a sudden traffic drop does
    # not cause an overreaction.
    disable_scale_in = false
  }
}

# Scheduled Scaling: office hours (predictable load)
resource "aws_autoscaling_schedule" "scale_up_morning" {
  scheduled_action_name  = "scale-up-morning"
  autoscaling_group_name = aws_autoscaling_group.app.name
  # Mon-Fri 9 AM UTC+6 = 3 AM UTC
  recurrence       = "0 3 * * MON-FRI"
  desired_capacity = 10
  min_size         = 5
  max_size         = 50
}

resource "aws_autoscaling_schedule" "scale_down_night" {
  scheduled_action_name  = "scale-down-night"
  autoscaling_group_name = aws_autoscaling_group.app.name
  # 11 PM UTC+6 = 5 PM UTC
  recurrence       = "0 17 * * MON-FRI"
  desired_capacity = 2
  min_size         = 2
  max_size         = 10
}

# CloudWatch Alarm: high CPU alert
# (aws_sns_topic.alerts is assumed to be defined elsewhere)
resource "aws_cloudwatch_metric_alarm" "high_cpu" {
  alarm_name          = "${var.project_name}-high-cpu"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 2
  metric_name         = "CPUUtilization"
  namespace           = "AWS/EC2"
  period              = 60
  statistic           = "Average"
  threshold           = 80
  alarm_description   = "EC2 CPU over 80%"

  dimensions = {
    AutoScalingGroupName = aws_autoscaling_group.app.name
  }

  alarm_actions = [aws_sns_topic.alerts.arn] # SNS → PagerDuty / Slack
}
🎯 Cost Monitoring: AWS Cost Explorer
Use AWS Cost Explorer to visualize daily/monthly spend. Look at the per-service breakdown to see which service costs the most.
AWS Budgets: Set a monthly budget. You get an email or SNS alert when spend reaches 80% or 100% of it. This is the easiest way to avoid a surprise bill.
Cost Anomaly Detection: Uses machine learning to detect unusual spend. It alerts you early about a runaway process or misconfiguration.
Quick wins: enable S3 Intelligent-Tiering, delete unused EBS volumes, clean up old snapshots, and optimize NAT Gateway traffic.
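The budget alert logic described above is simple enough to sketch. The functions below are hypothetical helpers, not the AWS Budgets API: they show the 80%/100% threshold check and a naive linear month-end projection, which is an added assumption and far cruder than a real forecast.

```python
def crossed_thresholds(spend_to_date, monthly_budget, thresholds=(0.8, 1.0)):
    """Return the budget thresholds (as fractions) that current spend has reached."""
    ratio = spend_to_date / monthly_budget
    return [t for t in thresholds if ratio >= t]


def projected_month_end(spend_to_date, day_of_month, days_in_month):
    """Naive linear projection: assume the daily run rate stays constant."""
    return round(spend_to_date / day_of_month * days_in_month, 2)


if __name__ == "__main__":
    budget = 1000.0
    spend = 850.0
    print(crossed_thresholds(spend, budget))    # which alerts would fire
    print(projected_month_end(300.0, 10, 30))   # run-rate projection
```

Projecting the run rate early in the month gives a warning well before the 80% threshold is actually reached.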
Cloud-Native Design Patterns
Cloud-native application design has a set of proven patterns that improve resilience, scalability, and maintainability.
Circuit Breaker Pattern
When a downstream service fails, retrying over and over causes cascading failure. A circuit breaker tracks the failure count. Once a threshold is crossed, it "opens" the circuit and returns a fallback immediately. After a timeout, it enters a "half-open" state and lets a trial request through. If the service has recovered, the circuit "closes" again.
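The three states can be sketched as a small class. This is a minimal single-threaded illustration, assuming consecutive-failure counting and a fixed reset timeout; `CircuitBreaker` and its parameters are hypothetical names, and a production implementation would add locking and metrics.

```python
import time


class CircuitBreaker:
    """Closed -> open after N consecutive failures -> half-open after a timeout."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock          # injectable so tests can control time
        self._failures = 0
        self._opened_at = None       # None means the circuit is closed

    @property
    def state(self):
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self.reset_timeout:
            return "half-open"
        return "open"

    def call(self, func, fallback):
        if self.state == "open":
            return fallback()        # fail fast, do not touch the downstream
        try:
            result = func()          # normal call, or the half-open trial call
        except Exception:
            self._failures += 1
            # A failed trial, or too many failures while closed, (re)opens the circuit.
            if self.state == "half-open" or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            return fallback()
        # Success closes the circuit and resets the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

Usage would look like `breaker.call(lambda: fetch_price(item), lambda: cached_price)`, where both callables are placeholders for your own downstream call and fallback.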
Retry with Exponential Backoff
Retry on transient failures (network hiccups, throttling). But an immediate retry can overwhelm the service again. Use exponential backoff (1s, 2s, 4s, 8s, ...) plus jitter (a random delay) so that all clients do not retry at the same moment.
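A minimal sketch of this retry loop follows. The function name, parameters, and defaults are assumptions chosen for illustration; it retries on any exception for brevity, whereas real code should retry only on errors known to be transient.

```python
import random
import time


def retry_with_backoff(func, max_retries=4, base_delay=1.0, max_delay=30.0,
                       jitter=0.5, sleep=time.sleep):
    """Call func, retrying with delays of 1s, 2s, 4s, ... plus random jitter."""
    for attempt in range(max_retries + 1):
        try:
            return func()
        except Exception:
            if attempt == max_retries:
                raise  # out of retries: surface the last error
            # Exponential delay, capped so it cannot grow without bound.
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Jitter spreads clients out so they do not retry in lockstep.
            sleep(delay + random.uniform(0, jitter))
```

The `sleep` parameter is injectable so the function can be tested without actually waiting. Some implementations randomize the whole delay ("full jitter") instead of adding a small random offset; that is a variation on the same idea.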
Bulkhead Pattern
Just as a ship's bulkheads isolate a hull breach, software bulkheads isolate resource pools. Keep the payment service's thread pool separate so that a slow user service does not affect payments. On AWS: separate Lambda concurrency limits, separate SQS queues.
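One way to sketch a bulkhead in application code is a per-dependency concurrency limit. The `Bulkhead` class below is a hypothetical illustration using a semaphore; it rejects work when a compartment is full rather than queueing it, which is one of several reasonable policies.

```python
import threading


class BulkheadFullError(Exception):
    """Raised when a compartment has no free slots."""


class Bulkhead:
    """Caps concurrent calls to one dependency so it cannot exhaust shared capacity."""

    def __init__(self, name, max_concurrent):
        self.name = name
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def run(self, func, *args, **kwargs):
        # Non-blocking acquire: reject immediately instead of piling up waiters.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFullError(f"{self.name} bulkhead is full")
        try:
            return func(*args, **kwargs)
        finally:
            self._slots.release()


# Separate compartments: a saturated user service cannot starve payments.
payment_bulkhead = Bulkhead("payment", max_concurrent=10)
user_bulkhead = Bulkhead("user", max_concurrent=5)
```

Each service call goes through its own compartment, for example `payment_bulkhead.run(charge_card, order)`, where `charge_card` stands in for your own function.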
Sidecar Pattern (Service Mesh)
A sidecar proxy container (Envoy) runs alongside the application container. Service discovery, load balancing, mTLS, and circuit breaking are all handled by the sidecar, so the app code stays clean. Istio uses this pattern on Kubernetes.
12-Factor App Principles (Cloud-Native Foundation)
1. Codebase
One codebase, tracked in VCS
2. Dependencies
Explicitly declare & isolate
3. Config
Store in environment (not code)
4. Backing Services
Treat as attached resources
5. Build/Release/Run
Strictly separate stages
6. Processes
Stateless, share-nothing
7. Port Binding
Export services via port
8. Concurrency
Scale via process model
9. Disposability
Fast startup, graceful shutdown
10. Dev/Prod Parity
Keep environments similar
11. Logs
Treat as event streams
12. Admin Processes
Run as one-off processes
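Factor 3 (Config) is the one most often shown in code. The sketch below reads settings from environment variables instead of hard-coding them; the `Settings` fields, variable names, and defaults are assumptions for illustration.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    db_host: str
    db_port: int
    debug: bool


def load_settings(env=None):
    """Build settings from environment variables (factor 3: config in the environment)."""
    env = os.environ if env is None else env
    return Settings(
        db_host=env["DB_HOST"],                        # required: fail fast if missing
        db_port=int(env.get("DB_PORT", "5432")),       # optional, with a default
        debug=env.get("DEBUG", "false").lower() == "true",
    )
```

The same build artifact can then run in staging and production with only the environment changed, which also supports factor 10 (Dev/Prod Parity).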
# Application + Envoy sidecar in a Kubernetes Pod
apiVersion: v1
kind: Pod
metadata:
  name: payment-service
  labels:
    app: payment
    version: v1
spec:
  containers:
    # --- Main application container ---
    - name: payment-app
      image: myregistry/payment-service:latest
      ports:
        - containerPort: 8080
      env:
        - name: DB_HOST
          valueFrom:
            secretKeyRef:
              name: payment-secrets
              key: db-host
      # App code contains only business logic, no networking concerns

    # --- Envoy sidecar proxy (service mesh) ---
    - name: envoy-proxy
      image: envoyproxy/envoy:v1.28
      ports:
        - containerPort: 9901  # Admin port
        - containerPort: 15001 # Outbound proxy
        - containerPort: 15006 # Inbound proxy
      # Sidecar responsibilities:
      #   ✓ mTLS (mutual TLS): service-to-service encryption
      #   ✓ Circuit breaking: downstream failure isolation
      #   ✓ Retry + timeout: configurable per route
      #   ✓ Load balancing: across upstream pods
      #   ✓ Observability: metrics and traces automatically
      volumeMounts:
        - name: envoy-config
          mountPath: /etc/envoy
  volumes:
    - name: envoy-config
      configMap:
        name: envoy-config

# With sidecar injection enabled for the namespace, Istio adds this sidecar to your pods automatically.
# No changes to your app code are required.
Cloud Architecture Interview Tips
When cloud comes up in a system design interview, choosing the right service matters. Being able to state clearly which AWS service fits which situation will impress the interviewer.
| Scenario | Choose This | Why? |
|---|---|---|
| Short, event-driven function (< 15 min) | Lambda | Serverless, pay-per-use, zero ops |
| Long-running containerized service | ECS / EKS | Docker containers, managed orchestration |
| Traditional VM, full OS control | EC2 | IaaS, any workload, legacy app |
| Managed SQL database | RDS / Aurora | PostgreSQL/MySQL managed, Multi-AZ |
| Key-value NoSQL, massive scale | DynamoDB | Serverless DB, single-digit ms, unlimited scale |
| In-memory caching layer | ElastiCache (Redis) | Microsecond reads, session store, rate limiting |
| Async task queue / decoupling | SQS | At-least-once delivery, decouple services |
| Pub/sub fan-out notification | SNS | Topic → multiple subscribers simultaneously |
| Static files / media / backups | S3 | Unlimited object storage, 99.999999999% durability |
| Global CDN / edge caching | CloudFront | 450+ PoP worldwide, reduces origin load |
Common Cloud Architecture Patterns — Interview Scenarios
Scenario: E-commerce checkout surge
SQS queue + Lambda consumer. Checkout request → SQS. Lambda processes messages asynchronously. The queue absorbs burst traffic, so the database is not overwhelmed.
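A consumer for this scenario might look like the sketch below. It assumes the SQS event source mapping has partial batch responses (ReportBatchItemFailures) enabled, so only failed messages are retried; `process_order` and the message shape are hypothetical placeholders for real checkout logic.

```python
import json


def process_order(order):
    """Hypothetical business logic: validate and handle one checkout message."""
    if "order_id" not in order:
        raise ValueError("missing order_id")
    return order["order_id"]


def lambda_handler(event, context):
    """Process a batch of SQS messages and report only the ones that failed."""
    failures = []
    for record in event.get("Records", []):
        try:
            process_order(json.loads(record["body"]))
        except Exception:
            # Returning the messageId makes SQS redeliver just this message.
            failures.append({"itemIdentifier": record["messageId"]})
    return {"batchItemFailures": failures}
```

Because SQS delivery is at-least-once, the processing step should be idempotent, for example by keying writes on the order ID.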
Scenario: Image upload + resize
S3 PUT → S3 Event Notification → Lambda trigger. Lambda resizes the image and saves it under the processed/ prefix. Serve via CloudFront.
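The event-handling part of this pipeline can be sketched without any image library. The code below only parses the S3 notification and computes the destination key; the actual resize and upload are left out, and `extract_uploads` and `processed_key` are hypothetical helper names.

```python
import posixpath
from urllib.parse import unquote_plus


def processed_key(source_key, prefix="processed/"):
    """Destination key for the resized image, under the processed/ prefix."""
    return prefix + posixpath.basename(source_key)


def extract_uploads(event):
    """Return (bucket, source_key, destination_key) for each uploaded object."""
    uploads = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 events (spaces become '+').
        key = unquote_plus(record["s3"]["object"]["key"])
        if key.startswith("processed/"):
            # Skip our own output, otherwise the function would trigger itself.
            continue
        uploads.append((bucket, key, processed_key(key)))
    return uploads
```

Skipping the processed/ prefix in code is a safety net; scoping the S3 notification to an uploads/ prefix, or writing to a separate bucket, avoids the recursive trigger at the configuration level.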
Scenario: Microservices communication
Synchronous: API Gateway + Lambda/ECS. Asynchronous: EventBridge/SNS fan-out. Service mesh: ECS + App Mesh (Envoy).
Scenario: Multi-region disaster recovery
Route53 health check + failover routing. Primary region down → automatic DNS failover to secondary. Promote the RDS read replica. RPO: minutes, RTO: minutes.
💡 How to Talk About Cloud in a System Design Interview
1) Justify your service choice: Do not just say "I'll use S3". Say "I'll use S3 for object storage because of its virtually unlimited scale, 11 nines of durability, and CloudFront integration."
2) Managed vs self-managed trade-off: RDS vs self-hosted PostgreSQL on EC2. RDS costs more but gives you automated backups, Multi-AZ, and patching. With a small ops team, RDS is worth it.
3) Show cost awareness: Discuss On-Demand vs Reserved instances and explain when Spot instances are viable. A cost-conscious architect is valuable.
4) Mention the security layer: IAM roles for services, VPC private subnets for databases, KMS encryption at rest, WAF for API protection.
5) Address vendor lock-in concerns: The interviewer may ask about vendor lock-in. Explain that abstraction layers (Terraform, Kubernetes) make migration easier, although provider-specific resources still need to be rewritten.
SUMMARY: What We Learned Today
| Topic | Key Concept | AWS Service | Interview Point |
|---|---|---|---|
| Cloud Models | IaaS / PaaS / SaaS: control vs abstraction trade-off | EC2 / Beanstalk / Gmail | Be able to explain each model correctly |
| Serverless | Event-driven, pay-per-use, no server management | Lambda + API Gateway + DynamoDB | Know the cold start limitation |
| IaC | Infrastructure = code, reproducible, version controlled | Terraform + S3 state backend | Know the terraform plan/apply workflow |
| Multi-Region | Active-Active vs Active-Passive, data sovereignty | Route53 + RDS Replica + S3 Replication | Be able to discuss the consistency challenge |
| Cost | On-Demand vs Reserved vs Spot instances | Cost Explorer + Budgets + Auto Scaling | Spot instance use case: batch/ML |
| Patterns | Circuit Breaker, Bulkhead, Sidecar, 12-Factor | SQS + Lambda + Envoy/Istio | Discuss resilience patterns confidently |