Observability: Logging, Metrics, Tracing
When Something Breaks in Production... a 3 AM Alert
Imagine it is 3 AM. Suddenly a PagerDuty alert fires: payment service down. Users cannot complete transactions. You are in one of two situations:
❌ Without Observability
- →You don't know which service failed
- →SSH in and grep the logs manually
- →Hours go by just figuring out which machine has the problem
- →"But it works on my machine!"
- →MTTR (Mean Time To Recovery) = hours
- →Customers have already left
✅ With Observability
- →The dashboard shows: payment-svc error rate 45%
- →Find the exact failed request by trace ID
- →DB connection pool exhausted, stated clearly in the logs
- →Grafana shows exactly when the latency spike started
- →Root cause in 5 minutes: a slow DB query
- →MTTR = minutes, low customer impact
📌 What is Observability?
Observability means understanding a system's internal state from its external outputs. Monitoring tells you "what is happening"; observability tells you "why it is happening". If you can answer any previously unknown question about the system from its data, that system is observable. Achieving this takes 3 pillars: Metrics, Logs, and Traces.
The 3 Pillars of Observability: Metrics, Logs, Traces
The Full Observability Flow of a Single Request
📊 Metrics
- →Aggregated numbers
- →CPU, memory, request count
- →Time-series data
- →Tool: Prometheus
- →Alert thresholds can be set
📝 Logs
- →Event-by-event records
- →Error messages, stack traces
- →Structured JSON format
- →Tool: ELK Stack
- →Full context per event
🔍 Traces
- →Request journey tracking
- →Service A → B → C latency
- →Distributed span tree
- →Tool: Jaeger / Zipkin
- →Root cause analysis
| Feature | Metrics | Logs | Traces |
|---|---|---|---|
| What is it? | Numeric aggregates | Text event records | Request flow map |
| Tool | Prometheus + Grafana | ELK Stack / Loki | Jaeger / Zipkin |
| Granularity | Low (aggregate) | Medium (per event) | High (per request) |
| Storage cost | Low | High | Medium |
| Use case | Alerting, dashboards | Debugging errors | Latency analysis |
| Query | PromQL | KQL / Lucene | Trace ID search |
Prometheus: Metrics Collection and PromQL
Prometheus is an open-source metrics system. It uses a pull model: it scrapes data itself from each service's /metrics endpoint. AlertManager then handles alerting and Grafana handles visualization.
Prometheus Architecture — Pull Model
| Metric Type | Behavior | Example | When to use |
|---|---|---|---|
| Counter | Only goes up, never down | http_requests_total | Request count, error count |
| Gauge | Goes both up and down | memory_usage_bytes | CPU, memory, active connections |
| Histogram | Bucketed observations | http_request_duration_seconds | Latency percentiles (P99) |
| Summary | Pre-calculated quantiles | rpc_duration_seconds | Client-side percentile calculation |
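To make the difference between the metric types concrete, here is a minimal sketch in plain Python. It is not the Prometheus client library: the class names `Counter`, `Gauge`, and `Histogram` and their methods are illustrative stand-ins that only model the semantics from the table, in particular the cumulative buckets a histogram keeps.

```python
import bisect
from dataclasses import dataclass, field


@dataclass
class Counter:
    """Monotonic value: it can only increase."""
    value: float = 0.0

    def inc(self, amount: float = 1.0) -> None:
        if amount < 0:
            raise ValueError("a counter can only increase")
        self.value += amount


@dataclass
class Gauge:
    """Point-in-time value that may go up or down."""
    value: float = 0.0

    def set(self, value: float) -> None:
        self.value = value


@dataclass
class Histogram:
    """Counts observations into cumulative buckets keyed by upper bound."""
    bounds: list[float]                       # sorted upper bounds, +Inf excluded
    counts: list[int] = field(init=False)     # one slot per bound, plus +Inf
    total: float = 0.0                        # sum of all observed values

    def __post_init__(self) -> None:
        self.counts = [0] * (len(self.bounds) + 1)

    def observe(self, value: float) -> None:
        self.total += value
        # Index of the first bucket whose upper bound is >= value.
        first = bisect.bisect_left(self.bounds, value)
        # Buckets are cumulative, so every larger bucket is incremented too.
        for i in range(first, len(self.counts)):
            self.counts[i] += 1


http_requests = Counter()
http_requests.inc()
http_requests.inc()

latency = Histogram(bounds=[0.1, 0.5, 1.0])
for seconds in (0.05, 0.3, 0.3, 2.0):
    latency.observe(seconds)

print(http_requests.value)   # 2.0
print(latency.counts)        # [1, 3, 3, 4]
```

The last slot of `counts` plays the role of the `+Inf` bucket, which always equals the total number of observations.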
# Error rate: percentage of requests that failed over the last 5 minutes
# (sum() both sides; otherwise the division only matches series with identical labels)
sum(rate(http_requests_total{status=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m])) * 100
# P99 latency: 99% of requests complete within this time
histogram_quantile(0.99,
  rate(http_request_duration_seconds_bucket[5m])
)
# Request rate by service: RPS for each service
sum(rate(http_requests_total[1m])) by (service)
# Active connections per pod
sum(tcp_connections_active) by (pod)
# Memory usage percentage
(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes)
/ node_memory_MemTotal_bytes * 100
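The P99 query above relies on `histogram_quantile`, which estimates a percentile from bucket counts rather than from raw samples. The sketch below shows the idea in plain Python: find the bucket that contains the target rank, then interpolate linearly inside it. The function name mirrors the PromQL one for readability, but this is a simplified illustration, not Prometheus's actual implementation, and the bucket values are made up.

```python
import math


def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    The largest bucket must have upper bound +Inf, as Prometheus histograms do.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    buckets = sorted(buckets)
    if len(buckets) < 2 or not math.isinf(buckets[-1][0]):
        return math.nan
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    # First bucket whose cumulative count reaches the rank.
    idx = next(i for i, (_, count) in enumerate(buckets) if count >= rank)
    if idx == len(buckets) - 1:
        # The rank falls into the +Inf bucket: report the highest finite bound.
        return buckets[-2][0]
    upper, count = buckets[idx]
    lower, prev_count = (0.0, 0.0) if idx == 0 else buckets[idx - 1]
    # Assume observations are spread evenly inside the bucket.
    return lower + (upper - lower) * (rank - prev_count) / (count - prev_count)


# Hypothetical latency buckets in seconds: (le, cumulative count)
example = [(0.1, 50), (0.5, 90), (1.0, 99), (math.inf, 100)]
print(histogram_quantile(0.95, example))
```

Because of the interpolation, the result is only as precise as the bucket boundaries; choosing buckets close to the latencies you care about matters more than the exact formula.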
# ───────────── Alert Rule Example ─────────────
# prometheus_rules.yml
groups:
  - name: payment-alerts
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{service="payment",status=~"5.."}[5m]))
          /
          sum(rate(http_requests_total{service="payment"}[5m])) > 0.05
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Payment service error rate > 5%"
          description: "Error rate: {{ $value | humanizePercentage }}"
Grafana: Dashboards and SLI/SLO/SLA
Grafana is an open-source visualization platform. It does not store data itself; instead it connects to 50+ data sources, including Prometheus, Loki, Tempo, and Elasticsearch, and builds dashboards on top of them.
📊 Grafana Data Sources
- →Prometheus → Metrics dashboards
- →Loki → Log visualization
- →Tempo / Jaeger → Trace viewer
- →Elasticsearch → Log search
- →MySQL / PostgreSQL → DB metrics
- →CloudWatch → AWS metrics
🎛️ Dashboard Panel Types
- →Time series → latency trends
- →Stat → current error rate %
- →Gauge → CPU / memory fill
- →Table → per-service summary
- →Heatmap → request distribution
- →Alert list → active incidents
{
  "title": "Payment Service: SLO Dashboard",
  "panels": [
    {
      "title": "Error Rate (5m)",
      "type": "stat",
      "datasource": "Prometheus",
      "targets": [
        {
          "expr": "sum(rate(http_requests_total{service='payment',status=~'5..'}[5m])) / sum(rate(http_requests_total{service='payment'}[5m])) * 100",
          "legendFormat": "Error %"
        }
      ],
      "thresholds": {
        "steps": [
          { "color": "green", "value": null },
          { "color": "yellow", "value": 1 },
          { "color": "red", "value": 5 }
        ]
      }
    },
    {
      "title": "P99 Latency",
      "type": "timeseries",
      "datasource": "Prometheus",
      "targets": [
        {
          "expr": "histogram_quantile(0.99, rate(http_request_duration_seconds_bucket{service='payment'}[5m]))",
          "legendFormat": "P99"
        },
        {
          "expr": "histogram_quantile(0.95, rate(http_request_duration_seconds_bucket{service='payment'}[5m]))",
          "legendFormat": "P95"
        }
      ]
    },
    {
      "title": "Request Rate (RPS)",
      "type": "timeseries",
      "targets": [
        {
          "expr": "sum(rate(http_requests_total{service='payment'}[1m])) by (status_code)",
          "legendFormat": "{{ status_code }}"
        }
      ]
    }
  ]
}
💡 SLI, SLO, SLA: What Is the Difference?
SLI (Service Level Indicator): the actual measured metric. Example: the payment service's current availability = 99.87%.
SLO (Service Level Objective): an internal target. Example: we want to maintain 99.9% availability. This is the team's goal.
SLA (Service Level Agreement): an external contract with customers. Example: we guarantee 99.5%. Missing it means refunds or penalties. Keep the SLO stricter than the SLA (SLO > SLA) so there is a buffer.
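The relationship between the three terms can be sketched with the numbers used above. This is an illustrative calculation only; the `evaluate` function, the `AvailabilityReport` type, and the request counts are assumptions, not part of any monitoring tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AvailabilityReport:
    sli: float        # measured availability as a fraction between 0 and 1
    meets_slo: bool   # internal target reached?
    meets_sla: bool   # contractual guarantee reached?


def evaluate(good_requests: int, total_requests: int,
             slo: float = 0.999, sla: float = 0.995) -> AvailabilityReport:
    """Compute the SLI from request counts and compare it to both targets."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    if slo < sla:
        raise ValueError("the SLO should be at least as strict as the SLA")
    sli = good_requests / total_requests
    return AvailabilityReport(sli, sli >= slo, sli >= sla)


# Hypothetical month: 99,870 successful requests out of 100,000 (SLI = 99.87%)
report = evaluate(good_requests=99_870, total_requests=100_000)
print(report)
```

With these numbers the team has missed its own 99.9% SLO while the 99.5% SLA still holds, which is exactly the early warning the buffer is meant to provide.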
Distributed Tracing: Tracking the Request Journey with Jaeger
In microservices, a single request may pass through 10 services. How much time was spent in each one? Distributed tracing answers that question. Each request produces one trace, and each service call within it is a span.
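As a rough model of that trace and span relationship, the sketch below represents a trace as a flat list of spans linked by parent IDs and computes each span's self time (its duration minus its direct children) to find the bottleneck. The `Span` type, the span names, and the timings are hypothetical, and the calculation assumes children run sequentially inside their parent.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Span:
    span_id: str
    parent_id: Optional[str]   # None marks the root span
    name: str
    start_ms: int
    end_ms: int

    @property
    def duration_ms(self) -> int:
        return self.end_ms - self.start_ms


def self_times(spans: list[Span]) -> dict[str, int]:
    """Time spent in each span itself, excluding its direct children."""
    result = {s.span_id: s.duration_ms for s in spans}
    for s in spans:
        if s.parent_id in result:
            result[s.parent_id] -= s.duration_ms
    return result


def slowest_span(spans: list[Span]) -> Span:
    """The span with the largest self time, i.e. the likely bottleneck."""
    times = self_times(spans)
    return max(spans, key=lambda s: times[s.span_id])


trace_spans = [
    Span("a", None, "POST /order", 0, 500),
    Span("b", "a", "validate-inventory", 10, 60),
    Span("c", "a", "call-payment-service", 70, 480),
    Span("d", "c", "db-query", 100, 450),
]
print(slowest_span(trace_spans).name)   # db-query
```

A trace viewer such as Jaeger draws the same data as a waterfall; the self-time calculation is simply a way to read the bottleneck off that picture programmatically.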
Trace Waterfall — trace_id: abc-xyz-789
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
# Note: the Jaeger Thrift exporter is deprecated in recent OpenTelemetry Python
# releases; newer setups typically send OTLP to Jaeger instead.
from opentelemetry.exporter.jaeger.thrift import JaegerExporter
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
from fastapi import FastAPI, Request
import httpx
# ──── Setup: Jaeger Exporter ────
jaeger_exporter = JaegerExporter(
    agent_host_name="jaeger",  # Docker service name
    agent_port=6831,
)
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(jaeger_exporter))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("order-service")
app = FastAPI()
FastAPIInstrumentor.instrument_app(app)  # Auto-instrument all routes
# ──── Manual Span: Custom business logic ────
@app.post("/order")
async def create_order(request: Request, item_id: str, quantity: int):
    # FastAPIInstrumentor has already started a server span for this request
    with tracer.start_as_current_span("validate-inventory") as span:
        span.set_attribute("item_id", item_id)
        span.set_attribute("quantity", quantity)
        # Check inventory in a nested span
        with tracer.start_as_current_span("db-query-inventory"):
            # check_inventory() is assumed to be defined elsewhere
            available = await check_inventory(item_id)
        if available < quantity:
            span.set_status(trace.StatusCode.ERROR, "insufficient stock")
            return {"error": "out of stock"}
    # Call Payment Service: the trace context travels in an HTTP header
    with tracer.start_as_current_span("call-payment-service") as pay_span:
        pay_span.set_attribute("amount", quantity * 100)
        async with httpx.AsyncClient() as client:
            # The traceparent header is injected manually here; the httpx
            # instrumentation package could do it automatically if installed
            response = await client.post(
                "http://payment-svc/charge",
                json={"item_id": item_id, "quantity": quantity},
                headers=get_trace_headers(),  # Propagate context
            )
            response.raise_for_status()  # Do not confirm the order if the charge failed
    return {"order_id": "ord-123", "status": "confirmed"}
# ──── Trace Context Propagation ────
def get_trace_headers() -> dict:
    """Extract current trace context as HTTP headers for downstream services"""
    from opentelemetry.propagate import inject
    headers = {}
    inject(headers)  # Adds 'traceparent' header: 00-{trace_id}-{span_id}-01
    return headers
# traceparent example (trace_id is 32 hex chars, span_id is 16 hex chars):
# traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
#              ver trace_id                         span_id          flags
📌 What is OpenTelemetry?
OpenTelemetry (OTel) is a vendor-neutral observability framework and a CNCF project. Instrument once, then export to any backend: Jaeger, Zipkin, Datadog, New Relic. The trace ID propagates in the traceparent HTTP header. Each downstream service reads this header and attaches its own spans to the parent span.
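What a downstream service does with that header can be sketched without any OpenTelemetry dependency. The code below parses a version-00 W3C `traceparent` value and builds the header for the next hop, keeping the trace ID and generating a fresh span ID. The helper names `parse_traceparent` and `child_traceparent` are illustrative; in a real service the OpenTelemetry propagators do this work.

```python
import re
import secrets
from typing import NamedTuple, Optional

# version "00", 32 hex trace-id, 16 hex parent span-id, 2 hex flags
_TRACEPARENT = re.compile(r"00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")


class TraceContext(NamedTuple):
    trace_id: str   # shared by every span in the trace
    span_id: str    # identifies the calling (parent) span
    sampled: bool   # lowest bit of the flags byte


def parse_traceparent(header: str) -> Optional[TraceContext]:
    """Return the parsed context, or None if the header is malformed."""
    match = _TRACEPARENT.fullmatch(header.strip())
    if not match:
        return None
    trace_id, span_id, flags = match.groups()
    if set(trace_id) == {"0"} or set(span_id) == {"0"}:
        return None   # all-zero IDs are invalid
    return TraceContext(trace_id, span_id, bool(int(flags, 16) & 0x01))


def child_traceparent(parent: TraceContext) -> str:
    """Header for an outgoing call: same trace_id, new span_id."""
    new_span_id = secrets.token_hex(8)   # 8 random bytes = 16 hex chars
    flags = "01" if parent.sampled else "00"
    return f"00-{parent.trace_id}-{new_span_id}-{flags}"


incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
ctx = parse_traceparent(incoming)
if ctx is not None:
    print(child_traceparent(ctx))
```

Because every hop reuses the same `trace_id`, the tracing backend can stitch spans from different services back into one trace.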
ELK Stack: Log Management with Elasticsearch, Logstash, Kibana
When logs from hundreds of services sit on separate machines, debugging is a nightmare. The ELK Stack brings all logs into one place and makes them searchable. Elasticsearch + Logstash + Kibana + Beats (Filebeat) together form a complete log pipeline.
ELK Log Pipeline: From the App to Kibana
import json
from contextvars import ContextVar
from datetime import datetime, timezone
# Correlation ID: tracks a request across its whole journey
_correlation_id: ContextVar[str] = ContextVar('correlation_id', default='')
class StructuredLogger:
    def __init__(self, service_name: str):
        self.service = service_name
    def _log(self, level: str, message: str, **extra):
        now = datetime.now(timezone.utc)
        record = {
            # Real millisecond precision in UTC, e.g. 2026-05-08T03:15:22.000Z
            "timestamp": now.isoformat(timespec="milliseconds").replace("+00:00", "Z"),
            "level": level,
            "service": self.service,
            "correlation_id": _correlation_id.get(),  # Track across services
            "message": message,
            **extra  # Additional context fields
        }
        print(json.dumps(record))  # ELK collects stdout
    def info(self, msg: str, **kw): self._log("INFO", msg, **kw)
    def error(self, msg: str, **kw): self._log("ERROR", msg, **kw)
    def warn(self, msg: str, **kw): self._log("WARN", msg, **kw)
# Usage example
logger = StructuredLogger("payment-service")
def process_payment(user_id: str, amount: float, request_id: str):
    _correlation_id.set(request_id)  # Set for this request lifecycle
    logger.info("payment_started",
        user_id=user_id,
        amount=amount,
        currency="BDT"
    )
    try:
        # charge_card() is assumed to be defined elsewhere
        result = charge_card(user_id, amount)
        logger.info("payment_success",
            user_id=user_id,
            amount=amount,
            transaction_id=result.txn_id,
            duration_ms=result.duration
        )
        return result
    except Exception as e:
        logger.error("payment_failed",
            user_id=user_id,
            amount=amount,
            error=str(e),
            error_type=type(e).__name__
        )
        raise
# Output (ELK ingests this JSON):
# {"timestamp":"2026-05-08T03:15:22.000Z","level":"ERROR",
# "service":"payment-service","correlation_id":"req-abc-123",
# "message":"payment_failed","user_id":"u-456","amount":1500.0,
# "error":"DB connection timeout","error_type":"TimeoutError"}
⚠️ Log Cardinality and Log Levels
The log cardinality problem: user ID, order ID, and trace ID are fine as log fields. But turning an IP address or a timestamp into a field key causes a mapping explosion in Elasticsearch. Keep high-cardinality data in field values, not in field keys.
Log levels: keep production at INFO or WARN. Leaving DEBUG logs enabled in production can increase storage cost by as much as 10x. Use a sampling strategy.
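One possible sampling strategy, sketched with the standard `logging` module: keep every record at WARNING or above and let through only one in every N lower-level records. The `EveryNthFilter` class is a hypothetical example, not a built-in; it uses a simple counter, so it is deterministic but not tuned for multi-threaded accuracy.

```python
import logging


class EveryNthFilter(logging.Filter):
    """Pass all records at or above keep_level; below it, pass 1 in every n."""

    def __init__(self, n: int, keep_level: int = logging.WARNING) -> None:
        super().__init__()
        if n < 1:
            raise ValueError("n must be at least 1")
        self.n = n
        self.keep_level = keep_level
        self._seen = 0   # number of low-level records seen so far

    def filter(self, record: logging.LogRecord) -> bool:
        if record.levelno >= self.keep_level:
            return True              # never drop warnings or errors
        self._seen += 1
        return (self._seen - 1) % self.n == 0   # 1st, (n+1)th, (2n+1)th, ...


handler = logging.StreamHandler()
handler.addFilter(EveryNthFilter(n=10))

sampled_logger = logging.getLogger("payment-service.sampled")
sampled_logger.setLevel(logging.DEBUG)
sampled_logger.addHandler(handler)

for i in range(30):
    sampled_logger.debug("cache lookup %d", i)     # only 3 of these are emitted
sampled_logger.error("payment_failed")             # always emitted
```

Attaching the filter to the handler rather than the logger keeps the sampling decision next to the output destination, so a second handler (for example a local debug file) could still receive everything.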
Production Code: Metrics + Logs + Traces Together
In real production code the three pillars live side by side. The example below integrates Prometheus metrics, structured logging, and OpenTelemetry traces into the same request handler of a Node.js service.
const express = require('express');
const { trace, SpanStatusCode } = require('@opentelemetry/api');
const promClient = require('prom-client');
const app = express();
app.use(express.json()); // Parse JSON bodies so req.body is populated
// ─── 1. METRICS: Prometheus ───
const register = new promClient.Registry();
promClient.collectDefaultMetrics({ register });
const httpRequestCounter = new promClient.Counter({
  name: 'http_requests_total',
  help: 'Total HTTP requests',
  labelNames: ['method', 'route', 'status_code'],
  registers: [register],
});
const httpLatency = new promClient.Histogram({
  name: 'http_request_duration_seconds',
  help: 'HTTP request latency in seconds',
  labelNames: ['method', 'route'],
  buckets: [0.01, 0.05, 0.1, 0.3, 0.5, 1, 2, 5],
  registers: [register],
});
// ─── 2. STRUCTURED LOGGING ───
function log(level, message, fields = {}) {
  const span = trace.getActiveSpan();
  const spanCtx = span?.spanContext();
  const entry = {
    timestamp: new Date().toISOString(),
    level,
    service: 'order-service',
    message,
    trace_id: spanCtx?.traceId || '', // Link log to trace
    span_id: spanCtx?.spanId || '',
    ...fields,
  };
  console.log(JSON.stringify(entry)); // ELK / Loki ingests stdout
}
// ─── 3. TRACES: OpenTelemetry ───
// Assumes the OpenTelemetry SDK (tracer provider + exporter) is registered at
// startup; without it the API hands back no-op spans.
const tracer = trace.getTracer('order-service', '1.0.0');
// ─── Request Handler ───
// checkInventory() and db are assumed to be defined elsewhere in the service.
app.post('/orders', async (req, res) => {
  const startTime = Date.now();
  const { userId, items } = req.body;
  // Span: entire request
  return tracer.startActiveSpan('create-order', async (span) => {
    try {
      span.setAttributes({
        'user.id': userId,
        'order.item_count': items.length,
      });
      log('INFO', 'order_request_received', {
        user_id: userId,
        item_count: items.length,
      });
      // Nested span: validate
      const inventory = await tracer.startActiveSpan('validate-inventory',
        async (invSpan) => {
          try {
            const result = await checkInventory(items);
            invSpan.setAttributes({ 'inventory.available': result.available });
            return result;
          } finally {
            invSpan.end(); // End the span even if checkInventory throws
          }
        }
      );
      if (!inventory.available) {
        span.setStatus({ code: SpanStatusCode.ERROR, message: 'out_of_stock' });
        log('WARN', 'order_rejected', { reason: 'out_of_stock', user_id: userId });
        httpRequestCounter.inc({ method: 'POST', route: '/orders', status_code: 400 });
        res.status(400).json({ error: 'out_of_stock' });
        return;
      }
      // Nested span: save to DB
      const order = await tracer.startActiveSpan('save-order-db',
        async (dbSpan) => {
          try {
            const saved = await db.save({ userId, items, status: 'pending' });
            dbSpan.setAttribute('db.order_id', saved.id);
            return saved;
          } finally {
            dbSpan.end(); // End the span even if the save throws
          }
        }
      );
      const durationMs = Date.now() - startTime;
      log('INFO', 'order_created', {
        user_id: userId,
        order_id: order.id,
        duration_ms: durationMs,
      });
      // Record metrics
      httpRequestCounter.inc({ method: 'POST', route: '/orders', status_code: 201 });
      httpLatency.observe({ method: 'POST', route: '/orders' }, durationMs / 1000);
      span.setStatus({ code: SpanStatusCode.OK });
      res.status(201).json({ orderId: order.id });
    } catch (err) {
      span.recordException(err);
      span.setStatus({ code: SpanStatusCode.ERROR, message: err.message });
      log('ERROR', 'order_creation_failed', {
        user_id: userId,
        error: err.message,
        stack: err.stack,
      });
      httpRequestCounter.inc({ method: 'POST', route: '/orders', status_code: 500 });
      res.status(500).json({ error: 'internal_error' });
    } finally {
      span.end();
    }
  });
});
// ─── Prometheus scrape endpoint ───
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', register.contentType);
  res.end(await register.metrics());
});
app.listen(3000);
💡 Correlation ID: The Key to Connecting the 3 Pillars
Including trace_id in your logs lets you jump straight from a log line in Kibana to its trace in Jaeger. When you see a latency spike in Grafana, look up the traces from that exact timestamp to see which span took the longest. This is the real power of observability.
Real World Use Cases and Interview Tips
| Company | Metrics | Logs | Tracing | Notable |
|---|---|---|---|---|
| Netflix | Atlas (custom) | Elasticsearch | Zipkin | Chaos engineering + observability integrated |
| Uber | M3 (custom) | ELK Stack | Jaeger (created by Uber) | Uber open-sourced Jaeger |
| Google | Monarch | Cloud Logging | Dapper (invented tracing) | The Dapper paper is the origin of distributed tracing |
| Facebook (Meta) | ODS | Scribe | Canopy | Billions of events/sec log processing |
| Airbnb | Prometheus | ELK | Jaeger | Standard OSS stack, heavily customized |
Interview Question: "How would you monitor a system?"
Start by saying: "I would approach it through the 3 pillars of observability: Metrics, Logs, Traces." Then explain each layer. This structure signals to the interviewer that you have production experience.
Metrics first: know "what is happening"
Prometheus + Grafana. Mention the key metrics: error rate, P99 latency, RPS, CPU/memory. Name the four Golden Signals from the Google SRE book: Latency, Traffic, Errors, Saturation (LETS).
Use logs to understand "why it is happening"
ELK Stack. Explain why structured logging (JSON) matters. Mention tracing a request with a correlation ID.
Use traces to pinpoint "where it is happening"
Jaeger + OpenTelemetry. In a distributed system, the trace waterfall shows which service is the bottleneck. When P99 latency is very high, look at the traces.
Alerting Strategy — Symptoms-based Alert
Alert on symptoms, not causes. Alert on "Error rate > 5%", not on "CPU > 80%". High CPU does not necessarily mean users are affected, but a high error rate means users are being hurt.
💡 Alerting Best Practices
Alert on symptoms, not causes: alert on "Payment error rate > 5%". Do not alert on "DB CPU > 90%", because DB CPU can be high without affecting users at all.
Avoid alert fatigue: more alerts = less attention. Too many alerts erode the culture of responding to the critical ones. Every alert must be actionable.
Error Budget: alert when the SLO is being missed. For example, 99.9% availability leaves a downtime budget of only about 43 minutes per 30-day month. When that budget is used up, pause feature deployments.
Runbook link: put a runbook link in every alert. It should document, step by step, what to do when the alert arrives at 3 AM.
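The 43-minute figure in the error budget item follows directly from the SLO. A small sketch of the arithmetic, assuming a 30-day window; the function names and the downtime value in the example are illustrative.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed downtime for the window, in minutes."""
    if not 0 < slo < 1:
        raise ValueError("slo must be a fraction between 0 and 1")
    return (1 - slo) * window_days * 24 * 60


def budget_remaining(slo: float, downtime_minutes: float,
                     window_days: int = 30) -> float:
    """Fraction of the budget left; a negative value means the SLO is missed."""
    budget = error_budget_minutes(slo, window_days)
    return (budget - downtime_minutes) / budget


# 99.9% over 30 days: 0.001 * 30 * 24 * 60 = 43.2 minutes
print(round(error_budget_minutes(0.999), 1))
# Hypothetical: 21.6 minutes of downtime so far leaves half the budget
print(round(budget_remaining(0.999, downtime_minutes=21.6), 2))
```

A team could gate deployments on `budget_remaining` dropping to zero or below, which is the policy the error budget item describes.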
Observability Stack — Quick Reference
SUMMARY: What We Learned Today
| Pillar | Tool | What it tells you | Query Language | Best Use Case |
|---|---|---|---|---|
| Metrics | Prometheus + Grafana | Aggregated numbers over time | PromQL | Alerting, SLO tracking, capacity planning |
| Logs | ELK Stack / Loki | Detailed event records | KQL / LogQL | Error debugging, audit trail |
| Traces | Jaeger + OpenTelemetry | Request journey across services | Trace ID search | Latency bottleneck, microservice debugging |
| Alerting | AlertManager + PagerDuty | Proactive incident notification | Alert rules (YAML) | On-call, incident response |