System Design Mastery
Advanced Topics: Expert Level

Observability: Logging, Metrics, Tracing

Duration: 60-90 minutes
Level: Advanced
Focus: Ops & Monitoring
001 Why Observability

When Something Breaks in Production... a 3 AM Alert

⏱ 90-120 minutes | 🔭 Advanced Level | 📊 Observability | TOPIC 02 / Phase 5

Imagine it is 3 AM. Suddenly a PagerDuty alert fires: Payment service down. Users cannot complete transactions. You are in one of two situations:

❌ Without Observability

  • You don't know which service failed
  • You SSH in and grep logs by hand
  • You spend hours working out which machine has the problem
  • "But it works on my machine!"
  • MTTR (Mean Time To Recovery) = hours
  • The customer has already left

✅ With Observability

  • The dashboard shows: payment-svc error rate 45%
  • Find the exact failed request by its trace ID
  • DB connection pool exhausted, clearly visible in the logs
  • Grafana: exactly when the latency spike started
  • Root cause in 5 minutes: a slow DB query
  • MTTR = minutes, low customer impact

📌 What Is Observability?

Observability means understanding a system's internal state from its external outputs. Monitoring tells you "what is happening"; observability tells you "why it is happening". If you can answer any previously unknown question about the system using its data, the system is observable. Achieving this takes three pillars: Metrics, Logs, and Traces.

002 Three Pillars

The 3 Pillars of Observability: Metrics, Logs, Traces

The Full Observability Flow of a Single Request

[Diagram: USER request → API GATEWAY (trace_id inject) → ORDER SVC (span: 45ms) → PAYMENT SVC (span: 120ms) → DATABASE (span: 80ms). The flow feeds METRICS (req_count, latency), LOGS (structured JSON), and TRACES (distributed spans). Total trace duration: 245ms (trace_id: abc-123). Metrics = aggregate numbers | Logs = event details | Traces = request journey]

📊 Metrics

  • Aggregated numbers
  • CPU, memory, request count
  • Time-series data
  • Tool: Prometheus
  • Alert thresholds can be set

📝 Logs

  • Event-by-event records
  • Error messages, stack traces
  • Structured JSON format
  • Tool: ELK Stack
  • Full context per event

🔍 Traces

  • Request journey tracking
  • Service A → B → C latency
  • Distributed span tree
  • Tool: Jaeger / Zipkin
  • Root cause analysis
Feature | Metrics | Logs | Traces
What it is | Numeric aggregates | Text event records | Request flow map
Tool | Prometheus + Grafana | ELK Stack / Loki | Jaeger / Zipkin
Granularity | Low (aggregate) | Medium (per event) | High (per request)
Storage cost | Low | High | Medium
Use case | Alerting, dashboards | Debugging errors | Latency analysis
Query | PromQL | KQL / Lucene | Trace ID search
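As a rough illustration of the comparison above, here is a minimal sketch, using only the Python standard library, of how a single request can feed all three pillars at once. The names `handle_request`, `METRICS`, `LOGS`, and `SPANS` are hypothetical stand-ins for a real metrics client, log shipper, and tracer; they are not part of any library.

```python
import json
import time
import uuid
from collections import Counter
from typing import Optional

METRICS = Counter()   # stand-in for a metrics client (aggregates only)
LOGS = []             # stand-in for a log pipeline (one record per event)
SPANS = []            # stand-in for a trace backend (one span per operation)


def handle_request(route: str, trace_id: Optional[str] = None) -> str:
    """Process one request and emit a metric, a log line, and a span."""
    trace_id = trace_id or uuid.uuid4().hex
    start = time.perf_counter()

    # ... business logic would run here ...

    duration_ms = (time.perf_counter() - start) * 1000

    # Metric: only an aggregate number survives, no per-request detail.
    METRICS[("http_requests_total", route)] += 1

    # Log: full context for this one event, as structured JSON.
    LOGS.append(json.dumps({
        "level": "INFO",
        "message": "request_handled",
        "route": route,
        "trace_id": trace_id,
    }))

    # Trace: timing for this operation, keyed by the shared trace_id.
    SPANS.append({"trace_id": trace_id, "name": route, "duration_ms": duration_ms})
    return trace_id


tid = handle_request("/orders", trace_id="abc-123")
handle_request("/orders")
```

After two calls the metric holds a single number (2), while the logs and spans each hold one record per request; that difference in granularity is what drives the storage-cost row of the table.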
003 Prometheus: Metrics

Prometheus: Metrics Collection and PromQL

Prometheus is an open-source metrics system. It uses a pull model: Prometheus itself scrapes data from each service's /metrics endpoint. Alertmanager then handles alerting, and Grafana handles visualization.

Prometheus Architecture — Pull Model

[Diagram: API Service, Order Service, Payment Service, and DB Exporter each expose /metrics. PROMETHEUS (TSDB storage, PromQL engine) PULLs from them every 15s, then feeds ALERTMANAGER (PagerDuty / Slack) and GRAFANA (dashboards). Prometheus scrapes the /metrics endpoint.]
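To make the pull model concrete, the following is a minimal sketch of what a service's /metrics endpoint might return, written with only the Python standard library. The `REQUEST_COUNTS` data, `render_metrics`, and `serve` are hypothetical names for illustration; a real service would normally use an official Prometheus client library rather than formatting the text by hand.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical in-process counters keyed by (method, status).
REQUEST_COUNTS = {("GET", "200"): 42, ("POST", "500"): 3}


def render_metrics() -> str:
    """Render the counters in the Prometheus text exposition format."""
    lines = [
        "# HELP http_requests_total Total HTTP requests",
        "# TYPE http_requests_total counter",
    ]
    for (method, status), value in sorted(REQUEST_COUNTS.items()):
        lines.append(
            f'http_requests_total{{method="{method}",status="{status}"}} {value}'
        )
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics().encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port: int = 8000) -> None:
    """Start the endpoint; Prometheus would scrape it on its own schedule."""
    HTTPServer(("", port), MetricsHandler).serve_forever()
```

The service never pushes anything: it only keeps counters in memory and answers when scraped, which is why a scrape failure is itself a useful "target down" signal.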
Metric Type | What it does | Example | When to use
Counter | Only goes up (resets to zero on process restart) | http_requests_total | Request count, error count
Gauge | Goes both up and down | memory_usage_bytes | CPU, memory, active connections
Histogram | Bucketed observations | http_request_duration_seconds | Latency percentiles (P99)
Summary | Pre-calculated quantiles | rpc_duration_seconds | Client-side percentile calculation
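The semantics of the first three metric types can be sketched in a few lines of plain Python. These `Counter`, `Gauge`, and `Histogram` classes are simplified, hypothetical models written for this lesson, not the API of any Prometheus client library; Summary is left out because it needs a streaming quantile algorithm.

```python
import bisect


class Counter:
    """Monotonic: can only go up."""

    def __init__(self):
        self.value = 0.0

    def inc(self, amount: float = 1.0) -> None:
        if amount < 0:
            raise ValueError("counters cannot decrease")
        self.value += amount


class Gauge:
    """Point-in-time value: can go up or down."""

    def __init__(self):
        self.value = 0.0

    def set(self, value: float) -> None:
        self.value = value


class Histogram:
    """Counts observations into buckets with 'less than or equal' upper bounds."""

    def __init__(self, bounds):
        self.bounds = sorted(bounds)
        self.bucket_counts = [0] * (len(self.bounds) + 1)  # last slot is +Inf
        self.total = 0.0

    def observe(self, value: float) -> None:
        # bisect_left puts a value equal to a bound into that bound's bucket.
        self.bucket_counts[bisect.bisect_left(self.bounds, value)] += 1
        self.total += value

    def cumulative(self):
        """Return [(upper_bound, cumulative_count), ...], the exposed form."""
        out, running = [], 0
        for bound, count in zip(self.bounds + [float("inf")], self.bucket_counts):
            running += count
            out.append((bound, running))
        return out


request_count = Counter()
request_count.inc()
request_count.inc()

latency = Histogram([0.1, 0.5, 1.0])
for seconds in (0.05, 0.3, 0.3, 2.0):
    latency.observe(seconds)
```

Note that a histogram stores only bucket counts, never the raw observations, so any percentile derived from it is an estimate whose accuracy depends on how the bucket bounds were chosen.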
promql-examples.txt — PromQL Queries
# Error Rate: percentage of requests that failed in the last 5 minutes
# sum() on both sides so the 5xx series are divided by all requests, not by themselves
sum(rate(http_requests_total{status=~"5.."}[5m]))
  /
sum(rate(http_requests_total[5m])) * 100

# P99 Latency: 99% of requests complete within this time
histogram_quantile(0.99,
  sum by (le) (rate(http_request_duration_seconds_bucket[5m]))
)

# Request Rate by Service: RPS for each service
sum(rate(http_requests_total[1m])) by (service)

# Active Connections per Pod
sum(tcp_connections_active) by (pod)

# Memory Usage Percentage
(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes)
  / node_memory_MemTotal_bytes * 100

# ───────────── Alert Rule Example ─────────────
# prometheus_rules.yml
groups:
  - name: payment-alerts
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{service="payment",status=~"5.."}[5m]))
          /
          sum(rate(http_requests_total{service="payment"}[5m])) > 0.05
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Payment service error rate > 5%"
          description: "Error rate: {{ $value | humanizePercentage }}"
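The P99 query above relies on histogram_quantile, which estimates a percentile from bucket counts by linear interpolation. The function below is a simplified sketch of that idea in Python, assuming cumulative buckets that end with +Inf and a quantile in (0, 1]; it is an illustration, not Prometheus's actual implementation, and the sample bucket data is invented.

```python
import math


def histogram_quantile(q: float, buckets: list) -> float:
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    `buckets` must be sorted by upper bound and end with (+Inf, total).
    """
    if not 0 < q <= 1:
        raise ValueError("this sketch only handles 0 < q <= 1")
    total = buckets[-1][1]
    if total == 0:
        return math.nan

    rank = q * total                      # how many observations lie at or below the answer
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # The quantile falls in the +Inf bucket: report the highest finite bound.
                return prev_bound
            # Assume observations are spread evenly inside this bucket.
            in_bucket = count - prev_count
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / in_bucket
        prev_bound, prev_count = bound, count
    return prev_bound


# Invented example: 100 requests, cumulative counts per upper bound (seconds).
sample = [(0.1, 50), (0.5, 90), (1.0, 99), (math.inf, 100)]
p95 = histogram_quantile(0.95, sample)
p99 = histogram_quantile(0.99, sample)
```

Because the estimate can never be finer than the bucket that contains it, it is worth placing bucket bounds close to the latency target you intend to alert on.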
004 Grafana: Visualization

Grafana: Dashboards and SLI/SLO/SLA

Grafana is an open-source visualization platform. It does not store data itself; instead it connects to 50+ data sources, including Prometheus, Loki, Tempo, and Elasticsearch, and builds dashboards on top of them.

📊 Grafana Data Sources

  • Prometheus → Metrics dashboards
  • Loki → Log visualization
  • Tempo / Jaeger → Trace viewer
  • Elasticsearch → Log search
  • MySQL / PostgreSQL → DB metrics
  • CloudWatch → AWS metrics

🎛️ Dashboard Panel Types

  • Time series → latency trends
  • Stat → current error rate %
  • Gauge → CPU / memory fill
  • Table → per-service summary
  • Heatmap → request distribution
  • Alert list → active incidents
grafana-dashboard-panel.json — Dashboard Config
{
  "title": "Payment Service — SLO Dashboard",
  "panels": [
    {
      "title": "Error Rate (5m)",
      "type": "stat",
      "datasource": "Prometheus",
      "targets": [
        {
          "expr": "sum(rate(http_requests_total{service='payment',status=~'5..'}[5m])) / sum(rate(http_requests_total{service='payment'}[5m])) * 100",
          "legendFormat": "Error %"
        }
      ],
      "thresholds": {
        "steps": [
          { "color": "green", "value": null },
          { "color": "yellow", "value": 1 },
          { "color": "red", "value": 5 }
        ]
      }
    },
    {
      "title": "P99 Latency",
      "type": "timeseries",
      "datasource": "Prometheus",
      "targets": [
        {
          "expr": "histogram_quantile(0.99, rate(http_request_duration_seconds_bucket{service='payment'}[5m]))",
          "legendFormat": "P99"
        },
        {
          "expr": "histogram_quantile(0.95, rate(http_request_duration_seconds_bucket{service='payment'}[5m]))",
          "legendFormat": "P95"
        }
      ]
    },
    {
      "title": "Request Rate (RPS)",
      "type": "timeseries",
      "targets": [
        {
          "expr": "sum(rate(http_requests_total{service='payment'}[1m])) by (status)",
          "legendFormat": "{{ status }}"
        }
      ]
    }
  ]
}

💡 SLI, SLO, SLA: What Is the Difference?

SLI (Service Level Indicator): the actual measured metric. Example: the payment service's current availability = 99.87%.

SLO (Service Level Objective): the internal target. Example: we want to maintain 99.9% availability. This is the team's goal.

SLA (Service Level Agreement): the external contract with customers. Example: we guarantee 99.5%. Missing it means refunds or penalties. Keep the SLO stricter than the SLA (here 99.9% vs 99.5%) so there is a buffer.
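The relationship between the three terms can be sketched as a small calculation. The targets below are the ones used in the text (99.9% SLO, 99.5% SLA); the request counts and the function names `availability_sli` and `classify` are assumptions made up for the example.

```python
SLO_TARGET = 99.9   # internal objective, from the text
SLA_TARGET = 99.5   # external commitment, from the text


def availability_sli(successful: int, total: int) -> float:
    """SLI: the measured share of successful requests, as a percentage."""
    if total == 0:
        return 100.0
    return successful / total * 100


def classify(sli: float) -> str:
    """Place a measured SLI relative to the SLO and the SLA."""
    if sli >= SLO_TARGET:
        return "healthy"
    if sli >= SLA_TARGET:
        return "slo_missed"      # inside the buffer: act before the contract is at risk
    return "sla_breached"        # contract violated: refunds or penalties apply


# Hypothetical month: 998,700 successful requests out of 1,000,000.
sli = availability_sli(successful=998_700, total=1_000_000)
status = classify(sli)
```

With these numbers the SLI is 99.87%, which misses the SLO but still honors the SLA; the gap between the two targets is exactly the buffer the text recommends.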

005 Jaeger: Distributed Tracing

Distributed Tracing: Tracking the Request Journey with Jaeger

In a microservices system, a single request may pass through 10 services. How much time was spent in each one? Distributed tracing answers that question. Each request produces one trace, and each service call within it is a span.

Trace Waterfall — trace_id: abc-xyz-789

[Waterfall diagram, 0-300ms axis: API Gateway 280ms; Order Svc 150ms; DB Query 80ms; Payment Svc 100ms; Payment DB 80ms. Trace = the full journey of one request. Span = each service call. Visible bottleneck: Payment DB at 80ms, a candidate for optimization.]
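One way to read a waterfall like this is to compute each span's self time: its own duration minus the time spent in its children. The sketch below uses the durations from the waterfall; the parent/child nesting (Gateway → Order Svc → DB Query, Gateway → Payment Svc → Payment DB) and the assumption that child spans do not overlap are mine, since the diagram does not state them.

```python
from dataclasses import dataclass, field


@dataclass
class Span:
    name: str
    duration_ms: int
    children: list = field(default_factory=list)

    def self_time_ms(self) -> int:
        """Time spent in this span itself, excluding its children.

        Assumes children run one after another (no overlap).
        """
        return self.duration_ms - sum(c.duration_ms for c in self.children)


def flatten(span):
    """Yield a span and all of its descendants, depth first."""
    yield span
    for child in span.children:
        yield from flatten(child)


# Durations from the waterfall; the tree shape is an assumption.
trace_root = Span("API Gateway", 280, [
    Span("Order Svc", 150, [Span("DB Query", 80)]),
    Span("Payment Svc", 100, [Span("Payment DB", 80)]),
])

self_times = {s.name: s.self_time_ms() for s in flatten(trace_root)}
slowest = max(self_times, key=self_times.get)
```

Under these assumptions the two database spans account for 160ms of the 280ms total, which is why the waterfall points at the database layer rather than at the services that call it.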
opentelemetry_tracing.py — OpenTelemetry Instrumentation
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.jaeger.thrift import JaegerExporter
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
from fastapi import FastAPI, Request
import httpx

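# ──── Setup: Jaeger Exporter ────
# Note: the Jaeger Thrift exporter package is deprecated in recent OpenTelemetry
# Python releases; newer setups typically send OTLP to Jaeger instead.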
jaeger_exporter = JaegerExporter(
    agent_host_name="jaeger",  # Docker service name
    agent_port=6831,
)

provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(jaeger_exporter))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("order-service")

app = FastAPI()
FastAPIInstrumentor.instrument_app(app)  # Auto-instrument all routes

# ──── Manual Span: Custom business logic ────
@app.post("/order")
async def create_order(request: Request, item_id: str, quantity: int):
    # trace_id is automatically injected by FastAPIInstrumentor
    with tracer.start_as_current_span("validate-inventory") as span:
        span.set_attribute("item_id", item_id)
        span.set_attribute("quantity", quantity)

        # Check inventory — nested span
        with tracer.start_as_current_span("db-query-inventory"):
            available = await check_inventory(item_id)
            if available < quantity:
                span.set_status(trace.StatusCode.ERROR, "insufficient stock")
                return {"error": "out of stock"}

    # Call Payment Service — trace_id propagates via HTTP header
    with tracer.start_as_current_span("call-payment-service") as pay_span:
        pay_span.set_attribute("amount", quantity * 100)
        async with httpx.AsyncClient() as client:
            # httpx is not auto-instrumented in this example, so the
            # traceparent header is injected manually via get_trace_headers()
            response = await client.post(
                "http://payment-svc/charge",
                json={"item_id": item_id, "quantity": quantity},
                headers=get_trace_headers()  # Propagate context
            )

    return {"order_id": "ord-123", "status": "confirmed"}


# ──── Trace Context Propagation ────
def get_trace_headers() -> dict:
    """Extract current trace context as HTTP headers for downstream services"""
    from opentelemetry.propagate import inject
    headers = {}
    inject(headers)  # Adds 'traceparent' header: 00-{trace_id}-{span_id}-01
    return headers

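# traceparent example (W3C Trace Context format):
# traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
#   version (2 hex) - trace_id (32 hex) - span_id (16 hex) - flags (2 hex)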

📌 What Is OpenTelemetry?

OpenTelemetry (OTel) is a vendor-neutral observability framework and a CNCF project. Instrument once, then export to any backend: Jaeger, Zipkin, Datadog, New Relic. The trace ID propagates as the HTTP header traceparent. Each downstream service reads this header and attaches its own spans to the parent span.
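To show what a downstream service does with that header, here is a minimal sketch that parses an incoming traceparent value and builds the one to send on the next hop. It follows the W3C Trace Context layout (version, 32-hex trace ID, 16-hex parent span ID, flags) but skips some of the spec's rules, such as rejecting all-zero IDs; in practice the OpenTelemetry SDK does this for you. The function names are hypothetical.

```python
import re
import secrets

TRACEPARENT_RE = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


def parse_traceparent(header: str) -> dict:
    """Split a traceparent header into its four fields."""
    match = TRACEPARENT_RE.match(header.strip())
    if not match:
        raise ValueError(f"malformed traceparent: {header!r}")
    version, trace_id, parent_id, flags = match.groups()
    return {
        "version": version,
        "trace_id": trace_id,
        "parent_id": parent_id,
        "sampled": bool(int(flags, 16) & 0x01),
    }


def child_traceparent(incoming: str) -> str:
    """Keep the trace_id, but mint a new span ID for the outgoing call."""
    ctx = parse_traceparent(incoming)
    new_span_id = secrets.token_hex(8)   # 8 random bytes = 16 hex characters
    flags = "01" if ctx["sampled"] else "00"
    return f"{ctx['version']}-{ctx['trace_id']}-{new_span_id}-{flags}"


incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
outgoing = child_traceparent(incoming)
```

The trace ID stays constant across every hop while the span ID changes, which is what lets the backend stitch spans from different services into one tree.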

006 ELK Stack: Centralized Logging

ELK Stack: Log Management with Elasticsearch, Logstash, and Kibana

When logs from hundreds of services sit on separate machines, debugging is a nightmare. The ELK Stack brings all logs into one place and makes them searchable. Elasticsearch, Logstash, Kibana, and Beats (Filebeat) together form a complete log pipeline.

ELK Log Pipeline: From the App to Kibana

[Pipeline diagram: Application (service; writes logs to /var/log/*.log) → FILEBEAT (Beats family; tails log files, lightweight agent) → LOGSTASH (pipeline; parse & filter, transform, enrich metadata) → ELASTICSEARCH (storage + search; index & store, full-text search, inverted index) → KIBANA (visualization; search UI, dashboards)]
structured_logging.py — JSON Structured Log Format
import logging
import json
import time
import uuid
from contextvars import ContextVar

# Correlation ID: tracks a request across its whole journey
_correlation_id: ContextVar[str] = ContextVar('correlation_id', default='')

class StructuredLogger:
    def __init__(self, service_name: str):
        self.service = service_name

    def _log(self, level: str, message: str, **extra):
        record = {
            "timestamp": time.strftime("%Y-%m-%dT%H:%M:%S.000Z", time.gmtime()),
            "level": level,
            "service": self.service,
            "correlation_id": _correlation_id.get(),  # Track across services
            "message": message,
            **extra  # Additional context fields
        }
        print(json.dumps(record))  # ELK collects stdout

    def info(self, msg: str, **kw): self._log("INFO", msg, **kw)
    def error(self, msg: str, **kw): self._log("ERROR", msg, **kw)
    def warn(self, msg: str, **kw): self._log("WARN", msg, **kw)


# Usage example
logger = StructuredLogger("payment-service")

def process_payment(user_id: str, amount: float, request_id: str):
    _correlation_id.set(request_id)  # Set for this request lifecycle

    logger.info("payment_started",
        user_id=user_id,
        amount=amount,
        currency="BDT"
    )

    try:
        result = charge_card(user_id, amount)
        logger.info("payment_success",
            user_id=user_id,
            amount=amount,
            transaction_id=result.txn_id,
            duration_ms=result.duration
        )
        return result
    except Exception as e:
        logger.error("payment_failed",
            user_id=user_id,
            amount=amount,
            error=str(e),
            error_type=type(e).__name__
        )
        raise

# Output (ELK ingests this JSON):
# {"timestamp":"2026-05-08T03:15:22.000Z","level":"ERROR",
#  "service":"payment-service","correlation_id":"req-abc-123",
#  "message":"payment_failed","user_id":"u-456","amount":1500.0,
#  "error":"DB connection timeout","error_type":"TimeoutError"}

⚠️ Log Cardinality and Log Levels

Log cardinality problem: User ID, Order ID, and Trace ID are fine as log field values. But if you turn an IP address or a timestamp into a field key, Elasticsearch suffers a mapping explosion. Keep high-cardinality data in values, not in keys.

Log levels: run production at INFO or WARN. Leaving DEBUG logging enabled in production can increase storage cost by as much as 10x. Use a sampling strategy.
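The key-versus-value distinction is easy to see in a small example. The sketch below builds the same three events both ways and counts how many distinct field names an index would have to track; the IP addresses, field names, and the `distinct_keys` helper are invented for illustration.

```python
# Three events: (client IP, number of requests).
events = [("10.0.0.1", 3), ("10.0.0.2", 5), ("10.0.0.3", 1)]

# Anti-pattern: the IP becomes part of the field *key*,
# so every new IP adds a new field to the index mapping.
bad_records = [{f"requests_from_{ip}": count} for ip, count in events]

# Preferred: a fixed set of keys; the high-cardinality data lives in the values.
good_records = [{"client_ip": ip, "request_count": count} for ip, count in events]


def distinct_keys(records) -> set:
    """The set of field names an index mapping would need to hold."""
    return {key for record in records for key in record}


bad_field_count = len(distinct_keys(bad_records))    # grows with the number of IPs
good_field_count = len(distinct_keys(good_records))  # stays constant
```

With the first layout the field count grows with traffic; with the second it stays at two no matter how many clients appear.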

007 Production Code Examples

Production Code: Metrics + Logs + Traces Together

In real production code the three pillars live together. The example below integrates Prometheus metrics, structured logging, and OpenTelemetry traces into the handling of a single request in a Node.js service.

orderService.js — Node.js Full Observability
const express = require('express');
const { MeterProvider } = require('@opentelemetry/sdk-metrics');
const { trace, context } = require('@opentelemetry/api');
const promClient = require('prom-client');

const app = express();
app.use(express.json()); // Parse JSON bodies so req.body is populated in handlers

// ─── 1. METRICS — Prometheus ───
const register = new promClient.Registry();
promClient.collectDefaultMetrics({ register });

const httpRequestCounter = new promClient.Counter({
  name: 'http_requests_total',
  help: 'Total HTTP requests',
  labelNames: ['method', 'route', 'status_code'],
  registers: [register],
});

const httpLatency = new promClient.Histogram({
  name: 'http_request_duration_seconds',
  help: 'HTTP request latency in seconds',
  labelNames: ['method', 'route'],
  buckets: [0.01, 0.05, 0.1, 0.3, 0.5, 1, 2, 5],
  registers: [register],
});

// ─── 2. STRUCTURED LOGGING ───
function log(level, message, fields = {}) {
  const span = trace.getActiveSpan();
  const spanCtx = span?.spanContext();

  const entry = {
    timestamp: new Date().toISOString(),
    level,
    service: 'order-service',
    message,
    trace_id: spanCtx?.traceId || '',   // Link log to trace
    span_id: spanCtx?.spanId || '',
    ...fields,
  };
  console.log(JSON.stringify(entry));   // ELK / Loki ingests stdout
}

// ─── 3. TRACES — OpenTelemetry ───
const tracer = trace.getTracer('order-service', '1.0.0');

// ─── Request Handler ───
app.post('/orders', async (req, res) => {
  const startTime = Date.now();
  const { userId, items } = req.body;

  // Span: entire request
  return tracer.startActiveSpan('create-order', async (span) => {
    try {
      span.setAttributes({
        'user.id': userId,
        'order.item_count': items.length,
      });

      log('INFO', 'order_request_received', {
        user_id: userId,
        item_count: items.length,
      });

      // Nested span: validate
      const inventory = await tracer.startActiveSpan('validate-inventory',
        async (invSpan) => {
          const result = await checkInventory(items);
          invSpan.setAttributes({ 'inventory.available': result.available });
          invSpan.end();
          return result;
        }
      );

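      if (!inventory.available) {
        span.setStatus({ code: 2, message: 'out_of_stock' }); // 2 = SpanStatusCode.ERROR
        log('WARN', 'order_rejected', { reason: 'out_of_stock', user_id: userId });
        // Count rejected requests too, so the 4xx rate is visible in metrics
        httpRequestCounter.inc({ method: 'POST', route: '/orders', status_code: 400 });
        res.status(400).json({ error: 'out_of_stock' });
        return;
      }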

      // Nested span: save to DB
      const order = await tracer.startActiveSpan('save-order-db',
        async (dbSpan) => {
          const saved = await db.save({ userId, items, status: 'pending' });
          dbSpan.setAttribute('db.order_id', saved.id);
          dbSpan.end();
          return saved;
        }
      );

      const durationMs = Date.now() - startTime;
      log('INFO', 'order_created', {
        user_id: userId,
        order_id: order.id,
        duration_ms: durationMs,
      });

      // Record metrics
      httpRequestCounter.inc({ method: 'POST', route: '/orders', status_code: 201 });
      httpLatency.observe({ method: 'POST', route: '/orders' }, durationMs / 1000);

      span.setStatus({ code: 1 });
      res.status(201).json({ orderId: order.id });

    } catch (err) {
      span.recordException(err);
      span.setStatus({ code: 2, message: err.message });

      log('ERROR', 'order_creation_failed', {
        user_id: userId,
        error: err.message,
        stack: err.stack,
      });

      httpRequestCounter.inc({ method: 'POST', route: '/orders', status_code: 500 });
      res.status(500).json({ error: 'internal_error' });
    } finally {
      span.end();
    }
  });
});

// ─── Prometheus scrape endpoint ───
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', register.contentType);
  res.end(await register.metrics());
});

app.listen(3000);

💡 Correlation ID: The Key to Connecting the 3 Pillars

Including trace_id in your logs lets you jump straight from a log line in Kibana to the matching trace in Jaeger. When you see a latency spike in Grafana, look up the traces from that exact timestamp to see which span took the longest. This is the real power of observability.
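The correlation step itself is just a filter on a shared field. As a minimal sketch, assuming logs arrive as JSON lines that carry a trace_id (the sample records and the `logs_for_trace` helper are invented for illustration), it could look like this:

```python
import json

# Invented sample: JSON log lines from two services, as a log store might return them.
raw_logs = [
    '{"service":"order-service","level":"INFO","message":"order_request_received","trace_id":"abc-123"}',
    '{"service":"payment-service","level":"ERROR","message":"payment_failed","trace_id":"abc-123"}',
    '{"service":"order-service","level":"INFO","message":"order_created","trace_id":"def-456"}',
]


def logs_for_trace(lines, trace_id):
    """Return the parsed log records that belong to one trace, in input order."""
    records = (json.loads(line) for line in lines)
    return [r for r in records if r.get("trace_id") == trace_id]


timeline = logs_for_trace(raw_logs, "abc-123")
services_involved = [r["service"] for r in timeline]
```

Because both services wrote the same trace_id, one lookup reconstructs the request's story across service boundaries, which plain per-service log files cannot do.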

008 Real World & Interview Tips

Real World Use Cases and Interview Tips

Company | Metrics | Logs | Tracing | Notable
Netflix | Atlas (custom) | Elasticsearch | Zipkin | Chaos engineering + observability integrated
Uber | M3 (custom) | ELK Stack | Jaeger (created by Uber) | Jaeger was open-sourced by Uber
Google | Monarch | Cloud Logging | Dapper (invented tracing) | The Dapper paper is the origin of distributed tracing
Facebook | ODS | Scribe | Canopy | Billions of events/sec log processing
Airbnb | Prometheus | ELK | Jaeger | Standard OSS stack, heavily customized
STEP 01

Interview Question: "How would you monitor a system?"

Start by saying: "I would approach it through the three pillars of observability: Metrics, Logs, Traces." Then explain each layer. This structure signals to the interviewer that you have production experience.

STEP 02

Metrics First: Learn "What Is Happening"

Prometheus + Grafana. Mention the key metrics: error rate, P99 latency, RPS, CPU/memory. Name the four Golden Signals from the Google SRE book: Latency, Errors, Traffic, Saturation (LETS).

STEP 03

Use Logs to Understand "Why It Is Happening"

ELK Stack. Explain why structured logging (JSON) matters. Mention following a request across services with a correlation ID.

STEP 04

Use Traces to Pinpoint "Where It Is Happening"

Jaeger + OpenTelemetry. In a distributed system, the trace waterfall shows which service is the bottleneck. When P99 latency is very high, look at the traces.

STEP 05

Alerting Strategy: Symptom-Based Alerts

Alert on symptoms, not causes. Alert on "Error rate > 5%", not on "CPU > 80%". High CPU does not necessarily mean users are affected, but a high error rate means users are being hurt.
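A symptom-based rule such as "error rate > 5%" usually also waits for the condition to persist before paging anyone. The sketch below models that with a simple state machine; counting consecutive evaluation cycles is a simplification of a time-based hold such as the `for: 2m` clause in the alert rule earlier, and the function name and sample ratios are invented.

```python
def evaluate_alert(error_ratios, threshold=0.05, for_evaluations=3):
    """Return the alert state after each evaluation cycle.

    The alert becomes 'pending' when the ratio first exceeds the threshold and
    'firing' only once it has stayed above it for `for_evaluations` cycles in
    a row. A single good sample resets it to 'inactive'.
    """
    states, breaches = [], 0
    for ratio in error_ratios:
        breaches = breaches + 1 if ratio > threshold else 0
        if breaches == 0:
            states.append("inactive")
        elif breaches < for_evaluations:
            states.append("pending")
        else:
            states.append("firing")
    return states


# Invented error ratios, one per evaluation cycle.
ratios = [0.01, 0.08, 0.09, 0.02, 0.07, 0.08, 0.10]
states = evaluate_alert(ratios)
```

The short blip in cycles two and three never pages anyone, while the sustained breach at the end does; that hold period is one of the simplest defenses against noisy alerts.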

💡 Alerting Best Practices

Alert on symptoms, not causes: alert on "Payment error rate > 5%". Do not alert on "DB CPU > 90%": DB CPU can be high without users being affected at all.

Avoid alert fatigue: more alerts = less attention. The culture of responding to critical alerts erodes. Every alert must be actionable.

Error budget: alert when the SLO is missed. For example, 99.9% availability leaves a downtime budget of only about 43 minutes per month. Once that budget is spent, halt feature deployments.

Runbook link: attach a runbook link to every alert. It should document, step by step, what to do when the alert fires at 3 AM.
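The 43-minute figure comes from a short calculation, sketched below under the assumption of a 30-day month. The function names are hypothetical, and the 21.6 minutes of downtime in the usage lines is an invented value.

```python
def error_budget_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Allowed downtime, in minutes, for an availability SLO over a window."""
    window_minutes = window_days * 24 * 60          # 30 days = 43,200 minutes
    return window_minutes * (1 - slo_percent / 100)


def budget_remaining(slo_percent: float, downtime_minutes: float,
                     window_days: int = 30) -> float:
    """Fraction of the error budget still unspent (negative means overspent)."""
    budget = error_budget_minutes(slo_percent, window_days)
    return (budget - downtime_minutes) / budget


monthly_budget = error_budget_minutes(99.9)          # about 43.2 minutes
remaining = budget_remaining(99.9, downtime_minutes=21.6)
freeze_deploys = remaining <= 0                      # policy from the text
```

Each extra nine shrinks the budget tenfold: at 99.99% the same month allows only about 4.3 minutes of downtime.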

Observability Stack — Quick Reference

Metrics: Prometheus, Grafana, Thanos (long-term)
Logs: Elasticsearch, Logstash, Kibana, Filebeat, Loki (lightweight)
Traces: Jaeger, Zipkin, Tempo, OpenTelemetry SDK
Alerting: AlertManager, PagerDuty, OpsGenie, Slack
009 Lesson Summary

SUMMARY: What We Learned Today

Pillar | Tool | What it tells you | Query Language | Best Use Case
Metrics | Prometheus + Grafana | Aggregated numbers over time | PromQL | Alerting, SLO tracking, capacity planning
Logs | ELK Stack / Loki | Detailed event records | KQL / LogQL | Error debugging, audit trail
Traces | Jaeger + OpenTelemetry | Request journey across services | Trace ID search | Latency bottleneck, microservice debugging
Alerting | AlertManager + PagerDuty | Proactive incident notification | Alert rules (YAML) | On-call, incident response
010 Knowledge Check
011 Assignments
012 Practical Lab