
Observability

Logging, metrics, tracing — understanding systems in production

# The Three Pillars

Observability answers the question: "Why is the system behaving this way?" It's not just monitoring (knowing that something is wrong); it's about understanding why.

| Pillar  | What                           | Tools                           |
|---------|--------------------------------|---------------------------------|
| Logs    | Discrete events with context   | ELK, Loki, CloudWatch           |
| Metrics | Numeric measurements over time | Prometheus, Datadog, CloudWatch |
| Traces  | End-to-end request journey     | Jaeger, Zipkin, X-Ray, Tempo    |

# Distributed Tracing

A single user request might touch 10+ services. Tracing follows that journey, showing exactly where time is spent.

Trace: user clicks "Buy Now"
TraceID: abc-123-def

├── [API Gateway]    0ms ─────────────────────── 250ms
│   ├── [Auth Service]     5ms ──── 15ms
│   ├── [Order Service]    20ms ──────────────── 230ms
│   │   ├── [Inventory DB]   25ms ───── 45ms
│   │   ├── [Payment Svc]    50ms ────────── 180ms   ← BOTTLENECK
│   │   │   └── [Stripe API]   55ms ────── 170ms
│   │   └── [Email Queue]    185ms ─ 190ms
│   └── [Response]          235ms ─ 250ms

Each bar = a "span" (service + operation + duration)
Spans are nested (parent → child)
trace_id propagated via HTTP header: traceparent
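The `traceparent` header follows the W3C Trace Context format: `version-traceid-parentid-flags`. In practice an OpenTelemetry SDK handles this for you, but a hand-rolled sketch of extracting the IDs shows what's on the wire (validation deliberately simplified; the hex value below is the W3C spec's example):

```go
package main

import (
	"fmt"
	"strings"
)

// ParseTraceparent splits a W3C traceparent header into its trace-id and
// parent span-id, e.g. "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01".
// A real implementation would also validate hex characters and flags.
func ParseTraceparent(h string) (traceID, spanID string, err error) {
	parts := strings.Split(h, "-")
	// version (2 hex), trace-id (32 hex), parent-id (16 hex), flags (2 hex)
	if len(parts) != 4 || len(parts[1]) != 32 || len(parts[2]) != 16 {
		return "", "", fmt.Errorf("malformed traceparent: %q", h)
	}
	return parts[1], parts[2], nil
}

func main() {
	tid, sid, err := ParseTraceparent(
		"00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
	if err != nil {
		panic(err)
	}
	fmt.Println("trace:", tid, "parent span:", sid)
}
```

Every service in the request path copies the same trace-id forward and substitutes its own span-id as the parent, which is what lets Jaeger or Tempo stitch the spans into one tree.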

# SLOs, SLAs & Error Budgets

SLOs turn "the system should be reliable" into a measurable target.

SLI (Indicator):  "99.2% of requests complete in < 200ms"
                   ↓ (want to achieve)
SLO (Objective):  "99.5% of requests must complete in < 200ms"
                   ↓ (promise to customers)
SLA (Agreement):  "99.9% uptime, or we refund 10%"

Error Budget = 100% - SLO
  SLO: 99.9% → Error budget: 0.1% → 43 min downtime/month

Error budget exhausted?
  → Freeze features, focus on reliability
  → This aligns engineering velocity with reliability goals
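The budget arithmetic above is worth making concrete. A small sketch (assuming a 30-day month, so 43,200 minutes):

```go
package main

import "fmt"

// DowntimeBudgetMinutes converts an availability SLO into allowed downtime
// per 30-day month, e.g. 99.9% → 0.1% of 43,200 min ≈ 43.2 min.
func DowntimeBudgetMinutes(sloPercent float64) float64 {
	const minutesPerMonth = 30 * 24 * 60 // 43,200
	return (100 - sloPercent) / 100 * minutesPerMonth
}

func main() {
	for _, slo := range []float64{99.0, 99.9, 99.99} {
		fmt.Printf("SLO %.2f%% → %.1f min downtime budget/month\n",
			slo, DowntimeBudgetMinutes(slo))
	}
}
```

Note how each extra nine shrinks the budget tenfold: 99% allows 432 minutes a month, 99.99% only about 4.3, which is why every additional nine costs disproportionately more engineering effort.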

# Alerting That Doesn't Suck

Bad alerting: "CPU over 80%" at 3 AM for a non-issue. Good alerting: symptom-based, tied to SLOs, with clear runbooks.

  • Alert on symptoms, not causes — Alert: "error rate > 1%" not "CPU > 80%"
  • Use burn rate alerts — How fast are we spending our error budget?
  • Every alert needs a runbook — What to check, who to escalate to
  • Reduce noise — If you ignore an alert repeatedly, fix or delete it
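Burn rate is the observed error ratio divided by the ratio the SLO allows: a burn rate of 1.0 spends the budget exactly over the SLO window, while 14.4 spends a 30-day budget in about two days. A minimal sketch; the 14.4x threshold is one commonly used value for a fast-burn page, not a universal constant:

```go
package main

import "fmt"

// BurnRate divides the observed error ratio by the error budget ratio.
// 1.0 = spending the budget exactly over the SLO window;
// 14.4 = spending a 30-day budget in roughly two days.
func BurnRate(observedErrorRatio, sloPercent float64) float64 {
	budget := (100 - sloPercent) / 100
	return observedErrorRatio / budget
}

func main() {
	slo := 99.9 // error budget: 0.1%
	// Illustrative multi-window policy: page on a fast 1-hour burn,
	// open a ticket on a slower 6-hour burn.
	if br := BurnRate(0.02, slo); br > 14.4 { // 2% errors observed
		fmt.Printf("PAGE: 1h burn rate %.1fx exceeds 14.4x\n", br)
	}
	if br := BurnRate(0.004, slo); br > 6 {
		fmt.Printf("TICKET: 6h burn rate %.1fx exceeds 6x\n", br)
	}
}
```

The point of alerting on burn rate rather than raw error rate is that the alert is automatically calibrated to the SLO: a blip that barely dents the budget stays quiet, while anything on pace to exhaust it pages quickly.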

⚡ Key Takeaways

  • Three pillars: Logs (what happened), Metrics (how much), Traces (where)
  • Use RED (Rate, Errors, Duration) for services and USE (Utilization, Saturation, Errors) for resources
  • SLOs + error budgets align feature velocity with reliability goals
  • Distributed tracing is essential for debugging microservice latency
  • OpenTelemetry is the vendor-neutral standard — learn it once, export anywhere