Prometheus and Grafana: Complete Monitoring Setup Guide for 2026

Why Prometheus and Grafana?

Prometheus and Grafana have become the de facto standard for open-source monitoring. Prometheus scrapes and stores time-series metrics. Grafana visualises them. Together they give you a monitoring stack that rivals expensive SaaS tools at a fraction of the cost — and with much more flexibility.

This guide walks you through a production-ready setup, from first installation to your first alerts firing.

Architecture Overview

Before installing anything, understand how the pieces fit together:

▹Prometheus — scrapes metrics from your applications and infrastructure on a pull model, stores them as time-series data, evaluates alerting rules
▹Alertmanager — receives alerts from Prometheus and routes them to the right destination (Slack, PagerDuty, email)
▹Grafana — connects to Prometheus as a data source and renders dashboards
▹Exporters — small programs that expose metrics in Prometheus format (Node Exporter for Linux servers, kube-state-metrics for Kubernetes, etc.)
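
To make the pull model concrete, here is a minimal sketch of what an exporter does: it serves plain text in the Prometheus exposition format at /metrics and waits to be scraped. The metric name demo_http_requests_total, the sample counts, and port 8000 are illustrative assumptions, and a real service would normally use a client library rather than hand-rendering the text.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical in-memory counters keyed by (method, status_code).
REQUESTS_TOTAL = {("GET", "200"): 42, ("POST", "500"): 3}


def render_metrics(counts):
    """Render counters in the Prometheus text exposition format."""
    lines = [
        "# HELP demo_http_requests_total Total HTTP requests handled.",
        "# TYPE demo_http_requests_total counter",
    ]
    for (method, status), value in sorted(counts.items()):
        lines.append(
            f'demo_http_requests_total{{method="{method}",status_code="{status}"}} {value}'
        )
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Only /metrics is served; everything else is a 404.
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(REQUESTS_TOTAL).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8000):
    """Block forever, answering scrapes on the given port (assumed value)."""
    HTTPServer(("0.0.0.0", port), MetricsHandler).serve_forever()
```

Calling serve() and opening http://localhost:8000/metrics shows the same text Prometheus would read on each scrape.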

Installation on Kubernetes (Recommended)

The easiest way to install the full stack on Kubernetes is the kube-prometheus-stack Helm chart. It installs Prometheus, Alertmanager, Grafana, and all the exporters you need in one command.

# Add the Prometheus community Helm repo
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

# Install the full stack in a monitoring namespace
helm install kube-prometheus-stack prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace \
  --set grafana.adminPassword=your-secure-password \
  --set prometheus.prometheusSpec.retention=30d \
  --set prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.resources.requests.storage=50Gi

This gives you: Prometheus with 30-day retention, Grafana with default dashboards, Node Exporter on every node, kube-state-metrics for Kubernetes object metrics, and Alertmanager ready to configure.
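
Whether 50Gi is enough for 30 days depends on how many samples you ingest. The Prometheus storage documentation gives a rough sizing rule (retention seconds × samples per second × bytes per sample, at roughly 1 to 2 bytes per sample). The sketch below applies it. The 200,000 active series and 30-second scrape interval are hypothetical inputs, so substitute your own numbers.

```python
def samples_per_second(active_series, scrape_interval_seconds):
    """Each active series yields one sample per scrape interval."""
    return active_series / scrape_interval_seconds


def estimate_disk_bytes(retention_days, ingest_rate, bytes_per_sample=2.0):
    """Rough sizing rule: retention seconds * samples/s * bytes/sample."""
    retention_seconds = retention_days * 24 * 60 * 60
    return retention_seconds * ingest_rate * bytes_per_sample


# Hypothetical cluster: 200,000 active series scraped every 30 seconds.
ingest_rate = samples_per_second(200_000, 30)
needed_gib = estimate_disk_bytes(30, ingest_rate) / 2**30
print(f"Estimated disk for 30 days: {needed_gib:.1f} GiB")  # roughly 32 GiB
```

Under these assumptions the 50Gi volume leaves some headroom, but a cluster with many more series or a shorter scrape interval would need a larger claim.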

Instrumenting Your Application

Out of the box, the stack monitors your infrastructure. To monitor your application, you need to expose metrics from the application itself.

Node.js Example

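const express = require('express')
const client = require('prom-client')

// Skip this line if your application already creates its Express app
const app = express()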

// Create a Registry
const register = new client.Registry()

// Add default metrics (CPU, memory, event loop lag)
client.collectDefaultMetrics({ register })

// Custom metric: HTTP request duration
const httpDuration = new client.Histogram({
  name: 'http_request_duration_seconds',
  help: 'HTTP request duration in seconds',
  labelNames: ['method', 'route', 'status_code'],
  buckets: [0.01, 0.05, 0.1, 0.3, 0.5, 1, 2, 5],
  registers: [register],
})

// Middleware to track every request
app.use((req, res, next) => {
  const end = httpDuration.startTimer()
  res.on('finish', () => {
    end({
      method: req.method,
      route: req.route?.path || req.path,
      status_code: res.statusCode,
    })
  })
  next()
})

// Expose metrics endpoint
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', register.contentType)
  res.end(await register.metrics())
})

Python (FastAPI) Example

from fastapi import FastAPI
from prometheus_fastapi_instrumentator import Instrumentator

app = FastAPI()

# One line to add metrics to a FastAPI app
Instrumentator().instrument(app).expose(app)

Tell Prometheus to Scrape Your App

Add a ServiceMonitor (if using kube-prometheus-stack):

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
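metadata:
  name: my-app
  namespace: monitoring
  labels:
    # By default the chart's Prometheus only selects ServiceMonitors
    # labelled with the Helm release name used at install time.
    release: kube-prometheus-stack
spec:
  # Assumes the app's Service lives in the "default" namespace; adjust as needed.
  namespaceSelector:
    matchNames:
      - default
  selector:
    matchLabels:
      app: my-app
  endpoints:
    - port: http   # the name of the Service port, not its number
      path: /metrics
      interval: 15s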
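
If targets do not appear in Prometheus, the usual cause is a label mismatch between the selector and the Service. As a rough illustration of the matchLabels rule (every selector label must be present on the object with the same value), here is a small sketch. The Service names and labels are hypothetical.

```python
def selector_matches(match_labels, object_labels):
    """True when every selector label exists on the object with an equal value."""
    return all(object_labels.get(key) == value for key, value in match_labels.items())


# Hypothetical Services and their labels.
services = {
    "my-app": {"app": "my-app", "tier": "backend"},
    "my-app-canary": {"app": "my-app-canary"},
    "database": {"app": "postgres"},
}

selector = {"app": "my-app"}
selected = [name for name, labels in services.items() if selector_matches(selector, labels)]
print(selected)  # ['my-app']
```

Extra labels on the Service are fine. Only the labels named in the selector have to match.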

Essential Grafana Dashboards

The kube-prometheus-stack includes pre-built dashboards, but these are the ones you should set up immediately:

1. Kubernetes Cluster Overview

Dashboard ID: 315 — import from grafana.com. Shows node CPU, memory, pod counts, and cluster-level resource usage.

2. Node Exporter Full

Dashboard ID: 1860 — detailed per-node metrics: CPU steal time, disk I/O, network throughput, memory pressure.

3. Kubernetes Deployment

Dashboard ID: 8588 — per-deployment metrics: replicas desired vs. available, pod restarts, resource requests vs. limits.

4. Your Application Dashboard (Build This)

Create a custom dashboard with 4 panels for the golden signals:

Panel 1 — Request Rate

sum(rate(http_request_duration_seconds_count[5m]))
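
For intuition about what rate() does with a counter, the following is a deliberately simplified sketch: it sums the increases between consecutive samples, treats any drop as a counter reset, and divides by the elapsed time. Prometheus additionally extrapolates to the window boundaries, which this sketch leaves out, and the sample values are made up.

```python
def simple_rate(samples):
    """Per-second increase of a counter.

    samples: list of (timestamp_seconds, counter_value), sorted by time.
    Returns None when fewer than two samples are available.
    """
    if len(samples) < 2:
        return None
    increase = 0.0
    previous = samples[0][1]
    for _, value in samples[1:]:
        if value < previous:
            # The counter went down, so the process restarted from zero.
            increase += value
        else:
            increase += value - previous
        previous = value
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed


# Hypothetical scrapes every 15s, with a restart between t=30 and t=45.
scrapes = [(0, 100), (15, 130), (30, 160), (45, 10), (60, 40)]
requests_per_second = simple_rate(scrapes)
```

The reset handling is why counters, rather than gauges, are the right metric type for request totals: a restart does not show up as a negative rate.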

Panel 2 — Error Rate

sum(rate(http_request_duration_seconds_count{status_code=~"5.."}[5m]))
/
sum(rate(http_request_duration_seconds_count[5m]))
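
One detail worth knowing: PromQL division pairs up series with identical label sets, so an overall error ratio needs both sides aggregated before dividing. The sketch below shows that calculation on hypothetical per-series rates, keyed by (route, status_code).

```python
import re


def error_ratio(rates_by_series):
    """Share of requests that returned a 5xx status.

    rates_by_series maps (route, status_code) to a per-second request rate.
    Returns None when there is no traffic, since the ratio is undefined.
    """
    total = sum(rates_by_series.values())
    if total == 0:
        return None
    errors = sum(
        rate
        for (_route, status), rate in rates_by_series.items()
        if re.fullmatch(r"5..", status)  # PromQL regex matchers are fully anchored
    )
    return errors / total


# Hypothetical per-series rates for two routes.
rates = {
    ("/a", "200"): 9.0,
    ("/a", "500"): 0.5,
    ("/b", "200"): 40.0,
    ("/b", "503"): 0.5,
}
ratio = error_ratio(rates)  # 1.0 errors/s out of 50.0 requests/s
```

Dividing each 5xx series by the matching series on the other side would pair it with itself and always produce 1, which is why the aggregation matters.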

Panel 3 — Latency (p99)

histogram_quantile(0.99,
  sum by (le) (rate(http_request_duration_seconds_bucket[5m]))
)
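
The p99 value is an estimate, not an exact measurement: histogram_quantile finds the bucket that contains the requested rank and interpolates linearly inside it. A simplified sketch of that calculation follows. The bucket counts are invented, and edge cases handled by Prometheus (such as non-monotonic buckets) are omitted.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate a quantile from cumulative histogram buckets.

    buckets: list of (upper_bound, cumulative_count) sorted by upper_bound,
    with the last entry being the +Inf bucket.
    """
    if not 0 < q <= 1:
        raise ValueError("this sketch only handles 0 < q <= 1")
    if len(buckets) < 2 or buckets[-1][1] == 0:
        return math.nan
    rank = q * buckets[-1][1]
    lower_bound, count_below = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                # The rank falls past the last finite bucket: report its bound.
                return buckets[-2][0]
            fraction = (rank - count_below) / (count - count_below)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, count_below = upper_bound, count
    return math.nan


# Hypothetical cumulative counts for a subset of the buckets defined earlier.
observed = [(0.1, 800), (0.5, 950), (1, 985), (2, 995), (5, 1000), (math.inf, 1000)]
p99 = histogram_quantile(0.99, observed)  # falls in the (1, 2] bucket
```

Because of the interpolation, the result can never be more precise than the bucket boundaries, so choose buckets that bracket your latency targets.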

Panel 4 — Saturation (Pod CPU)

rate(container_cpu_usage_seconds_total{container="my-app"}[5m])

Setting Up Alerting

Configure Alertmanager for Slack

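Create a Slack incoming webhook, then configure Alertmanager through the chart's Helm values (under the alertmanager.config key) and apply the change with helm upgrade: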

alertmanager:
  config:
    global:
      slack_api_url: 'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'

    route:
      group_by: ['alertname', 'cluster', 'namespace']
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 12h
      receiver: 'slack-alerts'
      routes:
        # "matchers" replaces the older "match" key, which is deprecated
        - matchers:
            - severity="critical"
          receiver: 'pagerduty-critical'

    receivers:
      - name: 'slack-alerts'
        slack_configs:
          - channel: '#alerts'
            title: '{{ .GroupLabels.alertname }}'
            text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'

      - name: 'pagerduty-critical'
        pagerduty_configs:
          - service_key: 'YOUR_PAGERDUTY_KEY'
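
To see how an alert ends up at a receiver, it can help to model the routing tree: an alert goes to the first child route whose label conditions match, otherwise to the parent's receiver, and alerts that share the same group_by label values are batched into one notification. The sketch below is a simplified model using plain dictionaries with equality-only conditions. It ignores options such as continue, and the match_labels key name is an assumption for illustration, not Alertmanager's schema.

```python
def pick_receiver(route, labels, inherited=None):
    """Return the receiver name for an alert with the given labels."""
    receiver = route.get("receiver", inherited)
    for child in route.get("routes", []):
        conditions = child.get("match_labels", {})
        if all(labels.get(key) == value for key, value in conditions.items()):
            # First matching child wins; it may have children of its own.
            return pick_receiver(child, labels, receiver)
    return receiver


def group_key(labels, group_by):
    """Alerts with the same key are sent together in one notification."""
    return tuple(labels.get(name, "") for name in group_by)


# Simplified mirror of the route configuration above.
root = {
    "receiver": "slack-alerts",
    "group_by": ["alertname", "cluster", "namespace"],
    "routes": [
        {"match_labels": {"severity": "critical"}, "receiver": "pagerduty-critical"},
    ],
}

critical = {"alertname": "HighErrorRate", "severity": "critical", "namespace": "prod"}
warning = {"alertname": "HighLatency", "severity": "warning", "namespace": "prod"}
```

Here pick_receiver(root, critical) resolves to the PagerDuty receiver, while the warning falls through to Slack.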

Essential Alert Rules

groups:
  - name: application
    rules:
      - alert: HighErrorRate
        expr: |
          sum by (job) (rate(http_request_duration_seconds_count{status_code=~"5.."}[5m]))
          /
          sum by (job) (rate(http_request_duration_seconds_count[5m])) > 0.01
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate on {{ $labels.job }}"
          description: "Error rate is {{ $value | humanizePercentage }} (threshold: 1%)"

      - alert: HighLatency
        expr: |
          histogram_quantile(0.99,
            sum by (job, le) (rate(http_request_duration_seconds_bucket[5m]))
          ) > 2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High p99 latency on {{ $labels.job }}"
          description: "p99 latency is {{ $value }}s (threshold: 2s)"

      - alert: PodCrashLooping
        expr: |
          rate(kube_pod_container_status_restarts_total[15m]) > 0
        for: 15m
        labels:
          severity: critical
        annotations:
          summary: "Pod {{ $labels.pod }} is crash looping"

      - alert: NodeDiskPressure
        expr: |
          (node_filesystem_avail_bytes / node_filesystem_size_bytes) < 0.15
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Node {{ $labels.instance }} filesystem {{ $labels.mountpoint }} has only {{ $value | humanizePercentage }} free"
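
The for: field in these rules means the expression must stay true across consecutive evaluations for that long before the alert fires. Until then the alert is pending, and a single false evaluation resets the clock. A minimal sketch of that state machine, assuming a hypothetical 60-second evaluation interval:

```python
def alert_states(condition_by_eval, for_seconds, eval_interval_seconds):
    """Return the alert state after each rule evaluation.

    condition_by_eval: list of booleans, one per evaluation, saying whether
    the alert expression returned a result at that time.
    """
    states = []
    active_since = None
    for index, is_true in enumerate(condition_by_eval):
        now = index * eval_interval_seconds
        if not is_true:
            active_since = None  # any gap resets the pending timer
            states.append("inactive")
            continue
        if active_since is None:
            active_since = now
        held_for = now - active_since
        states.append("firing" if held_for >= for_seconds else "pending")
    return states


# Condition flaps once, then stays true; for: 5m with 60s evaluations.
timeline = alert_states(
    [True, True, False, True, True, True, True, True, True],
    for_seconds=300,
    eval_interval_seconds=60,
)
```

In this timeline the early blip never fires, and the alert only reaches firing on the last evaluation, five minutes after the condition became continuously true.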

Long-Term Storage: Moving Beyond 30 Days

Prometheus's local storage lives on a single node and is not replicated, so it is not designed for long-term retention. For metrics older than 30 days, use remote storage:

▹Thanos — open source, integrates directly with Prometheus, stores metrics in S3/GCS
▹Grafana Mimir — Grafana Labs' open-source long-term metrics store, self-hosted or available as a managed service through Grafana Cloud
▹Cortex — another open-source option, and the project Mimir was forked from
▹VictoriaMetrics — high-performance alternative with lower resource usage

For most teams starting out, Thanos with S3 is the simplest and cheapest path to 1-year+ metric retention.

Common Monitoring Mistakes

Mistake 1: Alerting on causes, not symptoms — Alert on "error rate > 1%", not "CPU > 80%". High CPU is rarely actionable on its own, whereas a rising error rate is something your users actually feel.

Mistake 2: Too many alerts — If your on-call engineer gets more than 5 alerts per week, alert fatigue sets in and real problems get missed. Be ruthless about reducing noise.

Mistake 3: No runbook links in alerts — Every alert should link to a runbook explaining what it means and what to do. Add a runbook_url annotation to every alert rule.

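Mistake 4: Not monitoring the monitoring stack — Prometheus going down silently is the worst outcome. Set up an external uptime check (UptimeRobot, Pingdom) on your Prometheus, Grafana, and Alertmanager endpoints. The kube-prometheus-stack also ships an always-firing Watchdog alert that you can route to a dead man's switch service, which notifies you when the alert stops arriving.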

Mistake 5: Keeping default dashboard panels without understanding them — Know what every panel on your dashboards means. If you can't explain a metric, remove the panel.
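
Mistake 3 is easy to guard against mechanically. The sketch below walks an already-parsed rules document (the same groups / rules structure shown in the alert rules above) and lists alerts that lack a runbook_url annotation. Loading the YAML is left out, and the example rules and URL are hypothetical.

```python
def rules_missing_runbook(rule_file):
    """Return 'group/alert' names for alert rules without a runbook_url."""
    missing = []
    for group in rule_file.get("groups", []):
        for rule in group.get("rules", []):
            if "alert" not in rule:
                continue  # recording rules are not checked
            annotations = rule.get("annotations") or {}
            if not annotations.get("runbook_url"):
                missing.append(f"{group['name']}/{rule['alert']}")
    return missing


# Hypothetical parsed rules: one alert has a runbook link, one does not.
parsed = {
    "groups": [
        {
            "name": "application",
            "rules": [
                {
                    "alert": "HighErrorRate",
                    "annotations": {"runbook_url": "https://runbooks.example.com/high-error-rate"},
                },
                {"alert": "HighLatency", "annotations": {"summary": "High p99 latency"}},
                {"record": "job:http_requests:rate5m", "expr": "..."},
            ],
        }
    ]
}

print(rules_missing_runbook(parsed))  # ['application/HighLatency']
```

A check like this can run in CI so that a rule without a runbook never reaches production.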

Need Help Setting Up Monitoring?

We implement Prometheus and Grafana monitoring stacks for engineering teams — from initial setup to custom dashboards, alerting rules, and long-term storage configuration.

Book a free monitoring consultation →
