Why Prometheus and Grafana?
Prometheus and Grafana have become the de facto standard for open-source monitoring. Prometheus scrapes and stores time-series metrics. Grafana visualises them. Together they give you a monitoring stack that rivals expensive SaaS tools at a fraction of the cost — and with much more flexibility.
This guide walks you through a production-ready setup, from first installation to your first alerts firing.
Architecture Overview
Before installing anything, understand how the pieces fit together:
- Prometheus — scrapes metrics from your applications and infrastructure on a pull model, stores them as time-series data, and evaluates alerting rules
- Alertmanager — receives alerts from Prometheus and routes them to the right destination (Slack, PagerDuty, email)
- Grafana — connects to Prometheus as a data source and renders dashboards
- Exporters — small programs that expose metrics in Prometheus format (Node Exporter for Linux servers, kube-state-metrics for Kubernetes, etc.)
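To make "expose metrics in Prometheus format" concrete, here is a minimal sketch of the plain-text lines an exporter serves on its metrics endpoint and Prometheus pulls on each scrape. The helper `render_metrics` and the sample metric names are hypothetical, chosen only for illustration; real exporters use a client library and also emit `# HELP` and `# TYPE` lines.

```python
def render_metrics(samples):
    """Render samples as Prometheus text exposition lines.

    `samples` is a list of (name, labels, value) tuples. This is an
    illustrative helper, not part of any Prometheus library.
    """
    lines = []
    for name, labels, value in samples:
        if labels:
            # Labels are written as key="value" pairs inside braces.
            label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
            lines.append(f"{name}{{{label_str}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


# Hypothetical samples: one labelled counter and one unlabelled gauge.
print(render_metrics([
    ("http_requests_total", {"method": "GET", "code": "200"}, 1027),
    ("up", {}, 1),
]))
```

Because the format is plain text over HTTP, any process that can serve a page like this can be scraped, which is why exporters exist for so many systems.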
Installation on Kubernetes (Recommended)
The easiest way to install the full stack on Kubernetes is the kube-prometheus-stack Helm chart. It installs Prometheus, Alertmanager, Grafana, and all the exporters you need in one command.
# Add the Prometheus community Helm repo
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
# Install the full stack in a monitoring namespace
helm install kube-prometheus-stack prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace \
  --set grafana.adminPassword=your-secure-password \
  --set prometheus.prometheusSpec.retention=30d \
  --set prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.resources.requests.storage=50Gi
This gives you: Prometheus with 30-day retention, Grafana with default dashboards, Node Exporter on every node, kube-state-metrics for Kubernetes object metrics, and Alertmanager ready to configure.
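One way to confirm the install is actually scraping is to run the instant query `up` against the Prometheus HTTP API and look for targets reporting 0. The sketch below assumes Prometheus is reachable at a base URL you supply (for example through a port-forward); the function names are hypothetical, and the sample response is hand-written to mirror the documented response shape rather than captured from a real cluster.

```python
import json
from urllib.request import urlopen


def fetch_up(base_url):
    """Run the instant query `up` against the Prometheus HTTP API."""
    with urlopen(f"{base_url}/api/v1/query?query=up") as resp:
        return json.load(resp)


def down_targets(response):
    """Return (job, instance) pairs whose last scrape failed (up == 0)."""
    if response.get("status") != "success":
        raise ValueError("Prometheus query did not succeed")
    down = []
    for series in response["data"]["result"]:
        _timestamp, value = series["value"]  # the value arrives as a string
        if float(value) == 0:
            labels = series["metric"]
            down.append((labels.get("job"), labels.get("instance")))
    return down


# Hand-written sample in the shape of an instant-vector response.
sample = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"job": "node-exporter", "instance": "10.0.0.1:9100"},
             "value": [1700000000, "1"]},
            {"metric": {"job": "my-app", "instance": "10.0.0.7:8080"},
             "value": [1700000000, "0"]},
        ],
    },
}
print(down_targets(sample))
```

Against a live cluster you would call `down_targets(fetch_up(base_url))` instead of using the sample.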
Instrumenting Your Application
Out of the box, Prometheus monitors your infrastructure. To monitor your application, you need to expose metrics from it.
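The Node.js example below records request durations in a histogram with fixed buckets. As a rough sketch of what that metric type stores, and of how a percentile is later estimated from it, the Python below keeps cumulative "less than or equal" bucket counts and interpolates linearly inside the bucket that contains the requested rank. It is a simplified illustration of the idea behind `histogram_quantile`, not the Prometheus implementation; the function names and sample durations are made up.

```python
import bisect

# Same upper bounds (in seconds) as the Node.js example.
BUCKETS = [0.01, 0.05, 0.1, 0.3, 0.5, 1, 2, 5]


def cumulative_buckets(durations, bounds=BUCKETS):
    """Return [(upper_bound, cumulative_count), ...] ending with +Inf."""
    counts = [0] * (len(bounds) + 1)  # last slot is the +Inf bucket
    for d in durations:
        # bisect_left finds the first bound >= d, i.e. the "le" bucket.
        counts[bisect.bisect_left(bounds, d)] += 1
    running, out = 0, []
    for bound, count in zip(list(bounds) + [float("inf")], counts):
        running += count
        out.append((bound, running))
    return out


def estimate_quantile(q, cumulative):
    """Estimate quantile q (0 < q <= 1) by linear interpolation."""
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = cumulative[-1][1]
    if total == 0:
        return float("nan")
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in cumulative:
        if count >= rank:
            if bound == float("inf"):
                # Nothing is known above the highest finite bound.
                return prev_bound
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count


cum = cumulative_buckets([0.02, 0.04, 0.2, 0.4, 3])
print(cum)
print(estimate_quantile(0.99, cum))
```

The estimate can only be as precise as the bucket boundaries, so it helps to place bounds near the latencies you plan to alert on.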
Node.js Example
const express = require('express')
const client = require('prom-client')
const app = express()
// Create a Registry
const register = new client.Registry()
// Add default metrics (CPU, memory, event loop lag)
client.collectDefaultMetrics({ register })
// Custom metric: HTTP request duration
const httpDuration = new client.Histogram({
  name: 'http_request_duration_seconds',
  help: 'HTTP request duration in seconds',
  labelNames: ['method', 'route', 'status_code'],
  buckets: [0.01, 0.05, 0.1, 0.3, 0.5, 1, 2, 5],
  registers: [register],
})
// Middleware to track every request
app.use((req, res, next) => {
  const end = httpDuration.startTimer()
  res.on('finish', () => {
    end({
      method: req.method,
      // Falls back to the raw path for unmatched routes, which can raise label cardinality
      route: req.route?.path || req.path,
      status_code: res.statusCode,
    })
  })
  next()
})
// Expose metrics endpoint
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', register.contentType)
  res.end(await register.metrics())
})
// Port 3000 is an arbitrary choice for this example
app.listen(3000)
Python (FastAPI) Example
from fastapi import FastAPI
from prometheus_fastapi_instrumentator import Instrumentator
app = FastAPI()
# One line to add metrics to a FastAPI app
Instrumentator().instrument(app).expose(app)
Tell Prometheus to Scrape Your App
Add a ServiceMonitor (if using kube-prometheus-stack):
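apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-app
  namespace: monitoring
  labels:
    # With the chart's default settings, Prometheus only picks up
    # ServiceMonitors labelled with the Helm release name
    release: kube-prometheus-stack
spec:
  selector:
    matchLabels:
      app: my-app
  endpoints:
    # Must match the name of a port on the Service
    - port: http
      path: /metrics
      interval: 15s
Essential Grafana Dashboards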
The kube-prometheus-stack includes pre-built dashboards, but these are the ones you should set up immediately:
1. Kubernetes Cluster Overview
Dashboard ID: 315 — import from grafana.com. Shows node CPU, memory, pod counts, and cluster-level resource usage.
2. Node Exporter Full
Dashboard ID: 1860 — detailed per-node metrics: CPU steal time, disk I/O, network throughput, memory pressure.
3. Kubernetes Deployment
Dashboard ID: 8588 — per-deployment metrics: replicas desired vs. available, pod restarts, resource requests vs. limits.
4. Your Application Dashboard (Build This)
Create a custom dashboard with 4 panels for the golden signals:
Panel 1 — Request Rate
sum(rate(http_request_duration_seconds_count[5m]))
Panel 2 — Error Rate
sum(rate(http_request_duration_seconds_count{status_code=~"5.."}[5m]))
/
sum(rate(http_request_duration_seconds_count[5m]))
Panel 3 — Latency (p99)
histogram_quantile(0.99,
  sum by (le) (rate(http_request_duration_seconds_bucket[5m]))
)
Panel 4 — Saturation (Pod CPU)
sum by (pod) (rate(container_cpu_usage_seconds_total{container="my-app"}[5m]))
Setting Up Alerting
Configure Alertmanager for Slack
Create a Slack incoming webhook, then configure Alertmanager:
alertmanager:
  config:
    global:
      slack_api_url: 'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'
    route:
      group_by: ['alertname', 'cluster', 'namespace']
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 12h
      receiver: 'slack-alerts'
      routes:
        - match:
            severity: critical
          receiver: 'pagerduty-critical'
    receivers:
      - name: 'slack-alerts'
        slack_configs:
          - channel: '#alerts'
            title: '{{ .GroupLabels.alertname }}'
            text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
      - name: 'pagerduty-critical'
        pagerduty_configs:
          - service_key: 'YOUR_PAGERDUTY_KEY'
Essential Alert Rules
groups:
  - name: application
    rules:
      - alert: HighErrorRate
        # Aggregate both sides by job; dividing the raw series would compare
        # each 5xx series with itself and always yield 1
        expr: |
          sum by (job) (rate(http_request_duration_seconds_count{status_code=~"5.."}[5m]))
          /
          sum by (job) (rate(http_request_duration_seconds_count[5m])) > 0.01
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate on {{ $labels.job }}"
          description: "Error rate is {{ $value | humanizePercentage }} (threshold: 1%)"
      - alert: HighLatency
        expr: |
          histogram_quantile(0.99,
            sum by (le, job) (rate(http_request_duration_seconds_bucket[5m]))
          ) > 2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High p99 latency on {{ $labels.job }}"
          description: "p99 latency is {{ $value }}s (threshold: 2s)"
      - alert: PodCrashLooping
        expr: |
          rate(kube_pod_container_status_restarts_total[15m]) > 0
        for: 15m
        labels:
          severity: critical
        annotations:
          summary: "Pod {{ $labels.pod }} is crash looping"
      - alert: NodeDiskPressure
        expr: |
          (node_filesystem_avail_bytes / node_filesystem_size_bytes) < 0.15
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Node {{ $labels.instance }} has only {{ $value | humanizePercentage }} disk space free"
Long-Term Storage: Moving Beyond 30 Days
Prometheus's local storage is not designed for long-term retention. For metrics older than 30 days, use remote storage:
- Thanos — open source, integrates directly with Prometheus, stores metrics in S3/GCS
- Grafana Mimir — Grafana Labs' open-source, self-hosted long-term metrics store (also available as a managed service through Grafana Cloud)
- Cortex — another open-source option, and the project Mimir was forked from
- VictoriaMetrics — high-performance alternative with lower resource usage
For most teams starting out, Thanos with S3 is the simplest and cheapest path to 1-year+ metric retention.
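To get a feel for how local disk needs grow with retention, the Prometheus storage documentation gives a rough rule: disk space is approximately retention time in seconds, times ingested samples per second, times bytes per sample (typically 1 to 2). The sketch below applies that rule. The figure of 100,000 active series is a made-up example, and 2 bytes per sample is the pessimistic end of the range; substitute numbers from your own Prometheus.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def samples_per_second(active_series, scrape_interval_seconds):
    """Each active series yields one sample per scrape interval."""
    return active_series / scrape_interval_seconds


def estimate_disk_bytes(retention_days, ingest_rate, bytes_per_sample=2.0):
    """Rough capacity estimate: retention * ingest rate * bytes per sample."""
    return retention_days * SECONDS_PER_DAY * ingest_rate * bytes_per_sample


# Hypothetical workload: 100,000 active series scraped every 15 seconds.
rate = samples_per_second(100_000, 15)
for days in (30, 365):
    gib = estimate_disk_bytes(days, rate) / 2**30
    print(f"{days} days: about {gib:.0f} GiB")
```

Under these assumptions, a year of retention needs roughly twelve times the disk of 30 days on a single Prometheus volume, which is the kind of growth object-storage-backed systems are meant to absorb.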
Common Monitoring Mistakes
Mistake 1: Alerting on causes, not symptoms — Alert on "error rate > 1%", not "CPU > 80%". High CPU is rarely actionable on its own.
Mistake 2: Too many alerts — If your on-call engineer gets more than 5 alerts per week, alert fatigue sets in and real problems get missed. Be ruthless about reducing noise.
Mistake 3: No runbook links in alerts — Every alert should link to a runbook explaining what it means and what to do. Add a runbook_url annotation to every alert rule.
Mistake 4: Not monitoring the monitoring stack — Prometheus going down silently is the worst outcome. Set up an external uptime check (UptimeRobot, Pingdom) on your Prometheus, Grafana, and Alertmanager endpoints.
Mistake 5: Keeping default dashboard panels without understanding them — Know what every panel on your dashboards means. If you can't explain a metric, remove the panel.
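The runbook advice in Mistake 3 is easy to enforce mechanically. The sketch below assumes a rule file has already been parsed into a Python dict (for example with a YAML library such as PyYAML) and lists every alerting rule that lacks a `runbook_url` annotation. The function name, the sample rules, and the example URL are hypothetical.

```python
def rules_missing_runbook(rule_file):
    """Return "group/alert" names for alerting rules without a runbook_url."""
    missing = []
    for group in rule_file.get("groups", []):
        for rule in group.get("rules", []):
            if "alert" not in rule:
                continue  # recording rules use "record" and need no runbook
            annotations = rule.get("annotations") or {}
            if not annotations.get("runbook_url"):
                missing.append(f"{group['name']}/{rule['alert']}")
    return missing


# Hypothetical parsed rule file with one compliant and one non-compliant alert.
parsed = {
    "groups": [
        {
            "name": "application",
            "rules": [
                {"alert": "HighErrorRate",
                 "annotations": {"runbook_url": "https://example.com/runbooks/high-error-rate"}},
                {"alert": "HighLatency",
                 "annotations": {"summary": "High p99 latency"}},
                {"record": "job:http_requests:rate5m"},
            ],
        }
    ]
}
print(rules_missing_runbook(parsed))
```

Running a check like this in CI keeps new alerts from shipping without a runbook.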
Need Help Setting Up Monitoring?
We implement Prometheus and Grafana monitoring stacks for engineering teams — from initial setup to custom dashboards, alerting rules, and long-term storage configuration.