Prometheus scrapes the `/metrics` endpoints of applications running in the cluster. Metrics are numeric measurements; scraped over time, they form time series that describe how a given application is running.
Metrics are written in the following notation:

    <metric name>{<label name>=<label value>, ...}

A gauge is a metric that represents a single numerical value that can arbitrarily go up and down. The following example gauge metric indicates that `http://data.norge.no/` returns the HTTP status code 200:

    probe_http_status_code{ingress="ingress-prod-v4", instance="http://data.norge.no/", namespace="prod"} 200

Another example is a counter, a metric whose value can only increase or be reset to zero (for example, when the application restarts):

    processed_mail_requests{fdk_service="fdk-mail-sender-service", status="success"} 7

Based on metrics, you can create rules that trigger alerts whenever an expression is true. For example, the expression `probe_http_status_code{} >= 500` indicates that an application is unable to respond correctly.
See Prometheus Introduction and Metric Types for a more thorough introduction.
Service | Purpose |
---|---|
https://prometheus.fellesdatakatalog.digdir.no/rules | Alert rules overview |
https://karma.fellesdatakatalog.digdir.no | Alerting dashboard |
https://thanos.fellesdatakatalog.digdir.no | Explore and query metrics |
https://alertmanager.fellesdatakatalog.digdir.no | See alerts and silence them |
https://grafana.fellesdatakatalog.digdir.no | Dashboards based on metrics |
See Metric and label naming for best practices.
Use the following pod annotations to configure scraping of metrics:

    annotations:
      # Enable scraping of metrics.
      prometheus.io/scrape: "true"
      # Specifies metrics port. Default: container's port.
      prometheus.io/port: "8080"
      # Specifies metrics path. Default: "/metrics".
      prometheus.io/path: "/metrics"

If you need more customization, such as a custom scrape interval, look into using a ServiceMonitor or PodMonitor instead.
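
As an illustration of the PodMonitor alternative, the following is a minimal sketch that sets a custom scrape interval. The resource name, namespace, pod label, and port name are hypothetical, and the `release` label is an assumption: it is the label used for PrometheusRule resources, and the Prometheus instance may select PodMonitors differently.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: my-app                  # hypothetical name
  namespace: prod               # hypothetical; pods are selected from this namespace
  labels:
    # Assumption: same selector label as used for PrometheusRule resources.
    release: monitoring-kube-prometheus-stack
spec:
  selector:
    matchLabels:
      app: my-app               # hypothetical pod label
  podMetricsEndpoints:
    - port: http                # name of the container port (hypothetical)
      path: /metrics
      interval: 15s             # custom scrape interval
```

Check how the Prometheus instance in the cluster selects monitors before relying on this sketch.
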
See Alerting Rules for how to configure rules.
Alerts fire in the `#fdk-dev-alerts` and `#fdk-prod-alerts` Slack channels.
`PrometheusRule` resources in the GitHub repo fdk-infra/infrastructure/base/alerts are used to configure alert rules, and are automatically synced into Prometheus running in the clusters. Remember to add any new files within the alerts folder to the `resources` list in `kustomization.yaml`.
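
For illustration, registering a new rule file in the alerts folder's `kustomization.yaml` might look roughly like the sketch below. The file name is hypothetical, and the existing entries in the real file are omitted.

```yaml
# kustomization.yaml in the alerts folder (sketch; existing entries omitted)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  # Hypothetical file name for the new PrometheusRule manifest.
  - my-new-alert-rule.yaml
```

The rule file itself follows the template below.
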

    ---
    apiVersion: monitoring.coreos.com/v1
    kind: PrometheusRule
    metadata:
      name: <rule name - kebab case>
      namespace: monitoring
      labels:
        release: monitoring-kube-prometheus-stack
    spec:
      groups:
        - name: fdk
          rules:
            - alert: <alert name - pascal case>
              annotations:
                description: <alert description (shown in slack)>
                summary: <alert title (shown in slack)>
              expr: <alert condition (e.g. "up{} == 0")>
              for: <time to wait before alerting (e.g. "0s")>
              labels:
                severity: <severity (none|info|warning|error|critical)>
                dashboard_url: <link to grafana/kibana dashboard, if any>
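
As a worked illustration of the template, the sketch below fills it in with the `probe_http_status_code{} >= 500` expression mentioned earlier. The rule name, alert name, annotation texts, `for` duration, and severity are hypothetical choices, not an existing rule in the repo. The `{{ $labels.instance }}` and `{{ $value }}` placeholders are standard Prometheus alert templating.

```yaml
---
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: probe-http-server-error          # hypothetical rule name (kebab case)
  namespace: monitoring
  labels:
    release: monitoring-kube-prometheus-stack
spec:
  groups:
    - name: fdk
      rules:
        - alert: ProbeHttpServerError    # hypothetical alert name (pascal case)
          annotations:
            description: "{{ $labels.instance }} has returned HTTP status {{ $value }} for 5 minutes."
            summary: "HTTP probe returns a server error"
          # Fires when a probed endpoint in the prod namespace answers with a 5xx status.
          expr: 'probe_http_status_code{namespace="prod"} >= 500'
          # Hypothetical wait time, chosen to avoid alerting on brief blips.
          for: 5m
          labels:
            severity: warning
```

A non-zero `for` value delays the alert until the condition has held continuously for that duration, which trades faster notification for fewer false alarms.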