Metrics

101

Prometheus scrapes the /metrics endpoints of applications running in the cluster. Metrics are numeric measurements; scraped over time, they form time series that describe how a given application is behaving.

Metrics use the following notation:

<metric name>{<label name>=<label value>, ...}

A gauge is a metric that represents a single numerical value that can go up and down arbitrarily. The following example gauge indicates that http://data.norge.no/ returns the HTTP status code 200:

probe_http_status_code{ingress="ingress-prod-v4", instance="http://data.norge.no/", namespace="prod"} 200

Another example is a counter metric, whose value may only increase or be reset to zero:

processed_mail_requests{fdk_service="fdk-mail-sender-service", status="success"} 7

Based on metrics, one can create rules that trigger alerts whenever an expression is true. One example is the expression probe_http_status_code{} >= 500, which indicates that an application is unable to respond correctly.

See Prometheus Introduction and Metric Types for a more thorough introduction.

Metric services

Service                                               Purpose
https://prometheus.fellesdatakatalog.digdir.no/rules  Alert rules overview
https://karma.fellesdatakatalog.digdir.no             Alerting dashboard
https://thanos.fellesdatakatalog.digdir.no            Explore and query metrics
https://alertmanager.fellesdatakatalog.digdir.no      See alerts and silence them
https://grafana.fellesdatakatalog.digdir.no           Dashboards based on metrics

Create metrics endpoint and expose metrics

See Metric and label naming for best practices.

Configure metrics scraping

Use the following pod annotations to configure scraping of metrics:

annotations:
    # Enable scraping of metrics.
    prometheus.io/scrape: "true"
    # Specifies metrics port. Default: container's port.
    prometheus.io/port: "8080"
    # Specifies metrics path. Default: "/metrics".
    prometheus.io/path: "/metrics"
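
The annotations belong on the pod itself, so in a Deployment they go under the pod template's metadata rather than on the Deployment's own metadata. The following is a minimal sketch of that placement; the name example-app, the image, and the port name http are hypothetical placeholders, not taken from any existing service.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-app            # hypothetical application name
spec:
  selector:
    matchLabels:
      app: example-app
  template:
    metadata:
      labels:
        app: example-app
      # Scrape annotations are set on the pod template,
      # not on the Deployment's top-level metadata.
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "8080"
        prometheus.io/path: "/metrics"
    spec:
      containers:
        - name: example-app
          image: example-app:latest   # hypothetical image
          ports:
            - name: http
              containerPort: 8080     # matches prometheus.io/port
```

Annotation values must be quoted strings, which is why "true" and "8080" are written in quotes.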

If you need more customization, such as a custom scrape interval, look into using a ServiceMonitor or PodMonitor resource instead.
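
As a rough illustration of the PodMonitor route, the sketch below sets a custom scrape interval. The name example-app, the prod namespace, the app label, and the port name http are hypothetical. The release label is an assumption that PodMonitors are discovered with the same label as the PrometheusRule resources described under "Create alert rules"; verify this against the cluster's Prometheus configuration.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: example-app            # hypothetical name
  namespace: prod              # hypothetical namespace
  labels:
    # Assumption: same discovery label as used for PrometheusRule resources.
    release: monitoring-kube-prometheus-stack
spec:
  selector:
    matchLabels:
      app: example-app         # must match the labels on the pods to scrape
  podMetricsEndpoints:
    - port: http               # name of the container port, not its number
      path: /metrics
      interval: 15s            # custom scrape interval
```

A ServiceMonitor has the same overall shape, but selects Services rather than pods and lists its scrape targets under spec.endpoints.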

Create alert rules

See Alerting Rules for how to configure rules.
Alerts fire in the #fdk-dev-alerts and #fdk-prod-alerts Slack channels.

PrometheusRule resources in the GitHub repo fdk-infra/infrastructure/base/alerts are used to configure alert rules, and are automatically synced into Prometheus running in the clusters. Remember to add any new files within the alerts folder to the kustomization.yaml resources list.

---
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: <rule name - kebab case>
  namespace: monitoring
  labels:
    release: monitoring-kube-prometheus-stack
spec:
  groups:
    - name: fdk
      rules:
        - alert: <alert name - pascal case>
          annotations:
            description: <alert description (shown in Slack)>
            summary: <alert title (shown in Slack)>
          expr: <alert condition (e.g. "up{} == 0")>
          for: <time to wait before alerting (e.g. "0s")>
          labels:
            severity: <severity (none|info|warning|error|critical)>
            dashboard_url: <link to grafana/kibana dashboard, if any>
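
As an illustration, the template above could be filled in as follows for the probe_http_status_code expression from the introduction. This is a sketch only: the rule name, alert name, 5m wait, warning severity, and annotation texts are hypothetical choices, not existing rules.

```yaml
---
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: probe-http-server-error        # hypothetical rule name (kebab case)
  namespace: monitoring
  labels:
    release: monitoring-kube-prometheus-stack
spec:
  groups:
    - name: fdk
      rules:
        - alert: ProbeHttpServerError  # hypothetical alert name (pascal case)
          annotations:
            # $labels and $value are filled in by Prometheus when the alert fires.
            description: "{{ $labels.instance }} has returned HTTP status {{ $value }} for more than 5 minutes."
            summary: "Probe target returns server errors"
          expr: probe_http_status_code{namespace="prod"} >= 500
          for: 5m                      # condition must hold this long before alerting
          labels:
            severity: warning
```

If this were saved as a new file in the alerts folder, for example under the hypothetical name probe-http-server-error.yaml, the file would also need an entry in the kustomization.yaml resources list:

```yaml
resources:
  - probe-http-server-error.yaml       # hypothetical file name
```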