Skip to content

Metrics

Note

In the documentation :

  • monitoring/caascad designates the grouping of tools for monitoring your K8s clusters by Caascad
  • monitoring/user designates the grouping of tools dedicated to your monitoring (applications, infra...)

Prometheus-operator

Prometheus operator is a Kubernetes operator. It is deployed on the Kubernetes cluster and it operates Prometheus.

Deployed in:

  • customer environment in caascad-monitoring namespace
  • administration environment

Prometheus

Prometheus is the monitoring tool. It scrapes everywhere it can. Then it stores the metrics and check rules for alerting. Prometheus is deployed as a Kubernetes CRD and Prometheus-operator deploys a statefulset.app and pods.

Deployed in:

  • each customer environment in caascad-monitoring namespace
  • administration environment in monitoring/caascad
  • administration environment in monitoring/user

Prometheus deployed in monitoring/caascad gathers K8s metrics of the deployed Prometheus in customer environment.

Prometheus deployed in monitoring/user gathers K8s metrics and application metrics of the deployed Prometheus in customer environment.

How to

To add application metrics in Prometheus monitoring/user see How to : add servicemonitor.

Important

All metrics in Prometheus deployed in the administration environment (monitoring/caascad and monitoring/user) have the following labels:

  • cc_prom_source: value is CLUSTER_NAME to identify customer environment
  • cc_prom: indicates which Alertmanager sends alerts (monitoring/caascad or monitoring/user)
  • cc_client: value is ZONE_NAME

Thanos

Thanos is the long term storage solution for Prometheus. Data are stored in a S3 bucket.

Thanos is made of many components:

  • a sidecar for Prometheus: gets the metrics of Prometheus and stores them in the S3 bucket.
  • a store gateway: be a gateway for all operations that need access to the S3 bucket (except initial storage : the sidecar does it)
  • a querier: listens for queries (be a datasource for Grafana) and requests all storage places (the sidecar and the store gateway)
  • a compactor: loads the data stored in S3
    • compacts it
    • generates downsampled data
    • stores the results back in S3

Note

Thanos has other components but they are not deployed.

A schema can be found on https://github.com/thanos-io/thanos.

Deployed in:

  • administration environment in monitoring/caascad
  • administration environment in monitoring/user

Each tool group has a specific metrics retention policy.

See:

Exporters

Exporters are sources of metrics. They are independant pieces of software with 2 functions:

  1. gather the metrics from where they are (system metrics, Postgresql, NGinx...)
  2. expose the metrics on a HTTP server

Prometheus scrapes them by connecting to the HTTP server of the exporters.

There are many possible exporters. Applications are supposed to deploy them. However, some basic exporters are deployed with the stack:

  • node exporter: get the system metrics of the operating system (CPU, mem...)
  • kubelet
  • ...

Several Prometheus exporters are deployed.

Node-exporter

Deployed in customer environment caascad-monitoring namespace.

See:

Kube-state-metrics

Deployed in customer environment caascad-monitoring namespace.

Allows to obtain metrics coming from Kubernetes objects like pods, configmaps...

See:

Blackbox-exporter

Deployed in customer environment in caascad-monitoring namespace.

Allows to supervise external services (in Blackbox-exporter there are called targets).

There are 4 probes:

  • http requests
  • tcp requests
  • dns requests
  • icmp

To add an external service to monitor, you have to:

  • define a target
  • define a module which uses one of the 4 probes

See:

S3-usage-exporter

Limitation

At the moment this feature is implemented only for Flexible Engine.

S3-usage-exporter is a Prometheus exporter developed by Caascad Team.

Deployed in administration environment.

Allows to obtains S3 bucket size and S3 bucket object number for:

  • docker registry storage backend
  • logs storage backend
  • metrics storage backend

Note

S3 metrics are only available in the Thanos datasource.

See:

Grafana

Deployed in administration environment.

Connected to:

  • Thanos deployed in monitoring/caascad, metrics visible with the Thanos datasource
  • Thanos deployed in monitoring/user, metrics visible with the Thanos-app datasource

Several dashboards are deployed by default:

  • General/Storage S3: to view S3 metrics
  • General/Blackbox-exporter: to view probe metrics
  • Kubernetes/...: to view K8s metrics (node-exporter, kubelet, kube-state-metrics ...).

See:

Alertmanager

The Alertmanager is a router for alerts. It listens for alerts (usually from Prometheus). Some routing rules are applied and the alerts are forwarded to other tools (like mails, sms...)

Deployed and connected to Prometheus in:

  • administration environment in monitoring/caascad
  • administration environment in monitoring/user

PrometheusRules

Several alerting rules are deployed by default in administration environment: managed by Prometheus/Alertmanager and used by Caascad team to monitor your environments (deployed in monitoring/caascad).

Several recording rules are deployed by default in administration environment: managed by Prometheus and used in Grafana K8s dashboards (deployed in monitoring/caascad).

How to

To write PrometheusRules in monitoring/user, see How to write a PrometheusRule.

Karma

See:

Karma is a user-friendly user interface for Alertmanager. It shows alerts and helps adding silences on alerts.

Deployed and connected to Alertmanager in monitoring/user.

See:

Kthxbye

Kthxbye is an application that allows you to create renewable silences.

Deployed and connected to Alertmanager in monitoring/user.

How to

To add a silence, see How to silence an alert.

Retention Policy

Thanos retains metric data as follows:

  • Data points with the same period as scrape interval at the moment of data ingestion, configured with raw retention.
  • Data points with a period of 5 minutes are available for x days, configured with 5m retention.
  • Data points with a period of 1 hour are available for y days, configured with 1h retention.

Note

When an administration environment is provisioned, customer specific retention period is applied in monitoring/user.

If no cutomer specific retention is specified, default retention period will be applied: 15 days for raw, 30 days for 5m, 60 days for 1h.

At any time this retention can be modified through Change Request made to Caascad customer support.

Default retention in monitoring/caascad: 15 days for raw, 92 days for 5m, 190 days for 1h.