CONTAINER MONITORING USING PROMETHEUS & GRAFANA

Introduction

In this project, I explored Docker container monitoring using Prometheus and Grafana. The main objective was to run a Rocket.Chat server in Docker containers and use cAdvisor and node_exporter containers to export metrics to a Prometheus/Grafana monitoring server.
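
For context, cAdvisor and node_exporter themselves ran as containers on the Docker host. A hypothetical docker-compose excerpt covering just those two exporters (the image names, ports, and mounts are my assumptions; the Rocket.Chat, MongoDB, and Nginx services are omitted) could look like this:

services:
  cadvisor:
    image: gcr.io/cadvisor/cadvisor:latest
    ports:
      - "8080:8080"                     # container metrics exposed on /metrics
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:ro
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro

  node_exporter:
    image: prom/node-exporter:latest
    ports:
      - "9100:9100"                     # host metrics exposed on /metrics
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    command:
      - '--path.procfs=/host/proc'
      - '--path.sysfs=/host/sys'
      - '--path.rootfs=/rootfs'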

Prometheus Installation:

First, I updated the instance and downloaded Prometheus (version 2.47.2) to the /opt directory. Here’s what I did:

- name: Update instance.
  apt:
    update_cache: true
    name: "*"          # upgrade all installed packages to the latest version
    state: latest

- name: Get the Prometheus pkg.
  unarchive:
    src: https://github.com/prometheus/prometheus/releases/download/v2.47.2/prometheus-2.47.2.linux-amd64.tar.gz
    dest: /opt
    remote_src: true
    validate_certs: yes	

Next, I created the Prometheus user and group.

- name: Create prometheus group.
  group:
    name: prometheus
    state: present
    system: yes

- name: Create prometheus user.
  user:
    name: prometheus
    group: prometheus   # primary group created above
    system: yes

I set up two essential directories, /etc/prometheus and /var/lib/prometheus, and ensured that their ownership was assigned to the prometheus user and group. Then, I moved the prometheus and promtool binaries from /opt/prometheus-2.47.2.linux-amd64 to /usr/bin.

- name: Create /etc/prometheus /var/lib/prometheus.
  file:
    path: "{{item}}"
    state: directory
    owner: prometheus
    group: prometheus
    mode: "0755"
    recurse: yes
  loop:
    - /etc/prometheus
    - /var/lib/prometheus

- name: Move prometheus and promtool to the /usr/bin/ directory.
  command: mv /opt/prometheus-2.47.2.linux-amd64/{{ item }} /usr/bin/
  with_items:
    - prometheus
    - promtool

I also moved the prometheus.yml file, as well as the consoles and console_libraries directories, to the /etc/prometheus directory.

- name: Move prometheus.yml file to the /etc/prometheus directory.
  command: mv /opt/prometheus-2.47.2.linux-amd64/prometheus.yml /etc/prometheus/prometheus.yml

- name: Move consoles and console_libraries files to the /etc/prometheus directory.
  command: mv /opt/prometheus-2.47.2.linux-amd64/{{ item }} /etc/prometheus/
  with_items:
    - consoles/
    - console_libraries/

Prometheus Service Configuration:

To complete the Prometheus setup, I created a systemd service unit file. Here’s the configuration I used.

[Unit]
Description=Prometheus
Wants=network-online.target
After=network-online.target

[Service]
User=prometheus
Group=prometheus
Type=simple
Restart=on-failure
RestartSec=5s
ExecStart=/usr/bin/prometheus \
    --config.file /etc/prometheus/prometheus.yml \
    --storage.tsdb.path /var/lib/prometheus/ \
    --web.console.templates=/etc/prometheus/consoles \
    --web.console.libraries=/etc/prometheus/console_libraries \
    --web.listen-address=0.0.0.0:9090 \
    --web.enable-lifecycle \
    --log.level=info

[Install]
WantedBy=multi-user.target

Then, I created the Prometheus service using this template and enabled it.

- name: Create Prometheus Service.
  template:
    src: roles/prometheus/file/prometheus.service
    dest: /lib/systemd/system/

- name: Enable and start the Prometheus service.
  ansible.builtin.systemd:
    name: prometheus.service
    enabled: yes
    state: started
    masked: no
    daemon_reload: yes
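
As a quick sanity check (not part of the original playbook), a task like the following could confirm that Prometheus answers on its built-in health endpoint, assuming the play runs on the monitoring host itself:

- name: Verify that Prometheus responds on its health endpoint.
  uri:
    url: http://localhost:9090/-/healthy
    status_code: 200
  register: prom_health
  retries: 3
  delay: 5
  until: prom_health.status == 200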

Alertmanager Installation:

For Alertmanager, the Ansible playbook followed a similar pattern to the Prometheus installation:

  • Downloaded the Alertmanager package.
  • Created the Alertmanager user and group.
  • Created the /etc/alertmanager directory and moved the alertmanager.yml file there.
  • Moved the Alertmanager and amtool binaries to /usr/bin/.
  • Created an alertmanager.service unit to manage the Alertmanager service (a minimal sketch is shown below).
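
I have not reproduced that unit file here, but a minimal alertmanager.service mirroring the Prometheus unit above (the alertmanager user and the /var/lib/alertmanager storage path are assumptions based on the same conventions) would look roughly like this:

[Unit]
Description=Alertmanager
Wants=network-online.target
After=network-online.target

[Service]
User=alertmanager
Group=alertmanager
Type=simple
Restart=on-failure
RestartSec=5s
ExecStart=/usr/bin/alertmanager \
    --config.file=/etc/alertmanager/alertmanager.yml \
    --storage.path=/var/lib/alertmanager/ \
    --web.listen-address=0.0.0.0:9093

[Install]
WantedBy=multi-user.target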

Grafana Installation:

The installation of Grafana was straightforward and accomplished in a few steps:

- name: Install some required utilities using apt.
  apt:
    name:
      - apt-transport-https
      - software-properties-common
    update_cache: yes 

- name: Add Grafana GPG key
  apt_key:
    url: https://apt.grafana.com/gpg.key
    state: present

- name: Add the Grafana “stable releases” repository.
  apt_repository:
    repo: "deb https://apt.grafana.com stable main"
    state: present

- name: Update package cache
  apt:
    update_cache: yes
    
- name: Install the open-source version of Grafana.
  apt:
    name: grafana
    state: present

- name: Enable and start the grafana-server service.
  service:
    name: grafana-server
    enabled: yes
    state: started

With this setup, both Prometheus and Grafana were ready to monitor Docker containers effectively.
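
To connect Grafana to Prometheus, a data source still has to be added, either through the Grafana UI or with a provisioning file. A minimal sketch of such a file (the path /etc/grafana/provisioning/datasources/prometheus.yml and the localhost:9090 URL are assumptions) is:

apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://localhost:9090
    isDefault: true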

Configuring Alertmanager and Prometheus

I already had Rocket.Chat, MongoDB, and Nginx containers running in Docker, and I had set up cAdvisor for container metrics and node_exporter for system metrics, with both exporting data to the Prometheus server.
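
On the Prometheus side, the corresponding scrape configuration in prometheus.yml would look roughly like this, assuming the default ports (8080 for cAdvisor, 9100 for node_exporter) and a placeholder address for the Docker host:

scrape_configs:
  - job_name: 'cadvisor'          # container-level metrics
    static_configs:
      - targets: ['<docker-host-ip>:8080']

  - job_name: 'node_exporter'     # host-level metrics
    static_configs:
      - targets: ['<docker-host-ip>:9100']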

On the monitoring server, I defined a rule file for Prometheus to trigger alerts based on specific conditions. I created two groups of rules:

groups:
  - name: docker_rules
    rules:
      - alert: HighCPUUsage-container
        expr: (rate(container_cpu_usage_seconds_total{container_label_com_docker_compose_service="rocketchat", cpu="cpu00"}[1m]) > 0.5) or (rate(container_cpu_usage_seconds_total{container_label_com_docker_compose_service="rocketchat", cpu="cpu01"}[1m]) > 0.5)
        for: 1m
        labels:
          severity: 'critical'
          service: '{{ $labels.container_label_com_docker_compose_service }}'
        annotations:
          summary: "High CPU usage in Docker container"
          description: "Container '{{ $labels.container_name }}' (instance '{{ $labels.instance }}') has high CPU usage."

  - name: server_rules
    rules:
      - alert: InstanceDown
        expr: up == 0
        for: 1m
        labels:
          severity: 'critical'
        annotations:
          summary: "Instance {{ $labels.instance }} down"
          description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 1 minute."

These rules help identify high CPU usage in the Rocket.Chat container and detect server instances that are down.
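
For Prometheus to evaluate this rule file and forward firing alerts to Alertmanager, prometheus.yml also needs a rule_files entry and an alerting block. A minimal sketch, assuming the rules are saved as /etc/prometheus/rules.yml and Alertmanager listens on its default port 9093 on the same host:

rule_files:
  - /etc/prometheus/rules.yml

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['localhost:9093']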

Next, I configured Alertmanager in the alertmanager.yml file to send notifications to a Slack channel whenever an alert is triggered in Prometheus. Here’s what I configured:

global:
  resolve_timeout: 1m
  slack_api_url: 'https://hooks.slack.com/services/<sc>'

route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'slack-notifications'

receivers:
- name: 'slack-notifications'
  slack_configs:
  - channel: '#alert'
    title: "{{ range .Alerts }}{{ .Annotations.summary }}\n{{ end }}"
    text: "{{ range .Alerts }}{{ .Annotations.description }}\n{{ end }}"

In the Alertmanager configuration, I specified how to group alerts, the intervals for handling them, and the Slack notification settings. When an alert is triggered, it will send a Slack message to the specified channel with a summary and description. Additionally, it will send a resolved notification once the alert condition is no longer met.
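
One detail not shown above: for slack_configs, send_resolved defaults to false, so resolved notifications only reach Slack if the receiver enables them explicitly, for example:

receivers:
- name: 'slack-notifications'
  slack_configs:
  - channel: '#alert'
    send_resolved: true   # Slack notifications for resolved alerts are off by default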

The results:

I used JMeter to simulate a load of 50 users.

My Grafana dashboards displayed the data collected by node_exporter and cAdvisor.

The Rocket.Chat server’s CPU usage rose sharply: the orange line showed the Rocket.Chat container using more than 50% of the CPU, which had already triggered the alert.

The “HighCPUUsage-container” alert was in the “Firing” state.

I received messages in my Slack “#alert” channel.