Introduction
In this project, I explored Docker container monitoring using Prometheus and Grafana. The main objective was to run a Rocket.Chat server in Docker containers and use cAdvisor and node_exporter containers to export metrics to a Prometheus/Grafana monitoring server.
Prometheus Installation:
First, I updated the instance and downloaded Prometheus (version 2.47.2) to the /opt directory. Here’s what I did:
- name: Update instance.
  apt:
    name: "*"
    state: latest
    update_cache: true
- name: Get the Prometheus pkg.
  unarchive:
    src: https://github.com/prometheus/prometheus/releases/download/v2.47.2/prometheus-2.47.2.linux-amd64.tar.gz
    dest: /opt
    remote_src: true
    validate_certs: yes
Next, I created the Prometheus user and group.
- name: Create prometheus group.
  group:
    name: prometheus
    state: present
    system: yes
- name: Create prometheus user.
  user:
    name: prometheus
    groups: prometheus
    system: yes
I set up two essential directories, /etc/prometheus and /var/lib/prometheus, and ensured that their ownership was correctly assigned to the prometheus user and group. Then, I moved the prometheus and promtool binaries from /opt/prometheus-2.47.2.linux-amd64 to /usr/bin.
- name: Create /etc/prometheus /var/lib/prometheus.
  file:
    path: "{{ item }}"
    state: directory
    owner: prometheus
    group: prometheus
    mode: "0755"
    recurse: yes
  loop:
    - /etc/prometheus
    - /var/lib/prometheus
- name: Move prometheus and promtool to the /usr/bin/ directory.
  command: mv /opt/prometheus-2.47.2.linux-amd64/{{ item }} /usr/bin/
  with_items:
    - prometheus
    - promtool
I also moved the prometheus.yml file, as well as the consoles and console_libraries, to the /etc/prometheus directory.
- name: Move prometheus.yml file to the /etc/prometheus directory.
  command: mv /opt/prometheus-2.47.2.linux-amd64/prometheus.yml /etc/prometheus/prometheus.yml
- name: Move consoles and console_libraries files to the /etc/prometheus directory.
  command: mv /opt/prometheus-2.47.2.linux-amd64/{{ item }} /etc/prometheus/
  with_items:
    - consoles/
    - console_libraries/
Prometheus Service Configuration:
To complete the Prometheus setup, I created a systemd service unit file. Here’s the configuration I used.
[Unit]
Description=Prometheus
Wants=network-online.target
After=network-online.target

[Service]
User=prometheus
Group=prometheus
Type=simple
Restart=on-failure
RestartSec=5s
ExecStart=/usr/bin/prometheus \
  --config.file /etc/prometheus/prometheus.yml \
  --storage.tsdb.path /var/lib/prometheus/ \
  --web.console.templates=/etc/prometheus/consoles \
  --web.console.libraries=/etc/prometheus/console_libraries \
  --web.listen-address=0.0.0.0:9090 \
  --web.enable-lifecycle \
  --log.level=info

[Install]
WantedBy=multi-user.target
Then, I created the Prometheus service using this template and enabled it.
- name: Create Prometheus Service.
  template:
    src: roles/prometheus/file/prometheus.service
    dest: /lib/systemd/system/
- name: Enable and start the Prometheus service.
  ansible.builtin.systemd:
    name: prometheus.service
    enabled: yes
    state: started
    masked: no
    daemon_reload: yes
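As an optional sanity check (not part of the original playbook), a task like the following could confirm that Prometheus answers on its built-in health endpoint; the host and port are assumptions matching the --web.listen-address above.
- name: Verify that Prometheus responds on its health endpoint (assumed host/port).
  uri:
    url: http://localhost:9090/-/healthy
    status_code: 200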
Alertmanager Installation:
For Alertmanager, the Ansible playbook followed a similar pattern to the Prometheus installation (a condensed sketch follows the list):
- Downloaded the Alertmanager package.
- Created the Alertmanager user and group.
- Created the /etc/alertmanager directory and moved the alertmanager.yml file there.
- Moved the Alertmanager and amtool binaries to /usr/bin/.
- Created an alertmanager.service to manage the Alertmanager service.
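For reference, here is a condensed sketch of what those tasks could look like; the release version (v0.26.0) and paths are assumptions, mirroring the Prometheus tasks above.
- name: Get the Alertmanager pkg.
  unarchive:
    src: https://github.com/prometheus/alertmanager/releases/download/v0.26.0/alertmanager-0.26.0.linux-amd64.tar.gz  # assumed version
    dest: /opt
    remote_src: true
- name: Move alertmanager and amtool to the /usr/bin/ directory.
  command: mv /opt/alertmanager-0.26.0.linux-amd64/{{ item }} /usr/bin/
  with_items:
    - alertmanager
    - amtool
- name: Enable and start the Alertmanager service.
  ansible.builtin.systemd:
    name: alertmanager.service
    enabled: yes
    state: started
    daemon_reload: yes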
Grafana Installation:
The installation of Grafana was straightforward and accomplished in a few steps:
- name: Install some required utilities using apt.
  apt:
    name:
      - apt-transport-https
      - software-properties-common
    update_cache: yes
- name: Add Grafana GPG key.
  apt_key:
    url: https://apt.grafana.com/gpg.key
    state: present
- name: Add the Grafana "stable releases" repository.
  apt_repository:
    repo: "deb https://apt.grafana.com stable main"
    state: present
- name: Update package cache.
  apt:
    update_cache: yes
- name: Install the open-source version of Grafana.
  apt:
    name: grafana
    state: present
- name: Enable grafana.service.
  service:
    name: grafana-server
    enabled: yes
    state: started
With this setup, both Prometheus and Grafana were ready to monitor Docker containers effectively.
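One step not shown in the playbooks is pointing Grafana at Prometheus as a data source. This can be done in the Grafana UI, or provisioned from a file; below is a minimal sketch, assuming Prometheus listens locally on port 9090 (the file path and data source name are illustrative).
# Illustrative provisioning file: /etc/grafana/provisioning/datasources/prometheus.yml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://localhost:9090
    isDefault: true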
Configuring Alertmanager and Prometheus
I already had Rocket.Chat, MongoDB, and Nginx containers running in Docker, and I had set up cAdvisor for container metrics and node_exporter for system metrics, with both exporting data to the Prometheus server.
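For completeness, the scrape configuration on the monitoring server would look roughly like the sketch below; the job names are illustrative, and the targets assume the exporters' default ports (9100 for node_exporter, 8080 for cAdvisor) on the Docker host.
scrape_configs:
  - job_name: 'node_exporter'
    static_configs:
      - targets: ['<docker-host-ip>:9100']  # node_exporter default port (assumed)
  - job_name: 'cadvisor'
    static_configs:
      - targets: ['<docker-host-ip>:8080']  # cAdvisor default port (assumed)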
In the Monitoring Server, I defined a rule file for Prometheus to trigger alerts based on specific conditions. I created two groups of rules:
groups:
  - name: docker_rules
    rules:
      - alert: HighCPUUsage-container
        expr: (rate(container_cpu_usage_seconds_total{container_label_com_docker_compose_service="rocketchat", cpu="cpu00"}[1m]) > 0.5) or (rate(container_cpu_usage_seconds_total{container_label_com_docker_compose_service="rocketchat", cpu="cpu01"}[1m]) > 0.5)
        for: 1m
        labels:
          severity: 'critical'
          service: '{{ $labels.container_label_com_docker_compose_service }}'
        annotations:
          summary: "High CPU usage in Docker container"
          description: "Container '{{ $labels.container_name }}' (instance '{{ $labels.instance }}') has high CPU usage."
  - name: server_rules
    rules:
      - alert: InstanceDown
        expr: up == 0
        for: 1m
        labels:
          severity: 'critical'
        annotations:
          summary: "Instance {{ $labels.instance }} down"
          description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 1 minute."
These rules help identify high CPU usage in the Rocket.Chat container and detect server instances that are down.
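For Prometheus to load these rules and forward firing alerts to Alertmanager, prometheus.yml also needs rule_files and alerting entries. Here is a minimal sketch, assuming the rule file above is saved as /etc/prometheus/rules.yml and Alertmanager listens on its default port 9093.
rule_files:
  - /etc/prometheus/rules.yml  # assumed file name for the rule groups above
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['localhost:9093']  # Alertmanager default port (assumed)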
Next, I configured Alertmanager in the alertmanager.yml file to send notifications to a Slack channel whenever an alert is triggered in Prometheus. Here’s what I configured:
global:
  resolve_timeout: 1m
  slack_api_url: 'https://hooks.slack.com/services/<sc>'
route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'slack-notifications'
receivers:
  - name: 'slack-notifications'
    slack_configs:
      - channel: '#alert'
        title: "{{ range .Alerts }}{{ .Annotations.summary }}\n{{ end }}"
        text: "{{ range .Alerts }}{{ .Annotations.description }}\n{{ end }}"
In the Alertmanager configuration, I specified how to group alerts, the intervals for handling them, and the Slack notification settings. When an alert is triggered, it will send a Slack message to the specified channel with a summary and description. Additionally, it will send a resolved notification once the alert condition is no longer met.
The results:
I used JMeter to simulate a load of 50 users.
My Grafana dashboards displayed the data collected by node_exporter and cAdvisor.
I received messages in my Slack “#alert” channel.