Overview
We use a set of services to monitor the state of NDIP and trigger alerts when problems occur. We monitor system services, Docker containers, disk space, web services, system tests, and tool tests.
The following services provide the monitoring:
Service | Host | Configuration |
---|---|---|
Node Exporters | on each VM we deploy | https://github.com/neutrons/post_processing_agent |
Prometheus Stack | prometheus_push_gateway | Ansible playbook |
Slack | slack.com | Ansible playbook |
Further details about each service are provided in the corresponding subsections.
What is monitored
Metric | Source |
---|---|
Systemd services | Node Exporter |
Disk space | Node Exporter |
Docker response time & number of running containers | Push Gateway, Docker metrics |
Web services | Black Box |
System tests | Push Gateway |
Tool tests | Push Gateway |
See the Alert rules for more details.
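As a rough illustration of how a test result could reach the Push Gateway, the sketch below sends a single gauge using the Pushgateway HTTP API (a `PUT` to `/metrics/job/<job>/<label>/<value>` with a body in the Prometheus text exposition format). It is not the actual NDIP implementation: the metric name `ndip_test_success`, the job name `system_tests`, the `test` label, and the gateway address are all hypothetical and chosen only for this example.

```python
import urllib.request
from urllib.parse import quote


def build_push_url(gateway: str, job: str, labels: dict[str, str]) -> str:
    """Build a Pushgateway URL of the form /metrics/job/<job>/<label>/<value>.

    Assumes label values contain no '/' characters.
    """
    parts = [gateway.rstrip("/"), "metrics", "job", quote(job, safe="")]
    for name, value in labels.items():
        parts += [quote(name, safe=""), quote(value, safe="")]
    return "/".join(parts)


def build_payload(metric: str, value: float, help_text: str) -> bytes:
    """Render one gauge in the Prometheus text exposition format."""
    lines = [
        f"# HELP {metric} {help_text}",
        f"# TYPE {metric} gauge",
        f"{metric} {value}",
    ]
    # The body must end with a newline.
    return ("\n".join(lines) + "\n").encode("utf-8")


def push_test_result(gateway: str, test_name: str, passed: bool) -> int:
    """Push 1 (passed) or 0 (failed) for one test and return the HTTP status."""
    # Job, label, and metric names here are hypothetical.
    url = build_push_url(gateway, "system_tests", {"test": test_name})
    body = build_payload(
        "ndip_test_success",
        1 if passed else 0,
        "1 if the last run passed, 0 otherwise",
    )
    # PUT replaces all metrics previously pushed under the same job and labels.
    request = urllib.request.Request(url, data=body, method="PUT")
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status


# Example call (hypothetical address; 9091 is the Pushgateway default port):
# push_test_result("http://localhost:9091", "login_flow", passed=True)
```

Pushing a plain 0/1 gauge keeps the matching alert rule simple, since it only needs to check whether the value is 0 for a given test.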