Troubleshooting NDIP Platform
Identifying Problems
Slack Channels
Various aspects of the system, including system services, file systems, and URL endpoints, are monitored by the Prometheus software stack. Alarms are reported to our Slack channels (ndip-status-prod and ndip-status-test). Each alarm includes a description of the problem and the IP address of the node where it occurred.
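If you want to see what is currently firing without waiting for Slack, you can also query Prometheus or Alertmanager directly; the host names and ports below are placeholders for whatever the NDIP monitoring stack actually exposes.
curl -s http://prometheus-host:9090/api/v1/alerts | python3 -m json.tool     # alerts currently known to Prometheus
curl -s http://alertmanager-host:9093/api/v2/alerts | python3 -m json.tool   # alerts held by Alertmanager, including silenced ones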
Automatic GitLab Tests
- System Tests: We have system tests that run a set of tools on all compute resources every two hours. Failures are reported in the Slack channel, including the name of the failing test and a link to the corresponding GitLab pipeline.
- Nightly Tests: Nightly tests for our tools are also conducted. Failures are sent via email by GitLab to the pipeline owner (currently Sergey).
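If a notification seems to be missing, or you want to check pipeline status yourself, the GitLab API can list recent pipelines; the GitLab host, project ID, and token below are placeholders for the actual NDIP project.
curl -s --header "PRIVATE-TOKEN: <your_token>" "https://gitlab-host/api/v4/projects/<project_id>/pipelines?status=failed&per_page=5" | python3 -m json.tool   # five most recent failed pipelines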
Manual Tests
You can quickly run a tool manually from calvera/calvera-test to verify that it works as expected. This is especially useful for interactive tools, since they cannot be tested effectively with automated methods.
Resolving Problems
The resolution steps depend on the type of problem. Below are some common scenarios and solutions.
Most issues arise from infrastructure problems, such as unavailability, restarts, disk space issues, or network problems. Some issues resolve on their own when the infrastructure recovers, while others require manual intervention.
To log in to calvera(-test).ornl.gov, use the appropriate private SSH key, available here. If you need to log in to a Pulsar node, see the Pulsar administrator notes.
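For example, assuming the key has been saved under ~/.ssh/ (the key file name and user account below are placeholders):
ssh -i ~/.ssh/ndip_calvera_key cloud-user@calvera-test.ornl.gov   # key file and user name are placeholders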
Disk Full
We have automated clean-up processes, but they do not cover all cases. To address disk space issues, log in to the affected node and identify the cause.
Useful commands:
df -h # Check mounted folders and their capacity
du -h -d 1 somefolder # Identify large folders/files and clean them up if possible
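A useful pattern is to sort the per-directory sizes so the biggest offenders stand out; the /srv path below is only an example, start from whatever mount point df -h shows as full.
du -h -d 1 /srv 2>/dev/null | sort -h | tail -10   # ten largest entries under /srv
journalctl --vacuum-size=500M                      # optionally shrink the systemd journal if it is the culprit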
If the issue involves CEPH, you can increase the CEPH share size "on the fly" (see the example after this list) using:
- OpenStack
- Or (preferably) GitLab CI/CD.
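A minimal sketch of the OpenStack route, assuming the Manila (shared file system) CLI and credentials are available; the share name and size below are examples. The GitLab CI/CD route is preferred because it keeps the size under version control.
manila list                         # find the share and its current size
manila extend ceph_share_test 600   # grow the share to 600 GiB (share name and size are examples)
df -h                               # confirm the new size is visible on the node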
Service Down
To resolve service outages:
- Log in to the node.
- Check the status of the service, review logs, and restart if necessary.
Useful commands:
systemctl status service_name     # check whether the service is running and see recent log lines
cat /var/log/log_file_name        # review the service log
systemctl restart service_name    # restart the service if needed
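Many services log to the systemd journal, so journalctl is often more convenient than the raw log file; the service name below is an example.
journalctl -u galaxy --since "1 hour ago"   # recent journal entries for a service
journalctl -u galaxy -f                     # follow the journal live while reproducing the problem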
No Access to Node
If a node is inaccessible:
- Log in to OpenStack and check the node's status.
- If the node is in the "stopped" state, start it again.
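The same check can be done from the command line, assuming the OpenStack CLI is installed and your credentials are sourced; the server name below is an example.
openstack server list                  # look at the Status column for the affected node
openstack server show calvera-test     # detailed state of a single node (name is an example)
openstack server start calvera-test    # start it again if it is stopped (SHUTOFF)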
No Access to a Filesystem
If a filesystem is inaccessible:
- Log in to the node.
- Verify whether the filesystem is mounted. If not, remount it.
Useful commands:
df -h                   # check whether the filesystem is mounted
umount path_to_folder   # unmount a stale mount point first, if needed
mount -a                # remount everything listed in /etc/fstab
- SNS/HFIR Filesystems: Check for permission issues.
- CEPH Filesystems: Review error messages. It is generally safe to delete CEPH mounts used for Docker, but not the data mounts (ceph_share_test, ceph_share).
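To see which CEPH mounts are present and whether the kernel has reported problems with them, the following non-destructive checks can help:
mount -t ceph                     # list active CEPH mounts
dmesg | grep -i ceph | tail -20   # recent kernel messages about CEPH, e.g. lost MDS connections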
Galaxy Test Failures
To address failures in Galaxy tests:
Check for infrastructure alerts in Slack.
Log in to calvera(-test) and:
- Check Galaxy logs at /srv/galaxy/var/log/.
- Verify the status of services (galaxy, rdb, rabbitmq-server, postgresql@14-main).
- Restart services if needed.
- Check mounted systems.
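For example, a quick status sweep over those services and a look at the newest Galaxy logs might look like this; the log file name is an example, so pick the most recent relevant file in the directory.
for s in galaxy rdb rabbitmq-server postgresql@14-main; do systemctl status "$s" --no-pager | head -5; done
ls -lt /srv/galaxy/var/log/ | head           # newest log files first
tail -100 /srv/galaxy/var/log/gunicorn.log   # file name is an example
sudo systemctl restart galaxy                # restart a service if needed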
Log in to the Pulsar node where the failing test ran:
- Check logs in /var/log/pulsar or /var/log/pulsar_test.
- Verify services (pulsar, and rdb on sns-ndip-gpu.sns.gov).
- Confirm that systems are mounted correctly.
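A similar sweep on a Pulsar node, assuming the production instance (use /var/log/pulsar_test for the test instance), might be:
systemctl status pulsar --no-pager   # the Pulsar service itself
systemctl status rdb --no-pager      # only relevant on sns-ndip-gpu.sns.gov
ls -lt /var/log/pulsar/ | head       # newest Pulsar logs first
df -h                                # confirm the expected filesystems are mounted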
Authorization Problems
We support two OIDC providers: AzureAD and PingFed. If either of them has a problem, logins to Galaxy and token updates may fail.
- For PingFed: create a support ticket.
- For AzureAD: try to log in here and check whether everything looks OK (expired certificates, etc.). Greg Watson and Greg Cage should have access there.
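As a quick sanity check on the AzureAD side, you can fetch the provider's OIDC discovery document and confirm that the token endpoint and signing keys (jwks_uri) are reachable; the tenant ID below is a placeholder.
curl -s https://login.microsoftonline.com/<tenant-id>/v2.0/.well-known/openid-configuration | python3 -m json.tool   # tenant ID is a placeholder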