Troubleshooting NDIP Platform

Identifying Problems

Slack Channels

Various aspects of the system—including system services, file systems, and URL endpoints—are monitored by the Prometheus software stack. Alarms are reported to our Slack channels (ndip-status-prod and ndip-status-test). Each alarm includes a description of the problem and the IP address of the node where it occurred.

Automatic GitLab Tests

  • System Tests: We have system tests that run a set of tools on all compute resources every two hours. Failures are reported in the Slack channel, including the name of the failing test and a link to the corresponding GitLab pipeline.
  • Nightly Tests: Nightly tests for our tools are also conducted. Failures are sent via email by GitLab to the pipeline owner (currently Sergey).

Manual Tests

You can quickly run a tool manually from calvera/calvera-test to verify that it works as expected. This is especially useful for interactive tools since they cannot be tested effectively with automated methods.

Resolving Problems

The resolution steps depend on the type of problem. Below are some common scenarios and solutions.

note

Most issues arise from infrastructure problems, such as unavailability, restarts, disk space issues, or network problems. Some issues resolve on their own when the infrastructure recovers, while others require manual intervention.

Accessing Nodes

To log in to calvera(-test).ornl.gov, use the appropriate private SSH key, available here. If you need to log in to a Pulsar node, see the Pulsar administrator notes.
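
A typical login might look like the following (a sketch; the key path and user name are placeholders, not values documented here):

ssh -i ~/.ssh/ndip_key your_user@calvera-test.ornl.gov   # Log in to the test node with the project's private key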

Disk Full

We have automated clean-up processes, but they do not cover all cases. To address disk space issues, log in to the affected node and identify the cause.

Useful commands:

df -h                  # Check mounted folders and their capacity  
du -h -d 1 somefolder # Identify large folders/files and clean them up if possible
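
To quickly locate what is consuming space, the two commands can be combined (a sketch; /some/folder is a placeholder):

du -xh -d 1 /some/folder 2>/dev/null | sort -h | tail -n 10   # Ten largest first-level subdirectories, sorted by size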

If the issue involves CEPH, you can increase the CEPH share size "on the fly".
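
The exact command is not reproduced on this page; assuming the shares are provisioned through OpenStack Manila (an assumption, not confirmed here), resizing typically looks like:

manila extend ceph_share_test 500   # Grow the named share to 500 GiB; requires project credentials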

Service Down

To resolve service outages:

  1. Log in to the node.
  2. Check the status of the service, review logs, and restart if necessary.

Useful commands:

systemctl status service_name   # Check whether the service is running
cat /var/log/log_file_name      # Review the relevant log file
systemctl restart service_name  # Restart the service if needed
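
For services managed by systemd, journalctl is a convenient alternative to reading log files directly (service_name is a placeholder):

journalctl -u service_name -n 100 --no-pager   # Show the last 100 log lines for the service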

No Access to Node

If a node is inaccessible:

  1. Log in to OpenStack and check the node's status.
  2. If the node is in the "stopped" state, start it again.
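
If you have the OpenStack command-line client configured for the project (an assumption; the web dashboard works just as well), the same check can be done from a terminal:

openstack server list              # List instances and their current status
openstack server start node_name   # Start a stopped instance (node_name is a placeholder)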

No Access to a Filesystem

If a filesystem is inaccessible:

  1. Log in to the node.
  2. Verify that the filesystem is mounted. If not, remount it.

Useful commands:

df -h                  # Check which filesystems are mounted
umount path_to_folder  # Unmount the stale mount point
mount -a               # Remount everything listed in /etc/fstab

  • SNS/HFIR Filesystems: Check for permission issues.
  • CEPH Filesystems: Review error messages. It is generally safe to delete CEPH mounts used for Docker, but not the data mounts (ceph_share_test, ceph_share).
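
To confirm whether a given path is actually mounted before remounting, findmnt can help (path_to_folder is a placeholder):

findmnt path_to_folder   # Shows the backing mount if present; exits non-zero when the path is not mounted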

Galaxy Test Failures

To address failures in Galaxy tests:

  1. Check for infrastructure alerts in Slack.

  2. Log in to calvera(-test) and:

    • Check Galaxy logs at /srv/galaxy/var/log/.
    • Verify the status of services (galaxy, rdb, rabbitmq-server, postgresql@14-main).
    • Restart services if needed.
    • Check mounted systems (a combined check for this step is sketched after the list).
  3. Log in to the Pulsar node where the failing test ran:

    • Check logs in /var/log/pulsar or /var/log/pulsar_test.
    • Verify services (pulsar, rdb on sns-ndip-gpu.sns.gov).
    • Confirm that systems are mounted correctly.
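
For step 2, a combined status check on calvera(-test) might look like the following (a sketch; the wildcard log path assumes standard file names under /srv/galaxy/var/log/):

systemctl status galaxy rdb rabbitmq-server postgresql@14-main   # Check all Galaxy-related services at once
tail -n 100 /srv/galaxy/var/log/*.log                            # Review the most recent Galaxy log entries
df -h                                                            # Confirm the expected filesystems are mounted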

Authorization Problems

We support two OIDC providers: AzureAD and PingFed. If either provider has problems, logging in to Galaxy and token updates may fail.

  • For PingFed: create a support ticket.
  • For AzureAD: try to log in here and check that everything looks OK (expired certificates, etc.). Greg Watson and Greg Cage should have access there.
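
One quick command-line check is the TLS certificate on the provider endpoint itself (a sketch; login.example.org is a placeholder, and this is separate from the application certificates visible in the AzureAD portal):

echo | openssl s_client -connect login.example.org:443 -servername login.example.org 2>/dev/null | openssl x509 -noout -dates   # Print the certificate validity window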