January 3, 2018•blog
Here are some tips and tricks on configuring a GPU cluster, based on my experience. These are not exhaustive.
Use a configuration deployment language, like Ansible, Chef, Puppet, SaltStack, etc. I recommend Ansible for its shallow learning curve and configurability.
There is an initial set-up cost that is worth paying: expect writing the initial configuration to take about two weeks, especially if you test in virtual machines.
This is an immensely valuable step because once the configuration is created, subsequent compute machines can be set up in as little as twenty minutes. As configuration options remain similar between OS versions, you can even use this to easily upgrade and downgrade between major versions of your operating system.
Don’t use passwords. Take these measures so that passwords are only ever needed for emergency recovery.
Disable password logins over SSH (PasswordAuthentication and ChallengeResponseAuthentication) to prevent people logging in without an SSH key.
Fetch your users’ public SSH keys from https://github.com/username.keys.
Allow passwordless sudo commands by replacing %sudo ALL=.* with %sudo ALL=(ALL) NOPASSWD: ALL, using visudo.
Don’t give your users sudo access. Don’t install packages in the global Python namespace. Instead, provide your users with anaconda, pyenv, or python3-venv to install or compile local builds of Python; this lets users maintain their own Python installations and conda/pip libraries.
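To make the SSH password settings mentioned above concrete, here is a minimal sketch of a helper that rewrites the text of an sshd_config file so that both options are set to "no". The function name disable_password_logins is hypothetical, and in practice you would probably express this as an Ansible task instead; the sketch only illustrates the intended end state.

```python
# The two sshd_config options named in the text; both get forced to "no".
OPTIONS = ("PasswordAuthentication", "ChallengeResponseAuthentication")


def disable_password_logins(config_text: str) -> str:
    """Return sshd_config text with password-based logins disabled.

    Hypothetical helper for illustration: existing settings for the two
    options (including commented-out ones) are dropped, and explicit
    "no" settings are placed at the top of the file.
    """
    wanted = {opt.lower() for opt in OPTIONS}
    kept = []
    for line in config_text.splitlines():
        parts = line.lstrip("# \t").split()
        key = parts[0].lower() if parts else ""
        if key in wanted:
            continue  # drop the old setting; we add our own below
        kept.append(line)
    # sshd uses the first value it obtains for a keyword, so ours go first
    # (which also keeps them outside any Match block).
    forced = [f"{opt} no" for opt in OPTIONS]
    return "\n".join(forced + kept) + "\n"
```

After writing the result back, validate the file with `sshd -t` before reloading the daemon, and keep an existing session open in case the new configuration locks you out.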
Some service accounts require special privileges: for example, the backup agent’s account only needs to read data and should not be able to write anything on the bastion. We do this by authorizing particular command strings in the sudoers file.
To minimize the attack surface, ensure that the bastion host does not have credentials to connect to the backup server; only the backup server should be able to connect to the bastion. This prevents ransomware attacks!
The simplest security model offered by NFS is sufficient for most uses, as long as you have relatively few users. If you assign each user the same username, uid and gid across all machines (trivial to do with Ansible), NFS will correctly apply the default Linux access controls.
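Since this model depends on identical uid and gid values everywhere, it can be worth checking them once in a while. The following is a minimal sketch, assuming you have already collected the contents of /etc/passwd from each machine (for example with Ansible); the function names and host names are hypothetical.

```python
from collections import defaultdict


def parse_passwd(text):
    """Map username -> (uid, gid) from /etc/passwd-formatted text."""
    ids = {}
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split(":")
        ids[fields[0]] = (int(fields[2]), int(fields[3]))
    return ids


def find_mismatches(passwd_by_host):
    """Return users whose (uid, gid) pair differs between hosts.

    passwd_by_host maps a host name to the text of its /etc/passwd.
    Users that exist on only some hosts are not reported.
    """
    seen = defaultdict(set)
    for text in passwd_by_host.values():
        for user, pair in parse_passwd(text).items():
            seen[user].add(pair)
    return {user: sorted(pairs) for user, pairs in seen.items() if len(pairs) > 1}


if __name__ == "__main__":
    hosts = {
        "bastion": "alice:x:1001:1001::/home/alice:/bin/bash\n",
        "gpu01": "alice:x:1005:1005::/home/alice:/bin/bash\n",
    }
    print(find_mismatches(hosts))
```

An empty result means every user that appears on more than one host has the same ids on all of them.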
Use ufw or a similar easy-to-configure firewall. If you configure it with Ansible, you can bundle the firewall rules for each task with the task itself, so you can guarantee that you do not have exposed ports.
During acceptance testing, ensure that outside users cannot access inside-only ports, for example, NFS.
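One way to run this check is a small script executed from a machine outside the cluster. This is a minimal sketch that only covers TCP ports; the host name in the usage example is a placeholder, and 2049 is the usual NFS port, which you should adjust to your setup.

```python
import socket


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, filtered (timed out), or unreachable: treat as closed.
        return False


def exposed_ports(host, ports, timeout=2.0):
    """Return the subset of ports that accept TCP connections."""
    return [p for p in ports if port_is_open(host, p, timeout)]


if __name__ == "__main__":
    # Hypothetical host name; run this from outside your network.
    leaked = exposed_ports("bastion.example.org", [2049])
    print("exposed inside-only ports:", leaked)
```

A port that is dropped by the firewall shows up as a timeout, so the check takes up to `timeout` seconds per closed port.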
Ensure that your internal network is also firewalled; only allow each compute server to communicate with the bastion. (This needs to be set up in your managed switch.)
There are a few main types of questions that people ask about your cluster. The purpose of condition monitoring is to quantitatively answer these questions:
As a rule of thumb, it is better to collect more data than less, and to retain that data for longer than necessary.
I recommend using Prometheus to collect and hold statistics (increase the retention to a year or more!). You can visualize these with Grafana. (Avoid the package manager version of Grafana; it is often out of date.)
At minimum, use this script to report statistics to Prometheus.
Ensure that you get alerts if machines go down, your backup job fails, or other exceptions occur. Nagios is the industry standard, but I prefer a custom Slack bot to handle this.
In addition to simple condition statistics, you should explicitly track jobs running on GPUs. A custom-written GPU dashboard graphing the current availability of GPUs is a day’s work at most.
After its introduction in my research group, this feature was among the most popular.
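As a starting point for such a dashboard, here is a minimal sketch that reads GPU state from nvidia-smi's CSV query output. This is not the dashboard described above; the "free" heuristic and its 100 MiB threshold are assumptions, as are the function names.

```python
import subprocess

# Ask nvidia-smi for one CSV row per GPU, without header or units.
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,utilization.gpu,memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_gpu_csv(text):
    """Parse rows like '0, 97, 10240, 11178' into dictionaries.

    Assumes every field is numeric; some GPUs report values such as
    '[N/A]' for certain fields, which this sketch does not handle.
    """
    gpus = []
    for line in text.strip().splitlines():
        index, util, used, total = (int(f.strip()) for f in line.split(","))
        gpus.append(
            {"index": index, "util_pct": util, "mem_used_mib": used, "mem_total_mib": total}
        )
    return gpus


def free_gpus(gpus, max_mem_used_mib=100):
    """Return indices of GPUs that look idle (threshold is a guess)."""
    return [g["index"] for g in gpus if g["mem_used_mib"] <= max_mem_used_mib]


def query_gpus():
    """Run nvidia-smi on this machine and parse its output."""
    out = subprocess.run(QUERY, check=True, capture_output=True, text=True).stdout
    return parse_gpu_csv(out)
```

Calling `free_gpus(query_gpus())` on each compute machine at a regular interval gives you the availability series to graph.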
Forward your syslog from the compute machines to the bastion so that you can trace what happened when a machine fails. You can do this by writing the following to /etc/rsyslog.d/70-network.conf, permitting port udp:514 in your firewall, and reloading rsyslog:
# Rules for rsyslog.
# Log by facility.
auth,authpriv.* @.local
cron.* @.local
kern.* @.local
user.* @.local
*.emerg @.local
Run smartd automatically; schedule a short test daily and a long test weekly. Configure an alert for this.
Backups are very easy to get wrong!
Use rsync or similar to verify your backups.
Use a bug tracking system to track system upgrades, open issues, etc. GitHub’s built-in tracker is more than sufficient for this, especially with the new (in 2019) Projects board feature.
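Returning to backup verification: if you want a check that does not depend on rsync, a checksum comparison of the source tree against a restored copy is one option. This is a minimal sketch with hypothetical function names. It reads each file fully into memory, so for large datasets you would hash in chunks instead.

```python
import hashlib
from pathlib import Path


def tree_digests(root):
    """Map each file's path (relative to root) to its SHA-256 digest."""
    root = Path(root)
    digests = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            rel = path.relative_to(root).as_posix()
            digests[rel] = hashlib.sha256(path.read_bytes()).hexdigest()
    return digests


def verify_backup(source, backup):
    """Return (missing, changed): files absent from, or different in, the backup."""
    src, bak = tree_digests(source), tree_digests(backup)
    missing = sorted(set(src) - set(bak))
    changed = sorted(p for p in src.keys() & bak.keys() if src[p] != bak[p])
    return missing, changed
```

Files that changed on the source after the backup ran will also show up as "changed", so compare against a snapshot or run the check right after the backup job.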