Slurm Workload Manager, formerly known as the Simple Linux Utility for Resource Management (SLURM), is a free and open-source job scheduler for Linux and Unix-like kernels, used by many of the world's supercomputers and computer clusters.
Questions tagged [slurm]
53 questions
8 votes · 1 answer
Slurm node daemon error: Can't open PID file
I run systemctl start slurmd.service, and it times out:
Job for slurmd.service failed because a timeout was exceeded.
The relevant lines from running systemctl status slurmd.service:
Mar 23 17:13:42 fedora1 systemd[1]: Starting Slurm node…
user3273814 · 213
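A frequent cause of this symptom (not confirmed by the excerpt above) is a mismatch between the PID file path systemd waits for and the one slurmd actually writes, so the unit times out even though the daemon is up. A sketch of the two settings that must agree; paths are illustrative:

```
# systemd drop-in, e.g. /etc/systemd/system/slurmd.service.d/override.conf
[Service]
PIDFile=/run/slurmd.pid

# must point at the same path in slurm.conf
SlurmdPidFile=/run/slurmd.pid
```

After changing either side, run systemctl daemon-reload before restarting slurmd.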
4 votes · 3 answers
Randomize Slurm Node Allocation
Has anyone had luck randomizing Slurm node allocations? We have a small cluster of 12 nodes that could be used by anywhere from 1 to 8 people at a time with jobs of various size/length. When testing our new Slurm setup, jobs always go to the first node…
tnallen · 41
4 votes · 1 answer
Why does Slurm fail to start with systemd but work when starting manually?
I've just set up slurm where one physical machine will be the only system in the cluster (so far). This is on Ubuntu 18.04.
I have slurmdbd running, but when I attempt to start slurmd and slurmctld, they time out. Why?
I'm issuing the following…
deltafft · 41
3 votes · 1 answer
Unable to contact slurm controller
I followed the steps to troubleshoot here: https://slurm.schedmd.com/troubleshoot.html.
When running scontrol show slurmd, I get:
Active Steps = NONE
Actual CPUs = 1
Actual Boards = 1
Actual sockets =…
user3273814 · 213
3 votes · 1 answer
ssh directly into a specific node on a cluster, without first ssh into login node?
I usually log on to a cluster, start a slurm interactive job, then I am able to ssh into specific running nodes.
My question is: is it generally possible to ssh into a specific node from my local machine, without first ssh-ing into a login node? I…
georg · 131
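Whether this works depends on site policy: compute nodes are often reachable only from the login node, and PAM modules such as pam_slurm_adopt may refuse sessions on nodes where you have no running job. Where direct hops are permitted, OpenSSH's ProxyJump makes the login hop transparent; the host names below are placeholders:

```
# ~/.ssh/config on the local machine (host names are hypothetical)
Host node*
    ProxyJump alice@login.cluster.example.org
```

With this in place, `ssh node042` tunnels through the login node automatically.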
3 votes · 0 answers
Managing SLURM memory on single node installation (issues)
I have SLURM set up on a single CentOS 7 node with 64 cores (128 CPUs). I have been using SLURM to submit jobs successfully using both srun and sbatch. However, it is with the caveat that I don't allocate memory. I can allocate CPUs, but not…
Wesley · 81
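The usual reason memory cannot be requested is that it is not configured as a consumable resource. A minimal sketch of the relevant slurm.conf lines for a single 128-CPU node, with illustrative values:

```
# slurm.conf — make memory schedulable alongside cores
SelectType=select/cons_tres          # or select/cons_res on older releases
SelectTypeParameters=CR_Core_Memory
NodeName=localhost CPUs=128 RealMemory=257000 State=UNKNOWN
```

With CR_Core_Memory active, jobs can request memory via --mem or --mem-per-cpu, and jobs that omit it fall back to DefMemPerCPU/DefMemPerNode if those are set.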
3 votes · 1 answer
SLURM with "partial" head node
I am trying to install SLURM with NFS on a small Ubuntu 18.04 HPC cluster, in a typical fashion, e.g. configure controller (slurmctld) and clients (slurmd) and shared directory, etc. What I am curious about is, is there a way to set it up such that…
rage_man · 133
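One way to keep a head node "partially" in the pool (a sketch, assuming that is what is wanted) is to run slurmd on it but reserve some cores for system and controller work with CoreSpecCount; names and numbers here are illustrative:

```
# slurm.conf — head node joins the pool minus 4 reserved cores
NodeName=headnode CPUs=16 CoreSpecCount=4 RealMemory=64000 State=UNKNOWN
PartitionName=main Nodes=headnode,node[01-03] Default=YES State=UP
```

Jobs then only ever see 12 of the head node's 16 cores, leaving the rest for slurmctld and NFS.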
3 votes · 2 answers
Query peak GPU memory used by finished job
I have a SLURM job I submit with sbatch, such as
sbatch --gres gpu:Tesla-V100:1 job.sh
job.sh trains a model on a V100 GPU. The code itself does not log GPU memory usage.
Is there a SLURM command to query peak GPU memory usage once the job is…
Mathias Müller · 183
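For context: sacct's MaxRSS field covers host memory only, so peak GPU memory generally has to be sampled inside the job itself. A sketch using nvidia-smi's query flags, where train.py stands in for the actual workload:

```
# inside job.sh: sample GPU memory every 10 s while training runs
nvidia-smi --query-gpu=timestamp,memory.used --format=csv -l 10 > gpu_mem.log &
SMI_PID=$!
python train.py               # the actual workload
kill "$SMI_PID"
```

The peak can then be read off gpu_mem.log after the job finishes.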
2 votes · 0 answers
Slurm - Does it maintain ccNUMA?
Does a SLURM cluster control, maintain, or enforce cache coherence across the nodes? Is it a configuration property, or does something like this not exist? I can't find anything inside the docs.
Semo · 271
2 votes · 1 answer
slurmdbd fails to start (initial installation)
I tried to install slurmdbd for accounting on a Ubuntu 16.04 from the standard repositories (version: 15.08.7-1build1).
Here are the commands:
$ sudo apt-get install mysql-server
$ sudo mysql
> create user 'slurm'@'localhost' identified by…
Sethos II · 537
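For the accounting daemon to reach that database, slurmdbd.conf has to carry matching storage settings. The keys below are real slurmdbd.conf options; the values are placeholders, and the password elided in the question is not reproduced:

```
# /etc/slurm-llnl/slurmdbd.conf (values illustrative)
AuthType=auth/munge
DbdHost=localhost
StorageType=accounting_storage/mysql
StorageHost=localhost
StorageUser=slurm
StoragePass=...        # same password as in the CREATE USER statement
```

The file typically must be readable only by the slurm user, or slurmdbd refuses to start.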
2 votes · 1 answer
How to upgrade Slurm?
I've been asked to upgrade our Slurm Workload Manager installation. I have a slurm 2.3.4 on a Debian 7.0 wheezy cluster (1 master + 8 nodes). I've not installed it so I'm a bit confused about how to do this and how to proceed without destroying…
Sasha Grievus · 233
2 votes · 3 answers
Setting up a cluster with workload distribution
I want to set up a server cluster which can keep my servers as busy as possible while still giving fair compute time to everyone. I have set up a basic Kubernetes setup but the issue is that if some user releases a pod which can parallelize up to say…
starhawk · 21
2 votes · 2 answers
Can I release stale allocated GRES on a Slurm node?
Is there any way to clear stale allocated GRES in Slurm?
I have one node where 4 GPUs are allocated while no jobs are running on the node. Rebooting the node does not release the GPUs.
user@control1:~$ scontrol show node node2
NodeName=node2…
Gerald Schneider · 26,582
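A workaround often used for stuck allocations (a sketch, not confirmed for this specific case) is cycling the node's state, which resets its allocation bookkeeping on the controller:

```
# on the controller, as a Slurm administrator; node name from the question
scontrol update NodeName=node2 State=DOWN Reason="clearing stale GRES"
scontrol update NodeName=node2 State=RESUME
```

Setting the node DOWN kills anything still tracked on it, and RESUME returns it to service with its GRES counters cleared.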
2 votes · 1 answer
Slurm config to limit the end time of all new jobs on the cluster to a certain date
Our cluster has to be shut down for an update in two weeks. We would like to let users use the cluster until the last day, but we want to make sure no job can be started that would end after the shutdown date. Is there an easy way to limit the…
stupidstudent · 123
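One mechanism stock Slurm offers for this is a maintenance reservation spanning the downtime: the scheduler will not start any job whose time limit overlaps the reservation. A sketch with an illustrative date:

```
# block all nodes from the shutdown date onward (date is illustrative)
scontrol create reservation ReservationName=shutdown Users=root Flags=maint \
    Nodes=ALL StartTime=2020-04-01T00:00:00 Duration=infinite
```

Jobs that would run into the reserved window stay pending with a reason of the form ReqNodeNotAvail until the reservation is deleted after the update.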
2 votes · 1 answer
How can I set up interactive-job-only or batch-job-only partition on a SLURM cluster?
I'm managing a PBS/torque HPC cluster, and now I'm setting up another cluster with SLURM. On the PBS cluster, I can set a queue to accept only interactive jobs by qmgr -c "set queue interactive_q disallowed_types = batch" and to accept only batch…
wdg · 153