Slurm Workload Manager, formerly known as the Simple Linux Utility for Resource Management (SLURM), is a free and open-source job scheduler for Linux and Unix-like kernels, used by many of the world's supercomputers and computer clusters.
Questions tagged [slurm]
53 questions
8 votes · 1 answer
Slurm node daemon error: Can't open PID file
I run systemctl start slurmd.service, and it times out:
Job for slurmd.service failed because a timeout was exceeded.
The relevant lines from running systemctl status slurmd.service:
Mar 23 17:13:42 fedora1 systemd[1]: Starting Slurm node…
user3273814 · 213
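A frequent cause of this symptom (not confirmed by the excerpt above) is a mismatch between the PID file path systemd waits for and the one slurmd actually writes, so the unit times out even though the daemon is up. A sketch of the two settings that must agree; paths are illustrative:

```
# systemd drop-in, e.g. /etc/systemd/system/slurmd.service.d/override.conf
[Service]
PIDFile=/run/slurmd.pid

# must point at the same path in slurm.conf
SlurmdPidFile=/run/slurmd.pid
```

After changing either side, run systemctl daemon-reload before restarting slurmd.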
4 votes · 3 answers
Randomize Slurm Node Allocation
Has anyone had luck randomizing Slurm node allocations? We have a small cluster of 12 nodes that could be used by anywhere from 1 to 8 people at a time with jobs of various size/length. When testing our new Slurm setup, jobs always go to the first node…
tnallen · 41
4 votes · 1 answer
Why does Slurm fail to start with systemd but work when starting manually?
I've just set up slurm where one physical machine will be the only system in the cluster (so far). This is on Ubuntu 18.04.
I have slurmdbd running, but when I attempt to start slurmd and slurmctld, they time out. Why?
I'm issuing the following…
deltafft · 41
3 votes · 1 answer
Unable to contact slurm controller
I followed the steps to troubleshoot here: https://slurm.schedmd.com/troubleshoot.html.
When running scontrol show slurmd, I get:
Active Steps = NONE
Actual CPUs = 1
Actual Boards = 1
Actual sockets =…
user3273814 · 213
3 votes · 1 answer
ssh directly into a specific node on a cluster, without first ssh into login node?
I usually log on to a cluster, start a slurm interactive job, then I am able to ssh into specific running nodes.
My question is: is it generally possible to ssh into a specific node from my local machine, without first ssh-ing into a login node? I…
georg · 131
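Whether this works depends on site policy: compute nodes are often reachable only from the login node, and PAM modules such as pam_slurm_adopt may refuse sessions on nodes where you have no running job. Where direct hops are permitted, OpenSSH's ProxyJump makes the login hop transparent; the host names below are placeholders:

```
# ~/.ssh/config on the local machine (host names are hypothetical)
Host node*
    ProxyJump alice@login.cluster.example.org
```

With this in place, `ssh node042` tunnels through the login node automatically.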
3 votes · 0 answers
Managing SLURM memory on single node installation (issues)
I have SLURM set up on a single CentOS 7 node with 64 cores (128 CPUs). I have been using SLURM to submit jobs successfully using both srun and sbatch. However, it is with the caveat that I don't allocate memory. I can allocate CPUs, but not…
Wesley · 81
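The usual reason memory cannot be requested is that it is not configured as a consumable resource. A minimal sketch of the relevant slurm.conf lines for a single 128-CPU node, with illustrative values:

```
# slurm.conf — make memory schedulable alongside cores
SelectType=select/cons_tres          # or select/cons_res on older releases
SelectTypeParameters=CR_Core_Memory
NodeName=localhost CPUs=128 RealMemory=257000 State=UNKNOWN
```

With CR_Core_Memory active, jobs can request memory via --mem or --mem-per-cpu, and jobs that omit it fall back to DefMemPerCPU/DefMemPerNode if those are set.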
3 votes · 1 answer
SLURM with "partial" head node
I am trying to install SLURM with NFS on a small Ubuntu 18.04 HPC cluster, in a typical fashion, e.g. configure controller (slurmctld) and clients (slurmd) and shared directory, etc. What I am curious about is, is there a way to set it up such that…
rage_man · 133
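One way to keep a head node "partially" in the pool (a sketch, assuming that is what is wanted) is to run slurmd on it but reserve some cores for system and controller work with CoreSpecCount; names and numbers here are illustrative:

```
# slurm.conf — head node joins the pool minus 4 reserved cores
NodeName=headnode CPUs=16 CoreSpecCount=4 RealMemory=64000 State=UNKNOWN
PartitionName=main Nodes=headnode,node[01-03] Default=YES State=UP
```

Jobs then only ever see 12 of the head node's 16 cores, leaving the rest for slurmctld and NFS.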
3 votes · 2 answers
Query peak GPU memory used by finished job
I have a SLURM job I submit with sbatch, such as
sbatch --gres gpu:Tesla-V100:1 job.sh
job.sh trains a model on a V100 GPU. The code itself does not log GPU memory usage.
Is there a SLURM command to query peak GPU memory usage once the job is…
Mathias Müller · 183
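For context: sacct's MaxRSS field covers host memory only, so peak GPU memory generally has to be sampled inside the job itself. A sketch using nvidia-smi's query flags, where train.py stands in for the actual workload:

```
# inside job.sh: sample GPU memory every 10 s while training runs
nvidia-smi --query-gpu=timestamp,memory.used --format=csv -l 10 > gpu_mem.log &
SMI_PID=$!
python train.py               # the actual workload
kill "$SMI_PID"
```

The peak can then be read off gpu_mem.log after the job finishes.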
2 votes · 0 answers
Slurm - Does it maintain ccNUMA?
Does a SLURM cluster control, maintain, or enforce cache coherence across the nodes? Is it a configuration property, or does something like this not exist? I can't find anything inside the docs.
Semo · 271
2 votes · 1 answer
slurmdbd fails to start (initial installation)
I tried to install slurmdbd for accounting on a Ubuntu 16.04 from the standard repositories (version: 15.08.7-1build1).
Here are the commands:
$ sudo apt-get install mysql-server
$ sudo mysql
> create user 'slurm'@'localhost' identified by…
Sethos II · 537
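For the accounting daemon to reach that database, slurmdbd.conf has to carry matching storage settings. The keys below are real slurmdbd.conf options; the values are placeholders, and the password elided in the question is not reproduced:

```
# /etc/slurm-llnl/slurmdbd.conf (values illustrative)
AuthType=auth/munge
DbdHost=localhost
StorageType=accounting_storage/mysql
StorageHost=localhost
StorageUser=slurm
StoragePass=...        # same password as in the CREATE USER statement
```

The file typically must be readable only by the slurm user, or slurmdbd refuses to start.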
2 votes · 1 answer
How to upgrade Slurm?
I've been asked to upgrade our Slurm Workload Manager installation. I have a slurm 2.3.4 on a Debian 7.0 wheezy cluster (1 master + 8 nodes). I've not installed it so I'm a bit confused about how to do this and how to proceed without destroying…
Sasha Grievus · 233
2 votes · 3 answers
Setting up a cluster with workload distribution
I want to set up a server cluster which can keep my servers as busy as possible while still giving fair compute time to everyone. I have set up a basic Kubernetes setup but the issue is that if some user releases a pod which can parallelize up to say…
starhawk · 21
2 votes · 2 answers
Can I release stale allocated GRES on a Slurm node?
Is there any way to clear stale allocated GRES in Slurm?
I have one node where 4 GPUs are allocated while no jobs are running on the node. Rebooting the node does not release the GPUs.
user@control1:~$ scontrol show node node2
NodeName=node2…
Gerald Schneider · 26,582
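A workaround often used for stuck allocations (a sketch, not confirmed for this specific case) is cycling the node's state, which resets its allocation bookkeeping on the controller:

```
# on the controller, as a Slurm administrator; node name from the question
scontrol update NodeName=node2 State=DOWN Reason="clearing stale GRES"
scontrol update NodeName=node2 State=RESUME
```

Setting the node DOWN kills anything still tracked on it, and RESUME returns it to service with its GRES counters cleared.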
2 votes · 1 answer
Slurm config to limit the end time of all new jobs on the cluster to a certain date
Our cluster has to be shut down for an update in two weeks. We would like to let users use the cluster until the last day, but we want to make sure no job can be started that would end after the shutdown date. Is there an easy way to limit the…
stupidstudent · 123
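One mechanism stock Slurm offers for this is a maintenance reservation spanning the downtime: the scheduler will not start any job whose time limit overlaps the reservation. A sketch with an illustrative date:

```
# block all nodes from the shutdown date onward (date is illustrative)
scontrol create reservation ReservationName=shutdown Users=root Flags=maint \
    Nodes=ALL StartTime=2020-04-01T00:00:00 Duration=infinite
```

Jobs that would run into the reserved window stay pending with a reason of the form ReqNodeNotAvail until the reservation is deleted after the update.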
2 votes · 1 answer
How can I set up interactive-job-only or batch-job-only partition on a SLURM cluster?
I'm managing a PBS/torque HPC cluster, and now I'm setting up another cluster with SLURM. On the PBS cluster, I can set a queue to accept only interactive jobs by qmgr -c "set queue interactive_q disallowed_types = batch" and to accept only batch…
wdg · 153