
I have SLURM set up on a single CentOS 7 node with 64 cores (128 CPUs). I have been submitting jobs successfully with both srun and sbatch, but with one caveat: I cannot allocate memory. I can allocate CPUs, but not memory.
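
For example (a minimal illustration of what I mean; hostname just stands in for a real workload):

srun --cpus-per-task=4 hostname               # runs fine
srun --cpus-per-task=4 --mem=2000M hostname   # fails with the srun equivalent of the error below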

When I try to allocate memory, I get:

sbatch: error: Memory specification can not be satisfied
sbatch: error: Batch job submission failed: Requested node configuration is not available
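
For reference, the memory SLURM believes the node has can be checked like this (a diagnostic sketch; dummyname stands in for the real hostname):

# Show the node as slurmctld sees it; RealMemory is the total memory
# SLURM will schedule against, and it defaults to 1 (MB) if unset.
scontrol show node dummyname

# Print the hardware slurmd detects on this machine, including the
# RealMemory value it would report:
slurmd -C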

So this will run:

#!/bin/bash
#SBATCH --job-name=name
#SBATCH --output=name.txt
#SBATCH --cpus-per-task=10
#SBATCH --time=6-59:00

But this will not run:

#!/bin/bash
#SBATCH --job-name=name
#SBATCH --output=name.txt
#SBATCH --cpus-per-task=10
#SBATCH --mem=2000M
#SBATCH --time=6-59:00

Similarly, this will not run:

#!/bin/bash
#SBATCH --job-name=name
#SBATCH --output=name.txt
#SBATCH --cpus-per-task=10
#SBATCH --mem-per-cpu=2000M
#SBATCH --time=6-59:00

Both give the above error message.

This is a pain because, now that I am starting to max out CPU usage, jobs are clashing and failing. I believe this is because memory is not being allocated properly, so programs crash with bad_alloc errors or simply stop running. I have used SLURM quite a bit on Compute Canada clusters, and assigning memory was never an issue there. Is the problem that I am running SLURM on a single node which is also the login node, or that I am essentially using default settings and need to do some admin work? (I sketch the change I suspect is needed after the config below.)

I have tried using different units for memory, such as 2G rather than 2000M, and I have tried 1024M as well, but to no avail.

The slurm.conf file is:

ClusterName=linux
ControlMachine=dummyname
ControlAddr=dummyaddress
#BackupController=
#BackupAddr=
#
#SlurmUser=slurm
SlurmdUser=root
SlurmctldPort=dummyport
SlurmdPort=dummyport+1
AuthType=auth/munge
#JobCredentialPrivateKey=
#JobCredentialPublicCertificate=
StateSaveLocation=/var/lib/slurm
SlurmdSpoolDir=/tmp/slurmd
SwitchType=switch/none
MpiDefault=none
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmdPidFile=/var/run/slurmd.pid
ProctrackType=proctrack/pgid
#PluginDir=
#FirstJobId=
ReturnToService=1
#MaxJobCount=
#PlugStackConfig=
#PropagatePrioProcess=
#PropagateResourceLimits=
#PropagateResourceLimitsExcept=
#Prolog=
#Epilog=
#SrunProlog=
#SrunEpilog=
#TaskProlog=
#TaskEpilog=
#TaskPlugin=
#TrackWCKey=no
#TreeWidth=50
#TmpFS=
#UsePAM=
#
# TIMERS
SlurmctldTimeout=300
SlurmdTimeout=300
InactiveLimit=0
MinJobAge=300
KillWait=30
Waittime=0
#
# SCHEDULING
SchedulerType=sched/backfill
#SchedulerAuth=
#SchedulerPort=
#SchedulerRootFilter=
SelectType=select/cons_res
SelectTypeParameters=CR_CORE
#FastSchedule=1
#PriorityType=priority/multifactor
#PriorityDecayHalfLife=14-0
#PriorityUsageResetPeriod=14-0
#PriorityWeightFairshare=100000
#PriorityWeightAge=1000
#PriorityWeightPartition=10000
#PriorityWeightJobSize=1000
#PriorityMaxAge=1-0
#
# LOGGING
#DebugFlags=gres
SlurmctldDebug=3
#SlurmctldLogFile=
SlurmdDebug=3
#SlurmdLogFile=
JobCompType=jobcomp/none
#JobCompLoc=
#
# ACCOUNTING
#JobAcctGatherType=jobacct_gather/linux
#JobAcctGatherFrequency=30
#AccountingStorageType=accounting_storage/slurmdbd
#AccountingStorageHost=
#AccountingStorageLoc=
#AccountingStoragePass=
#AccountingStorageUser=
#
# COMPUTE NODES
GresTypes=gpu
NodeName=dummyname CoresPerSocket=64 Sockets=1 ThreadsPerCore=2 State=IDLE Gres=gpu:2
#NodeName=dummyname CoresPerSocket=64 Sockets=1 ThreadsPerCore=2 State=IDLE
PartitionName=all Nodes=dummyname Default=YES Shared=Yes MaxTime=INFINITE State=UP
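
I notice the NodeName line has no RealMemory setting. My suspicion (from the slurm.conf man page, where RealMemory defaults to 1 MB) is that the admin-side fix looks roughly like this sketch; 250000 is a placeholder, and the real value should come from slurmd -C or free -m:

# Schedule memory as well as cores (CR_CORE alone does not track memory):
SelectType=select/cons_res
SelectTypeParameters=CR_Core_Memory

# RealMemory is in MB; 250000 is a placeholder value, not my actual RAM:
NodeName=dummyname CoresPerSocket=64 Sockets=1 ThreadsPerCore=2 RealMemory=250000 State=IDLE Gres=gpu:2

followed by an scontrol reconfigure (or a restart of slurmctld and slurmd). I have not applied this yet, so I may be off base.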

Wesley
