Specifying Compute Resources in WDL Tasks

Use the tables below to decide how to configure the runtime{} section of each task.

How to Allocate Resources in your Runtime Section

HTCondor is the back end to Cromwell and is responsible for obtaining an appropriately sized resource from Slurm for each WDL task. HTCondor determines which resource your task needs from just two values, memory and cpu, both of which are set in the runtime{} section. They default to “2G” and 1 (thread), respectively, so you do not have to include them, but setting them explicitly is advised for reproducibility.

Note

In the runtime{} section of your WDL, set the cpu key to the number of threads, not the number of CPUs. Despite the key's name, HTCondor expects this value to be threads.
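
As a minimal sketch, a task that states the defaults explicitly could look like the following. The task name say_hello and its command are hypothetical and exist only for illustration; the memory and cpu values are the defaults described above.

```wdl
version 1.0

# Hypothetical task used only to illustrate the runtime{} section.
task say_hello {
  command <<<
    echo "hello"
  >>>

  output {
    String greeting = read_string(stdout())
  }

  runtime {
    memory: "2G"   # same as the default, stated explicitly for reproducibility
    cpu: 1         # number of threads, not physical CPUs
  }
}
```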

Table of available resources

Site	Type	#Nodes	Mem (GB)*	Minutes	#Threads	#GPUs
Perlmutter (NERSC)	Large	3072	492	2865	256	0
	GPU (4x NVIDIA A100 (40GB))	1536	256	2865	128	4
	GPU (4x NVIDIA A100 (80GB))	256	256	2865	128	4
JGI (Lab-IT)	Large	8	492	4305	32	0
Dori (Lab-IT)	Large	100	492	4305	64	0
Dori (Lab-IT)	Xlarge	16	1980	20160	128	0
Tahoma (EMSL)	Medium	184	364	2865	36	0
	Xlarge	24	1480	2865	36	0
	GPU (2x NVIDIA Tesla V100 32GB)	24	1480	2865	36	2
Crux (ALCF)	Medium	256	256	1425	256	0
Defiant (OLCF)	Medium	20	492	4305	128	0

Note: The Defiant site is still under testing and may not yet be available for production use.

Memory Overhead

The Mem (GB)* column in the table above lists the gigabytes you can actually use once overhead is subtracted. For example, a “large” node on Dori is advertised at 512G, but 20G is reserved for overhead, so we ask for 492G in our WDL instead.
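
For instance, a task meant to fill a Dori “large” node might request the usable figures from the table rather than the advertised 512G. This is only a sketch of a runtime{} fragment; the values are copied from the Dori “Large” row above.

```wdl
runtime {
  memory: "492G"   # 512G advertised minus 20G reserved for overhead
  cpu: 64          # threads listed for a Dori "Large" node
}
```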

Time Overhead

When Cromwell submits a task, HTCondor manages job scheduling by checking the queue for available resources. The JAWS Pool Manager monitors HTCondor and, when needed, requests new Slurm nodes. Once a compute node is available, HTCondor submits the task.

Due to a slight delay (a few seconds) in resource allocation, we build in a time buffer to ensure jobs get the full requested time. For example, instead of requesting the maximum 48 hours on Perlmutter, we request 47 hours and 45 minutes to account for the delay.
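
The same buffer is reflected in the Minutes column of the table. A sketch of a runtime{} fragment asking for the longest run allowed on Perlmutter, assuming the runtime_minutes key used in the GPU example on this page, might be:

```wdl
runtime {
  memory: "2G"
  cpu: 1
  runtime_minutes: 2865   # 47 h 45 min = 47 * 60 + 45, just under the 48 h limit
}
```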

Links to documentation about each cluster:

Note

Remember that in your runtime{} section, the number you give cpu: is interpreted by HTCondor as threads, not CPUs.

GPU Resources

Attention

📘 For comprehensive GPU documentation, see the GPU Usage Guide.

The GPU guide includes: runtime stanza reference, container requirements, troubleshooting, WDL examples, and best practices.

JAWS supports GPU-enabled tasks at the following sites:

Perlmutter (NERSC): NVIDIA A100 GPUs
Tahoma (EMSL): NVIDIA Tesla V100 GPUs

See the Table of available resources above for GPU node specifications and memory details.

Quick Start: GPU Runtime Stanza

Minimal GPU configuration (recommended starting point):

runtime {
  docker: "pytorch/pytorch:latest"  # Must be CUDA-enabled
  memory: "16GiB"
  cpu: 4
  gpu: true                         # Enable GPU (gpuCount defaults to 1)
  runtime_minutes: 60
}

GPU Runtime Attributes

Attribute	Type	Default	Description
`gpu`	Boolean	`false`	Set to `true` to enable GPU allocation. Required for GPU access.
`gpuCount`	Integer	`1`	Number of GPUs to request. Only applies when `gpu: true`. Most tasks should use 1.

Default behavior: When gpu: true is set without gpuCount, JAWS allocates 1 GPU.

# These are equivalent (WDL runtime entries are not comma-separated):
runtime { gpu: true }
runtime {
  gpu: true
  gpuCount: 1
}

Warning

Requesting gpuCount > 1 does not automatically parallelize your code. Your application must explicitly use multi-GPU frameworks (e.g., PyTorch DistributedDataParallel, Horovod).
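
To illustrate, a multi-GPU task could pair gpuCount with a launcher that starts one process per GPU. The sketch below assumes a hypothetical training script (train_script) that already uses PyTorch DistributedDataParallel; the task name and the resource values are likewise illustrative, not recommendations.

```wdl
version 1.0

# Hypothetical task: the script itself must implement multi-GPU training.
task train_multi_gpu {
  input {
    File train_script   # assumed to use DistributedDataParallel internally
  }

  command <<<
    # torchrun starts one worker process per GPU on this node
    torchrun --nproc_per_node=4 ~{train_script}
  >>>

  runtime {
    docker: "pytorch/pytorch:latest"
    memory: "64G"
    cpu: 16
    gpu: true
    gpuCount: 4            # matches --nproc_per_node above
    runtime_minutes: 120
  }
}
```

Without a launcher or framework of this kind, the extra GPUs would be allocated but sit idle.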

Key Notes:

gpu: true enables GPU resource allocation for the task.
gpuCount: 3 requests 3 GPUs from the scheduler. If gpuCount is not specified, JAWS requests 1 GPU by default when gpu: true is set.
Ensure the Docker container used supports GPU execution (e.g., pytorch/pytorch:latest includes CUDA).
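
One way to confirm that a container can actually see the GPU is a small check task. The following is a sketch under the assumption that the image ships Python and PyTorch, as pytorch/pytorch:latest does; the task name check_gpu is hypothetical.

```wdl
version 1.0

# Hypothetical task: reports whether PyTorch can see a CUDA device.
task check_gpu {
  command <<<
    python -c "import torch; print(torch.cuda.is_available())"
  >>>

  output {
    String cuda_available = read_string(stdout())
  }

  runtime {
    docker: "pytorch/pytorch:latest"
    memory: "16GiB"
    cpu: 4
    gpu: true              # gpuCount defaults to 1
    runtime_minutes: 10
  }
}
```

If cuda_available comes back as False, the container or the runtime settings likely need attention before running a longer job.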