Specifying Compute Resources in WDL Tasks
Use the following tables to help figure out how to configure your runtime{} section.
How to Allocate Resources in your Runtime Section
HTCondor is the backend to Cromwell and is responsible for acquiring an appropriately sized resource from Slurm for each WDL task.
HTCondor determines what resource your task needs from just two values, memory and cpu, which are set in the runtime{} section.
Both have defaults (memory is “2G” and cpu is 1 thread), so you don’t have to include them, but setting them explicitly is advised for reproducibility.
Note
In the runtime{} section of your WDL, set the cpu key to the number of threads, not the number of CPUs, despite the name, because HTCondor expects that value to be threads.
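As a minimal sketch of the advice above, the task below sets memory and cpu explicitly instead of relying on the defaults. The task name, input, command, and Docker image are hypothetical and chosen only for illustration; runtime_minutes is the same key used in the GPU example later on this page.

```wdl
version 1.0

# Hypothetical task: counts lines in a file.
task count_lines {
    input {
        File reads
    }

    command <<<
        wc -l < ~{reads} > line_count.txt
    >>>

    output {
        Int lines = read_int("line_count.txt")
    }

    runtime {
        docker: "ubuntu:22.04"   # illustrative image, not a JAWS requirement
        memory: "8G"             # explicit, rather than the "2G" default
        cpu: 4                   # interpreted by HTCondor as 4 threads
        runtime_minutes: 30
    }
}
```

Stating both values means the task requests the same resources even if the defaults change.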
Table of available resources
| Site | Type | #Nodes | Mem (GB)* | Minutes | #Threads | #GPUs |
|---|---|---|---|---|---|---|
| Perlmutter (NERSC) | Large | 3072 | 492 | 2865 | 256 | 0 |
| Perlmutter (NERSC) | GPU (4x NVIDIA A100 (40GB)) | 1536 | 256 | 2865 | 128 | 4 |
| Perlmutter (NERSC) | GPU (4x NVIDIA A100 (80GB)) | 256 | 256 | 2865 | 128 | 4 |
| JGI (Lab-IT) | Large | 8 | 492 | 4305 | 32 | 0 |
| Dori (Lab-IT) | Large | 100 | 492 | 4305 | 64 | 0 |
| Dori (Lab-IT) | Xlarge | 16 | 1980 | 20160 | 128 | 0 |
| Tahoma (EMSL) | Medium | 184 | 364 | 2865 | 36 | 0 |
| Tahoma (EMSL) | Xlarge | 24 | 1480 | 2865 | 36 | 0 |
| Tahoma (EMSL) | GPU (2x NVIDIA Tesla V100 32GB) | 24 | 1480 | 2865 | 36 | 2 |
| Crux (ALCF) | Medium | 256 | 256 | 1425 | 256 | 0 |
| Defiant (OLCF) | Medium | 20 | 492 | 4305 | 128 | 0 |
Note: The Defiant site is still under testing and may not yet be available for production use.
Memory Overhead
The Mem (GB)* column in the table lists the memory you can actually request, which is lower than the advertised amount because of system overhead. For example, a “large” node on Dori is advertised as 512G, but 20G is reserved for overhead, so request at most 492G in your WDL.
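The sketch below shows how the overhead-adjusted number might appear in a task that needs a full “large” node on Dori. The memory and cpu values are taken from the table; the task name, input, command, image, and run time are hypothetical.

```wdl
version 1.0

# Hypothetical task: sorts a large table using every thread on the node.
task sort_big_table {
    input {
        File big_table
        Int threads = 64         # #Threads for a Dori "large" node
    }

    command <<<
        sort --parallel=~{threads} ~{big_table} > sorted.tsv
    >>>

    output {
        File sorted = "sorted.tsv"
    }

    runtime {
        docker: "ubuntu:22.04"   # illustrative image
        memory: "492G"           # 512G advertised minus 20G overhead
        cpu: threads
        runtime_minutes: 120     # illustrative; the site maximum is 4305
    }
}
```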
Time Overhead
When Cromwell submits a task, HTCondor manages job scheduling by checking the queue for available resources. The JAWS Pool Manager monitors HTCondor and, when needed, requests new Slurm nodes. Once a compute node is available, HTCondor submits the task.
Due to a slight delay (a few seconds) in resource allocation, we build in a time buffer to ensure jobs get the full requested time. For example, instead of requesting the maximum 48 hours on Perlmutter, we request 47 hours and 45 minutes to account for the delay.
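The buffered limit can be written out as arithmetic in the task itself, which makes the origin of the number visible: 47 hours and 45 minutes is 47 × 60 + 45 = 2865 minutes, matching the Minutes column for Perlmutter. This is a sketch only; the task name, command, image, and the max_minutes declaration are hypothetical, while memory and cpu come from the Perlmutter “Large” row.

```wdl
version 1.0

# Hypothetical long-running task that requests the longest allowed run time.
task long_running_step {
    # 47 h 45 min expressed in minutes (evaluates to 2865)
    Int max_minutes = 47 * 60 + 45

    command <<<
        echo "long-running work goes here"
    >>>

    output {
        String message = read_string(stdout())
    }

    runtime {
        docker: "ubuntu:22.04"   # illustrative image
        memory: "492G"
        cpu: 256
        runtime_minutes: max_minutes
    }
}
```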
Links to documentation about each cluster:
Note
Remember that in your runtime{} section, HTCondor interprets the number you give for cpu: as threads, not CPUs.
GPU Resources
Attention
📘 For comprehensive GPU documentation, see the GPU Usage Guide.
The GPU guide includes: runtime stanza reference, container requirements, troubleshooting, WDL examples, and best practices.
JAWS supports GPU-enabled tasks at the following sites:
Perlmutter (NERSC): NVIDIA A100 GPUs
Tahoma (EMSL): NVIDIA Tesla V100 GPUs
See the Table of available resources above for GPU node specifications and memory details.
Quick Start: GPU Runtime Stanza
Minimal GPU configuration (recommended starting point):
runtime {
    docker: "pytorch/pytorch:latest"  # Must be CUDA-enabled
    memory: "16GiB"
    cpu: 4
    gpu: true                         # Enable GPU (defaults to 1)
    runtime_minutes: 60
}
GPU Runtime Attributes
| Attribute | Type | Default | Description |
|---|---|---|---|
| gpu | Boolean | | Set to true to enable GPU resource allocation for the task. |
| gpuCount | Integer | 1 | Number of GPUs to request. Only applies when gpu: true. |
Default behavior: When gpu: true is set without gpuCount, JAWS allocates 1 GPU.
# These are equivalent (entries in a WDL runtime section are not comma-separated):
runtime { gpu: true }
runtime { gpu: true  gpuCount: 1 }
Warning
Requesting gpuCount > 1 does not automatically parallelize your code. Your application must explicitly use multi-GPU frameworks (e.g., PyTorch DistributedDataParallel, Horovod).
Key Notes:
gpu: true enables GPU resource allocation for the task.
gpuCount: 3 requests 3 GPUs from the scheduler. If gpuCount is not specified, JAWS will request 1 GPU by default when gpu: true.
Ensure the Docker container used supports GPU execution (e.g., pytorch/pytorch:latest includes CUDA).
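To tie these notes together, here is a sketch of a complete task that requests two GPUs and reports how many the container can see. The runtime keys and the image come from this page; the task name, command, and output are hypothetical. It assumes the image provides python with PyTorch installed, which is the purpose of pytorch/pytorch.

```wdl
version 1.0

# Hypothetical task: reports the number of GPUs visible to PyTorch.
task check_gpus {
    command <<<
        python -c "import torch; print(torch.cuda.device_count())" > gpu_count.txt
    >>>

    output {
        Int visible_gpus = read_int("gpu_count.txt")
    }

    runtime {
        docker: "pytorch/pytorch:latest"  # CUDA-enabled image
        memory: "16GiB"
        cpu: 4
        gpu: true
        gpuCount: 2                       # e.g. both V100s on a Tahoma GPU node
        runtime_minutes: 60
    }
}
```

Running a small check like this before a real workload is one way to confirm that the requested GPUs are visible inside the container.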