Specifying Compute Resources in WDL Tasks

Use the following tables to help figure out how to configure your runtime{} section.

How to Allocate Resources in your Runtime Section

HTCondor is the back end to Cromwell and is responsible for obtaining an appropriately sized resource from Slurm for each WDL task. HTCondor determines which resource your task needs from just two values, memory and cpu, both of which are set in the runtime{} section. Both have defaults (“2G” for memory and 1 thread for cpu), so you are not required to include them, but setting them explicitly is advised for reproducibility.

Note

In the runtime{} section of your WDL, set the cpu key to the number of threads, not the number of CPUs. Despite the key's name, HTCondor expects this value to be a thread count.
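As a minimal sketch, a task that states both values explicitly might look like the following. The task name, its input, and the ubuntu:22.04 image are placeholders chosen for illustration; the memory and cpu values simply restate the defaults described above.

```wdl
version 1.0

# Hypothetical example task: counts the lines in a text file.
task count_lines {
  input {
    File infile
  }

  command <<<
    wc -l < ~{infile} > line_count.txt
  >>>

  output {
    Int n_lines = read_int("line_count.txt")
  }

  runtime {
    docker: "ubuntu:22.04"  # placeholder image, not a JAWS requirement
    memory: "2G"            # same as the default, stated for reproducibility
    cpu: 1                  # number of threads, not physical CPUs
  }
}
```

Writing the defaults out costs two lines and makes the resource request visible to anyone who reruns the workflow later.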

Table of available resources

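| Site               | Type                            | #Nodes | Mem (GB)* | Minutes | #Threads | #GPUs |
|--------------------|---------------------------------|--------|-----------|---------|----------|-------|
| Perlmutter (NERSC) | Large                           | 3072   | 492       | 2865    | 256      | 0     |
| Perlmutter (NERSC) | GPU (4x NVIDIA A100 (40GB))     | 1536   | 256       | 2865    | 128      | 4     |
| Perlmutter (NERSC) | GPU (4x NVIDIA A100 (80GB))     | 256    | 256       | 2865    | 128      | 4     |
| JGI (Lab-IT)       | Large                           | 8      | 492       | 4305    | 32       | 0     |
| Dori (Lab-IT)      | Large                           | 100    | 492       | 4305    | 64       | 0     |
| Dori (Lab-IT)      | Xlarge                          | 16     | 1980      | 20160   | 128      | 0     |
| Tahoma (EMSL)      | Medium                          | 184    | 364       | 2865    | 36       | 0     |
| Tahoma (EMSL)      | Xlarge                          | 24     | 1480      | 2865    | 36       | 0     |
| Tahoma (EMSL)      | GPU (2x NVIDIA Tesla V100 32GB) | 24     | 1480      | 2865    | 36       | 2     |
| Crux (ALCF)        | Medium                          | 256    | 256       | 1425    | 256      | 0     |
| Defiant (OLCF)     | Medium                          | 20     | 492       | 4305    | 128      | 0     |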

Note: The Defiant site is still being tested and may not yet be available for production use.

Memory Overhead

The Mem (GB) column in the table above lists the memory you can actually request, which is less than the advertised amount because some memory is reserved for overhead. For example, a “large” node on Dori is advertised at 512G, but we reserve 20G for overhead, so the most you should ask for in your WDL is 492G.
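For illustration, a runtime section that asks for a whole Dori “large” node could be sketched as follows. The values come from the table of available resources; the image name is a placeholder, and the fragment is meant to sit inside a task definition.

```wdl
runtime {
  docker: "ubuntu:22.04"  # placeholder image
  memory: "492G"          # 512G advertised minus 20G reserved for overhead
  cpu: 64                 # threads on a Dori "large" node, per the table
}
```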

Time Overhead

When Cromwell submits a task, HTCondor manages job scheduling by checking the queue for available resources. The JAWS Pool Manager monitors HTCondor and, when needed, requests new Slurm nodes. Once a compute node is available, HTCondor submits the task.

Because resource allocation is delayed by a few seconds, we build in a time buffer so that jobs get the full time they request. For example, instead of requesting the maximum 48 hours on Perlmutter, we request 47 hours and 45 minutes, which is the 2865 minutes listed in the table.
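A sketch of a runtime section that requests the longest run available on a Perlmutter “Large” node, using the table values and the runtime_minutes attribute. The image name is a placeholder, and the fragment is meant to sit inside a task definition.

```wdl
runtime {
  docker: "ubuntu:22.04"  # placeholder image
  memory: "492G"          # usable memory on a Perlmutter "Large" node
  cpu: 256                # threads on a Perlmutter "Large" node
  runtime_minutes: 2865   # 47 h 45 min = 47 * 60 + 45, not the full 48 h (2880)
}
```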

GPU Resources

Attention

📘 For comprehensive GPU documentation, see the GPU Usage Guide.

The GPU guide includes: runtime stanza reference, container requirements, troubleshooting, WDL examples, and best practices.

JAWS supports GPU-enabled tasks at the following sites:

  • Perlmutter (NERSC): NVIDIA A100 GPUs

  • Tahoma (EMSL): NVIDIA Tesla V100 GPUs

See the Table of available resources above for GPU node specifications and memory details.

Quick Start: GPU Runtime Stanza

Minimal GPU configuration (recommended starting point):

runtime {
  docker: "pytorch/pytorch:latest"  # Must be CUDA-enabled
  memory: "16GiB"
  cpu: 4
  gpu: true                         # Enable GPU (defaults to 1)
  runtime_minutes: 60
}

GPU Runtime Attributes

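| Attribute | Type    | Default | Description                                                                      |
|-----------|---------|---------|----------------------------------------------------------------------------------|
| gpu       | Boolean | false   | Set to true to enable GPU allocation. Required for GPU access.                   |
| gpuCount  | Integer | 1       | Number of GPUs to request. Only applies when gpu: true. Most tasks should use 1. |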

Default behavior: When gpu: true is set without gpuCount, JAWS allocates 1 GPU.

# These are equivalent:
runtime { gpu: true }
runtime { gpu: true, gpuCount: 1 }

Warning

Requesting gpuCount > 1 does not automatically parallelize your code. Your application must explicitly use multi-GPU frameworks (e.g., PyTorch DistributedDataParallel, Horovod).

Key Notes:

  • gpu: true enables GPU resource allocation for the task.

  • gpuCount: 3 requests 3 GPUs from the scheduler. If gpuCount is not specified, JAWS will request 1 GPU by default when gpu: true.

  • Ensure the Docker container used supports GPU execution (e.g., pytorch/pytorch:latest includes CUDA).
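One way to check what a task was actually given is to list the visible GPUs from inside the task. The sketch below requests 2 GPUs, a count that fits the GPU node types at both Perlmutter and Tahoma. The task name and output file are hypothetical, and it assumes the nvidia-smi utility is visible inside the container, which depends on how the site's container runtime exposes the NVIDIA driver.

```wdl
version 1.0

# Hypothetical check task: records the GPUs visible to the container.
task list_gpus {
  command <<<
    # Prints one line per visible GPU; assumes nvidia-smi is on the PATH.
    nvidia-smi -L > gpus.txt
  >>>

  output {
    Array[String] gpus = read_lines("gpus.txt")
  }

  runtime {
    docker: "pytorch/pytorch:latest"  # must be CUDA-enabled
    memory: "16GiB"
    cpu: 4
    gpu: true           # required for GPU access
    gpuCount: 2         # only applies when gpu: true
    runtime_minutes: 10
  }
}
```

If the gpus output contains fewer entries than gpuCount, the allocation did not match the request. Seeing both GPUs listed still does not mean your application uses both; that requires a multi-GPU framework, as noted in the warning above.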