HTCondor Backend Configuration Options When Creating WDL
Use the following tables to help you configure your runtime{} section.
How to Allocate Resources in your Runtime Section
HTCondor is the backend to Cromwell and is responsible for grabbing an appropriately sized resource from Slurm for each WDL task. HTCondor determines what resources your task needs solely from the memory and cpu values set in the runtime{} section. These have defaults of "2G" and 1 (threads), respectively, so you don't have to include them, but doing so is advised for reproducibility.
Note
Inside the runtime{} section of your WDL, the cpu key should be set to a number of threads, not CPUs; despite its name, HTCondor expects that value to be threads.
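As a minimal sketch (the workflow and task names, container image, and resource values here are illustrative, not requirements):

version 1.0

workflow Resource_Example {
  call Echo_Task
}

task Echo_Task {
  command {
    echo "running with explicit resources"
  }
  runtime {
    docker: "ubuntu:20.04"  # illustrative container image
    memory: "4G"            # defaults to "2G" if omitted
    cpu: 2                  # HTCondor interprets this as 2 threads, not 2 CPUs
  }
}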
Table of available resources
| Site | Type | #Nodes | Mem (GB)* | Minutes | #Threads |
|---|---|---|---|---|---|
| Perlmutter (NERSC) | Large | 3072 | 492 | 2865 | 256 |
| JGI (Lab-IT) | Large | 8 | 492 | 4305 | 32 |
| Dori (Lab-IT) | Large | 100 | 492 | 4305 | 64 |
| Dori (Lab-IT) | Xlarge | 18 | 1480 | 20160 | 72 |
| Tahoma (EMSL) | Medium | 184 | 364 | 2865 | 36 |
| Tahoma (EMSL) | Xlarge | 24 | 1480 | 2865 | 36 |
| Defiant (OLCF) | Medium | 36 | 256 | 1425 | 128 |

Note: The Defiant site is not available yet.
Memory Overhead
*The Mem (GB) column shows the gigabytes you can actually use, which is less than a node's advertised memory because of overhead. For example, a "large" node on Dori is advertised at 512G, but since there is overhead, we reserve 20G and instead ask for 492G in our WDL.
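For instance, a runtime{} section sized to a full Dori "large" node would request the table's usable figure rather than the advertised 512G (whether your task needs a whole node depends on your workload):

runtime {
  memory: "492G"  # usable memory on a Dori large node (512G advertised minus 20G overhead)
  cpu: 64         # all threads on the node, per the table above
}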
Time Overhead
When Cromwell submits a task, HTCondor manages job scheduling by checking the queue for available resources. The JAWS Pool Manager monitors HTCondor and, when needed, requests new Slurm nodes. Once a compute node is available, HTCondor submits the task.
Due to a slight delay (a few seconds) in resource allocation, we build in a time buffer to ensure jobs get the full requested time. For example, instead of requesting the maximum 48 hours on Perlmutter, we request 47 hours and 45 minutes to account for the delay.
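The same buffering applies to your own requests: keep runtime_minutes (the key used in the GPU example below) at or below the Minutes column for your target site. A sketch for Perlmutter, with illustrative memory and cpu values:

runtime {
  memory: "10G"           # illustrative
  cpu: 4                  # illustrative; interpreted as threads
  runtime_minutes: 2865   # Perlmutter's buffered maximum (48 hours minus 15 minutes)
}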
Note
Remember that in your runtime{} section, the number you give cpu: is interpreted by HTCondor as threads, not CPUs.
GPU Resources
JAWS now supports GPU-enabled tasks at NERSC for the following site:
Perlmutter
How to Request GPU Resources in WDL
To enable GPU support in a JAWS-run workflow, set gpu: true in the task’s runtime block. Here’s an example:
version 1.0

workflow GPU_Test_Workflow {
  call GPU_Test_Task
}

task GPU_Test_Task {
  command {
    echo "Testing GPU runtime capabilities"
    python3 -c "import torch; print('CUDA Available:', torch.cuda.is_available()); print('Number of GPUs:', torch.cuda.device_count()); print('1 GPU Available:', torch.cuda.device_count() >= 1)"
  }
  output {
    File gpu_test_output = stdout()
  }
  runtime {
    docker: "pytorch/pytorch:latest"
    memory: "1GiB"
    cpu: 1
    gpu: true
    runtime_minutes: 10
  }
}
Key Notes:
- gpu: true enables GPU resource allocation for the task.
- JAWS will request 1 GPU node via the scheduler at the selected compute site.
- Ensure the Docker container used supports GPU execution (e.g., pytorch/pytorch:latest includes CUDA).