=========================================
Specifying Compute Resources in WDL Tasks
=========================================

.. role:: bash(code)
   :language: bash

.. role:: red

Use the following tables to help figure out how to configure your :bash:`runtime{}` section.

How to Allocate Resources in your Runtime Section
=================================================

HTCondor is the back end to Cromwell and is responsible for grabbing the appropriately sized resource from Slurm for each WDL task. HTCondor can determine what resource your task needs from only :bash:`memory` and :bash:`cpu`, which are set in the :bash:`runtime{}` section. In fact, :bash:`memory` and :bash:`cpu` default to "2G" and 1 (thread), respectively, so you don't have to include them, but setting them explicitly is advised for reproducibility.

.. note::
    In the :bash:`runtime{}` section of your WDL, set the :bash:`cpu` key to the number of threads, not the number of CPUs. Despite the key's name, HTCondor interprets this value as threads.

.. _table-of-available-resources:

Table of available resources
============================

+---------------------+--------------------------------------+--------+-----------+---------+---------+--------+
| Site                | Type                                 | #Nodes | Mem (GB)* | Minutes | #Threads| #GPUs  |
+=====================+======================================+========+===========+=========+=========+========+
| Perlmutter (NERSC)  | Large                                | 3072   | 492       | 2865    | 256     | 0      |
+                     +--------------------------------------+--------+-----------+---------+---------+--------+
|                     | GPU (4x NVIDIA A100 (40GB))          | 1536   | 256       | 2865    | 128     | 4      |
+                     +--------------------------------------+--------+-----------+---------+---------+--------+
|                     | GPU (4x NVIDIA A100 (80GB))          | 256    | 256       | 2865    | 128     | 4      |
+---------------------+--------------------------------------+--------+-----------+---------+---------+--------+
| JGI (Lab-IT)        | Large                                | 8      | 492       | 4305    | 32      | 0      |
+---------------------+--------------------------------------+--------+-----------+---------+---------+--------+
| Dori (Lab-IT)       | Large                                | 100    | 492       | 4305    | 64      | 0      |
+                     +--------------------------------------+--------+-----------+---------+---------+--------+
|                     | Xlarge                               | 16     | 1980      | 20160   | 128     | 0      |
+---------------------+--------------------------------------+--------+-----------+---------+---------+--------+
| Tahoma (EMSL)       | Medium                               | 184    | 364       | 2865    | 36      | 0      |
+                     +--------------------------------------+--------+-----------+---------+---------+--------+
|                     | Xlarge                               | 24     | 1480      | 2865    | 36      | 0      |
+                     +--------------------------------------+--------+-----------+---------+---------+--------+
|                     | GPU (2x NVIDIA Tesla V100 32GB)      | 24     | 1480      | 2865    | 36      | 2      |
+---------------------+--------------------------------------+--------+-----------+---------+---------+--------+
| Crux (ALCF)         | Medium                               | 256    | 256       | 1425    | 256     | 0      |
+---------------------+--------------------------------------+--------+-----------+---------+---------+--------+
| Defiant (OLCF)      | Medium                               | 20     | 492       | 4305    | 128     | 0      |
+---------------------+--------------------------------------+--------+-----------+---------+---------+--------+

Note: The `Defiant` site is still being tested and may not yet be available for production use.

.. admonition:: Memory Overhead

    The "Mem (GB)" value is the number of gigabytes you can actually use once overhead is accounted for. For example, a "large" node on Dori is advertised at 512G, but 20G is reserved for overhead, so the most you should ask for in your WDL is 492G.

.. admonition:: Time Overhead

    When Cromwell submits a task, HTCondor manages job scheduling by checking the queue for available resources. The JAWS Pool Manager monitors HTCondor and, when needed, requests new Slurm nodes. Once a compute node is available, HTCondor submits the task. Because resource allocation introduces a slight delay (a few seconds), we build in a time buffer to ensure jobs get the full requested time. For example, instead of requesting the maximum 48 hours on Perlmutter, we request 47 hours and 45 minutes (2865 minutes) to account for the delay.

Links to documentation about each cluster:
------------------------------------------

* `Dori cluster `_
* `Perlmutter cluster `_
* `Lawrencium cluster `_
* `Tahoma cluster `_
* `Defiant cluster `_
* `Crux cluster `_

.. note::
    Remember that in your :bash:`runtime{}` section, HTCondor interprets the number you give :bash:`cpu:` as threads, not CPUs.

GPU Resources
=============

.. attention::
    📘 **For comprehensive GPU documentation**, see the :doc:`GPU Usage Guide `.

    The GPU guide includes: runtime stanza reference, container requirements, troubleshooting, WDL examples, and best practices.

JAWS supports GPU-enabled tasks at the following sites:

- **Perlmutter (NERSC)**: NVIDIA A100 GPUs
- **Tahoma (EMSL)**: NVIDIA Tesla V100 GPUs

See the :ref:`table-of-available-resources` above for GPU node specifications and memory details.

Quick Start: GPU Runtime Stanza
--------------------------------

**Minimal GPU configuration** (recommended starting point):

.. code-block:: text

    runtime {
        docker: "pytorch/pytorch:latest"  # Must be CUDA-enabled
        memory: "16GiB"
        cpu: 4
        gpu: true                         # Enable GPU (gpuCount defaults to 1)
        runtime_minutes: 60
    }

GPU Runtime Attributes
^^^^^^^^^^^^^^^^^^^^^^

.. list-table::
   :header-rows: 1
   :widths: 20 15 15 50

   * - Attribute
     - Type
     - Default
     - Description
   * - :bash:`gpu`
     - Boolean
     - :bash:`false`
     - Set to :bash:`true` to enable GPU allocation. **Required** for GPU access.
   * - :bash:`gpuCount`
     - Integer
     - :bash:`1`
     - Number of GPUs to request. Only applies when :bash:`gpu: true`. **Most tasks should use 1.**

**Default behavior**: When :bash:`gpu: true` is set without :bash:`gpuCount`, JAWS allocates **1 GPU**.

.. code-block:: text

    # These are equivalent:
    runtime { gpu: true }
    runtime { gpu: true, gpuCount: 1 }

.. warning::
    Requesting :bash:`gpuCount > 1` does **not** automatically parallelize your code. Your application must explicitly use multi-GPU frameworks (e.g., PyTorch DistributedDataParallel, Horovod).

Key Notes:
^^^^^^^^^^

- ``gpu: true`` enables GPU resource allocation for the task.
- ``gpuCount: 3`` requests 3 GPUs from the scheduler. If ``gpuCount`` is not specified, JAWS requests **1 GPU** by default when ``gpu: true`` is set.
- Ensure the Docker container you use supports GPU execution (e.g., ``pytorch/pytorch:latest`` includes CUDA).
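
To show how the runtime attributes above fit into a complete task, here is a minimal sketch of a WDL task that reports whether the container can see a GPU. The task name ``gpu_check``, its command, and its output are hypothetical and chosen only for illustration; the runtime keys and values are the ones described in this section. The sketch assumes that the ``pytorch/pytorch:latest`` image provides ``python`` with PyTorch installed.

.. code-block:: text

    version 1.0

    # Hypothetical example task: prints "True" if PyTorch can see a CUDA device.
    task gpu_check {
        command <<<
            python -c "import torch; print(torch.cuda.is_available())"
        >>>

        output {
            # Capture whatever the command printed to standard output.
            String cuda_available = read_string(stdout())
        }

        runtime {
            docker: "pytorch/pytorch:latest"  # Must be CUDA-enabled
            memory: "16GiB"
            cpu: 4                            # Interpreted by HTCondor as threads
            gpu: true                         # Required for GPU access
            gpuCount: 1                       # Explicit, but same as the default
            runtime_minutes: 60
        }
    }

If ``cuda_available`` comes back as ``False``, the likely causes are a missing ``gpu: true`` or a container image without CUDA support; the GPU Usage Guide covers troubleshooting in more detail.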