==================
How to build WDLs
==================

.. role:: bash(code)
   :language: bash

Summary
=======

This is **Lesson 3** of the JAWS tutorial series. In :doc:`Lesson 1 ` you ran the BLAST workflow as plain bash. In :doc:`Lesson 2 ` you packaged the same commands into a Docker image. This lesson wraps those containerized commands into a **WDL workflow** that Cromwell (and JAWS) can execute.

Three terms to keep straight:

- **WDL** (Workflow Description Language) is what you write. It describes *which* commands to run, in what order, what inputs each step takes, and what outputs it produces.
- **Cromwell** is the engine that *executes* WDL workflows. JAWS uses Cromwell internally; you can also run Cromwell directly on your laptop while developing.
- A **task** is one unit of work (one ``runtime { docker: ... }`` block, one command). A **workflow** strings tasks together.

By the end of this lesson you'll have a single ``blast.wdl`` file with three tasks, ``make_db``, ``run_blast``, and ``summarize``, that runs the same workflow end-to-end inside containers, plus a visual DAG of how the tasks connect.

.. admonition:: Pre-requisites

   - Completed :doc:`Lesson 1: Local Development Environment ` and :doc:`Lesson 2: Docker Containers `.
   - You have the ``jaws-tutorial-examples`` repo cloned and the ``blast_example/`` Docker image either built locally or available on Docker Hub.
   - JAWS uses **WDL 1.0**. The WDL examples in this tutorial are all 1.0-compliant. WDL 1.1 syntax will not parse with the Cromwell version JAWS ships.

A few useful references for later:

* The official `OpenWDL 1.0 Specification `_
* `Real world examples on Dockstore `_
* Re-usable subworkflow tasks: `wdl-tasks `_

The Big Picture
===============

The BLAST workflow you ran in Lessons 1 and 2 has three steps:

.. code-block:: bash

   # 1. Build a BLAST database from the reference FASTA
   makeblastdb -in reference.fasta -dbtype nucl -out blastdb/ref

   # 2. Search the query FASTA against that database
   blastn -query query.fasta -db blastdb/ref -outfmt 6 -out hits.tsv

   # 3. Summarize: how many queries got at least one hit?
   nqueries=$(grep -c '^>' query.fasta)
   nhits=$(awk '{print $1}' hits.tsv | sort -u | wc -l)
   echo "$nhits of $nqueries query sequences had at least one BLAST hit."

Each step becomes one WDL **task**. The output of one task becomes the input of the next; that is how Cromwell figures out the order to run them in.

A WDL task has four sections:

.. code-block:: bash

   task task_name {
       input { ... }        # Files and parameters the task needs
       command <<< ... >>>  # The bash commands that actually do the work
       output { ... }       # Files the task produces (made available to later tasks)
       runtime { ... }      # Docker image, CPU, memory, time limit
   }

The runtime block is where Lesson 2's Docker image gets used. You wrote ``docker run --volume ... blast.sh ...`` by hand in Lesson 2; the WDL ``runtime { docker: ... }`` block does the equivalent automatically for every task, with Cromwell choosing Docker, Apptainer, or Shifter depending on the site.

.. admonition:: Why this WDL doesn't use the image you built in Lesson 2
   :class: note

   The tasks below use ``docker: "ncbi/blast:latest"``, the **official BLAST image** that NCBI publishes on Docker Hub. NCBI already publishes a well-maintained public image with ``makeblastdb`` and ``blastn`` installed, so there's no reason for our workflow to depend on a one-off image we built ourselves.

   The image you built and pushed in Lesson 2 was a learning exercise: every JAWS user needs to know how to package a tool that *doesn't* already have a community-maintained image.

   Before you build your own, take a moment to check whether the third-party tool you want to run already has an official or community-maintained image on Docker Hub or Quay.io. If a well-maintained image exists, point your WDL's ``runtime { docker: ... }`` at it and skip the build entirely.
   Build your own when no good image exists, when you need a specific tool combination, or when you need to bundle a custom script alongside the tool.

Task 1: ``make_db``
===================

The first task builds the BLAST database from a reference FASTA. It takes one input file and emits the resulting database files (multiple files share a prefix, so we collect them with ``glob``).

.. code-block:: bash

   task make_db {
       input {
           File reference_fasta
       }

       command <<<
           mkdir -p blastdb
           makeblastdb \
               -in ~{reference_fasta} \
               -dbtype nucl \
               -out blastdb/ref
       >>>

       output {
           Array[File] blast_db_files = glob("blastdb/ref.*")
       }

       runtime {
           docker: "ncbi/blast:latest"
           cpu: 1
           memory: "1 GB"
           runtime_minutes: 5
       }
   }

Three things to notice:

- ``~{reference_fasta}`` is how you interpolate a WDL variable into the command. The older ``${reference_fasta}`` syntax also works, but ``~{}`` is preferred because ``${}`` collides with bash variable expansion and produces hard-to-diagnose bugs.
- ``Array[File]`` plus ``glob(...)`` is the standard pattern for "this task produces several files I don't know the exact names of." ``makeblastdb`` emits ``ref.nhr``, ``ref.nin``, ``ref.nsq``, and a few more; ``glob("blastdb/ref.*")`` collects them all into one list.
- The ``runtime`` block points at the public ``ncbi/blast:latest`` image on Docker Hub. JAWS pulls this image automatically on whichever compute site the task lands.

.. hint::

   The ``command`` body can be wrapped in either curly braces ``{ ... }`` or triple angle braces ``<<< ... >>>``. The two differ in how they handle bash variable interpolation:

   +-------------------------+----------------------------+
   | Command body style      | Placeholder syntax         |
   +=========================+============================+
   | ``command { ... }``     | ``~{}`` (preferred) or     |
   |                         | ``${}``                    |
   +-------------------------+----------------------------+
   | ``command <<< ... >>>`` | ``~{}`` **only**           |
   +-------------------------+----------------------------+

   Prefer ``<<< ... >>>`` plus ``~{}``. It removes the ambiguity between WDL variables and bash variables, and is what every example in this lesson uses.

Task 2: ``run_blast``
=====================

The second task searches the query FASTA against the database. Two new pieces appear: the task takes an ``Array[File]`` input (the database files from ``make_db``), and the command has a small bit of setup to put those files back into one directory.

.. code-block:: bash

   task run_blast {
       input {
           File query_fasta
           Array[File] blast_db_files
       }

       command <<<
           # Cromwell stages each input file into its own directory.
           # blastn needs all the database files in one place to find them
           # with a single -db prefix, so we copy them into blastdb/.
           mkdir -p blastdb
           for f in ~{sep=' ' blast_db_files}; do
               cp "$f" blastdb/
           done

           blastn \
               -query ~{query_fasta} \
               -db blastdb/ref \
               -outfmt 6 \
               -out hits.tsv

           # Always produce hits.tsv, even if there were no hits, so the
           # downstream task has a file to read.
           touch hits.tsv
       >>>

       output {
           File hits_tsv = "hits.tsv"
       }

       runtime {
           docker: "ncbi/blast:latest"
           cpu: 1
           memory: "1 GB"
           runtime_minutes: 5
       }
   }

The ``~{sep=' ' blast_db_files}`` syntax expands an array into a space-separated string, which is what bash ``for`` loops expect.

.. dropdown:: Why does Cromwell stage each file in its own directory? :ref:`🔗 `
   :color: info
   :name: wdl-cromwell-staging
   :animate: fade-in

   Cromwell's default file localization strategy puts each input ``File`` into a separate subdirectory of the task's working directory. That keeps file names unique across inputs (two different inputs can both be called ``reads.fastq`` without colliding), but it also means a multi-file output from one task arrives at the next task as a list of paths, not as a directory.

   Most of the time you don't notice. The places it bites you:

   - Tools like ``blastn``, ``bwa``, and ``samtools`` that expect a *family* of files sharing a prefix (``ref.nhr``, ``ref.nin``, ``ref.nsq``, ...).
     The fix is the ``mkdir + cp`` pattern shown above, or symlinks (``ln -s``) if you'd rather not duplicate data.
   - Tools that hard-code looking for sidecar files (e.g. ``.bai`` next to a ``.bam``). Same fix.

   If your task ever errors out with "couldn't find ``X.idx``" or similar, this is almost always why.

Task 3: ``summarize``
=====================

The third task reads the hits table and writes a one-line summary. It doesn't need BLAST, so it uses a lighter base image:

.. code-block:: bash

   task summarize {
       input {
           File hits_tsv
           File query_fasta
       }

       command <<<
           # `xargs` collapses leading whitespace that BSD `wc` adds on macOS.
           # Harmless on GNU/Linux where `wc` doesn't pad.
           nqueries=$(grep -c '^>' ~{query_fasta} | xargs)
           nhits=$(awk '{print $1}' ~{hits_tsv} | sort -u | wc -l | xargs)
           echo "$nhits of $nqueries query sequences had at least one BLAST hit." > summary.txt
           cat summary.txt
       >>>

       output {
           File summary_txt = "summary.txt"
       }

       runtime {
           docker: "ubuntu:22.04"
           cpu: 1
           memory: "512 MB"
           runtime_minutes: 5
       }
   }

Two things worth noticing:

- A task can use a different Docker image from the rest of the workflow. ``summarize`` only needs ``grep``, ``awk``, ``sort``, and ``wc``, all of which ship with stock Ubuntu, so there's no reason to drag in the ~1 GB BLAST image for this step.
- The ``set -eo pipefail`` idiom is also worth knowing for ``command`` blocks that use pipes. Without it, a failure in the middle of a pipeline is silently swallowed (the exit code of the last command wins), which makes debugging much harder. We omitted it here for readability, but in production workflows it's a good default.

Workflow Definition
===================

Now you connect the three tasks. The workflow block lists the inputs the *workflow* takes, then calls each task and wires outputs to inputs.

.. code-block:: bash

   version 1.0

   workflow blast_example {
       input {
           File reference_fasta
           File query_fasta
       }

       call make_db {
           input:
               reference_fasta = reference_fasta
       }

       call run_blast {
           input:
               query_fasta = query_fasta,
               blast_db_files = make_db.blast_db_files
       }

       call summarize {
           input:
               hits_tsv = run_blast.hits_tsv,
               query_fasta = query_fasta
       }

       output {
           File hits_tsv = run_blast.hits_tsv
           File summary_txt = summarize.summary_txt
       }
   }

The wiring is in the ``input:`` lines of each ``call``:

- ``make_db.blast_db_files`` refers to the output ``blast_db_files`` of the ``make_db`` task. This is how Cromwell figures out that ``run_blast`` depends on ``make_db`` and must wait for it.
- ``run_blast.hits_tsv`` plugs into ``summarize``. Same idea.

Cromwell builds the dependency graph from these references. Tasks with no data dependency on each other run in **parallel** automatically; you don't have to ask for it.

.. note::

   Tasks are defined *outside* the workflow block; ``call`` statements live *inside* it. WDL lets you put the tasks before or after the workflow block; for readability, this tutorial puts the workflow first.

.. note::

   The very first line, ``version 1.0``, is required (see the prerequisites note on WDL 1.0).

Putting It All Together
=======================

Combine the workflow block and the three tasks into a single file, ``blast.wdl``:

.. code-block:: bash

   version 1.0

   workflow blast_example {
       input {
           File reference_fasta
           File query_fasta
       }

       call make_db {
           input:
               reference_fasta = reference_fasta
       }

       call run_blast {
           input:
               query_fasta = query_fasta,
               blast_db_files = make_db.blast_db_files
       }

       call summarize {
           input:
               hits_tsv = run_blast.hits_tsv,
               query_fasta = query_fasta
       }

       output {
           File hits_tsv = run_blast.hits_tsv
           File summary_txt = summarize.summary_txt
       }
   }

   task make_db {
       input {
           File reference_fasta
       }

       command <<<
           mkdir -p blastdb
           makeblastdb \
               -in ~{reference_fasta} \
               -dbtype nucl \
               -out blastdb/ref
       >>>

       output {
           Array[File] blast_db_files = glob("blastdb/ref.*")
       }

       runtime {
           docker: "ncbi/blast:latest"
           cpu: 1
           memory: "1 GB"
           runtime_minutes: 5
       }
   }

   task run_blast {
       input {
           File query_fasta
           Array[File] blast_db_files
       }

       command <<<
           mkdir -p blastdb
           for f in ~{sep=' ' blast_db_files}; do
               cp "$f" blastdb/
           done

           blastn \
               -query ~{query_fasta} \
               -db blastdb/ref \
               -outfmt 6 \
               -out hits.tsv

           touch hits.tsv
       >>>

       output {
           File hits_tsv = "hits.tsv"
       }

       runtime {
           docker: "ncbi/blast:latest"
           cpu: 1
           memory: "1 GB"
           runtime_minutes: 5
       }
   }

   task summarize {
       input {
           File hits_tsv
           File query_fasta
       }

       command <<<
           nqueries=$(grep -c '^>' ~{query_fasta} | xargs)
           nhits=$(awk '{print $1}' ~{hits_tsv} | sort -u | wc -l | xargs)
           echo "$nhits of $nqueries query sequences had at least one BLAST hit." > summary.txt
           cat summary.txt
       >>>

       output {
           File summary_txt = "summary.txt"
       }

       runtime {
           docker: "ubuntu:22.04"
           cpu: 1
           memory: "512 MB"
           runtime_minutes: 5
       }
   }

The complete file is also already in the example repo at ``jaws-tutorial-examples/blast_example/blast.wdl`` if you'd rather diff against it than retype.

Validate
========

Before running the workflow, check that it parses cleanly. There are two validators you'll use at different stages.

Validate locally with ``miniwdl check``
---------------------------------------

While you're iterating on the WDL on your laptop, the fastest feedback loop is `miniwdl `_, an independent WDL parser.
It's a single ``pip install`` and reports syntax and basic type errors in well under a second.

.. code-block:: bash

   pip install miniwdl
   miniwdl check blast.wdl

A clean run prints something like ``SUCCESS`` plus a brief summary of the workflow's inputs, calls, and outputs. Any syntax error is reported with a file and line number.

Validate against JAWS with ``jaws validate``
--------------------------------------------

Before you submit a workflow to JAWS, run ``jaws validate``. This uses the **same WDL parser JAWS will use** to actually run the workflow, so it catches issues that ``miniwdl`` might miss (typically ones related to specific Cromwell-version quirks).

.. code-block:: bash

   # On Dori, after activating the JAWS environment
   module load jaws
   jaws validate blast.wdl
   # Workflow is OK

If ``miniwdl`` is happy and ``jaws validate`` complains, the failure is almost always a feature from a WDL version newer than 1.0 creeping into your file (JAWS targets WDL 1.0). Common culprits: ``hints`` blocks, the ``Directory`` type, or the ``input { File? optional_file = None }`` syntax.

.. dropdown:: ``miniwdl check`` succeeds but ``jaws validate`` fails :ref:`🔗 `
   :color: info
   :name: wdl-miniwdl-vs-jaws-validate
   :animate: fade-in

   The most common cause is syntax from a newer WDL version sneaking into a file declared ``version 1.0``. ``miniwdl`` is lenient and parses several newer constructs even in 1.0 files; the Cromwell version JAWS uses is stricter. Things to check:

   - **``struct`` definitions are not a version problem.** Structs are part of the WDL 1.0 specification, so a ``struct`` by itself should not be rejected on version grounds. If an error points at a struct, look at the definition itself rather than the WDL version.
   - **No ``hints { ... }`` block.** That block is not part of WDL 1.0. Put cache/runtime hints in the ``runtime`` block in 1.0.
   - **Optional inputs use ``File?`` syntax, not ``File? = None``.** The ``None`` literal was introduced in WDL 1.1.
   - **No ``input { Directory dir }`` type.** ``Directory`` is not part of WDL 1.0. Use ``Array[File]`` with ``glob`` instead.
   If you've ruled all of those out and ``jaws validate`` still fails, paste the error in ``#jaws`` on Slack with the WDL attached. Real Cromwell bugs do happen; the team is happy to look.

Execute Locally
===============

You can run Cromwell directly on your laptop to make sure the WDL works end-to-end before submitting it to JAWS. This requires Docker (Lesson 2), a Java runtime (Cromwell is a JAR file), and a copy of Cromwell itself.

If you don't already have Java, install a current LTS release; OpenJDK 17 or 21 both work with Cromwell 87. The simplest paths:

.. code-block:: bash

   # macOS (Homebrew)
   brew install openjdk@17

   # Ubuntu / Debian
   sudo apt install openjdk-17-jre

   # Verify
   java -version

JAWS ships **Cromwell 87**, so use that version locally too. Mismatched Cromwell versions are a common source of "works on my laptop, fails on JAWS" bugs.

.. code-block:: bash

   # Grab the matching Cromwell jar (one-time)
   wget https://github.com/broadinstitute/cromwell/releases/download/87/cromwell-87.jar

   # Run from inside blast_example/ so the relative paths in inputs.json resolve
   cd jaws-tutorial-examples/blast_example
   java -jar /path/to/cromwell-87.jar run blast.wdl -i inputs.json

You'll see Cromwell start each task, pull the ``ncbi/blast`` and ``ubuntu`` images on the first run (subsequent runs reuse them), and finish with a JSON block listing the workflow outputs.

The output files live under ``cromwell-executions/blast_example//call-/execution/``. Each task gets its own directory containing ``stdout``, ``stderr``, ``script``, and any files it produced. When something fails, ``stderr`` is the first thing to read.

.. note::

   ``inputs.json`` is the file that tells Cromwell which files to feed into the workflow. It's covered in detail in :doc:`Lesson 4: Defining the Input Data `. For now, the one in ``blast_example/`` already points at the sample reference and query FASTAs, so the ``cromwell run`` command above works as-is.
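When a local run fails, it can help to read every task's ``stderr`` in one pass instead of opening each call directory by hand. The following is a minimal sketch, assuming the layout described above (each call directory contains ``execution/stderr``) and that you run it from the directory where you launched Cromwell. The function name ``show_task_stderr`` is hypothetical; it is not part of Cromwell or JAWS.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the stderr file of every task found under a
# Cromwell execution root (defaults to ./cromwell-executions).
show_task_stderr() {
    local root="${1:-cromwell-executions}"

    if [ ! -d "$root" ]; then
        echo "no such directory: $root" >&2
        return 1
    fi

    # Only pick up files named "stderr" that sit inside an execution/ directory.
    # Sorting keeps the order stable between invocations.
    find "$root" -type f -name stderr -path '*/execution/*' | sort |
    while IFS= read -r f; do
        echo "== $f"
        cat "$f"
    done
}
```

For example, ``show_task_stderr cromwell-executions/blast_example`` limits the listing to runs of this lesson's workflow. An empty section under a header simply means that task wrote nothing to ``stderr``.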
Visualize Your Workflow
=======================

Once the WDL grows past three or four tasks, a picture is easier to reason about than text. The Broad Institute's `WOMtool `_ generates a Graphviz dot file from any WDL workflow.

Install dependencies:

.. code-block:: bash

   # Download the matching WOMtool jar (Cromwell 87)
   wget https://github.com/broadinstitute/cromwell/releases/download/87/womtool-87.jar

   # Graphviz provides the `dot` command
   brew install graphviz      # macOS
   sudo apt install graphviz  # Linux

Generate the diagram:

.. code-block:: bash

   java -jar womtool-87.jar graph blast.wdl > blast.dot
   dot -Tpng blast.dot -o blast.png
   open blast.png  # macOS; use `xdg-open` on Linux

For the BLAST workflow you'll see three nodes, ``make_db``, ``run_blast``, ``summarize``, with arrows from ``make_db`` to ``run_blast`` and from ``run_blast`` to ``summarize``. ``summarize`` also has an arrow from the workflow's ``query_fasta`` input.

.. figure:: /Figures/blast.png
   :class: with-shadow
   :scale: 100%
   :align: center

   DAG of the ``blast_example`` workflow generated by WOMtool + Graphviz.

The DAG is generated entirely from the data dependencies you wrote into the ``call`` blocks; no extra annotation is required. If the arrows in your DAG don't match the order you expected, that's a clue your WDL's ``input:`` wiring isn't doing what you thought.

Troubleshooting
===============

.. dropdown:: "Failed to evaluate input 'X'": Cromwell rejects an input expression :ref:`🔗 `
   :color: info
   :name: wdl-input-evaluation
   :animate: fade-in

   Usually means a typo in a ``call`` block's ``input:`` line. Cromwell reports the task name and the input name, so start there. The two most frequent causes:

   - Referring to a task output that doesn't exist (``call run_blast { input: blast_db = make_db.blastdb }`` when the actual output name is ``blast_db_files``).
   - Forgetting that task outputs are referenced as ``.``, not ``.``.

   Run ``miniwdl check blast.wdl`` first; it catches most of these before Cromwell sees them.
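For the first cause above (a reference to an output that does not exist), a quick text scan can list every dotted reference used on the right-hand side of an assignment, so you can compare the names against each task's ``output`` block. This is a rough sketch using only ``grep`` and ``sed``; the function name ``list_call_references`` is hypothetical, and the pattern is a heuristic that assumes each reference sits on one line in the form ``name = task.output``, as it does in ``blast.wdl``. It is not a WDL parser and is no substitute for ``miniwdl check``.

```shell
#!/usr/bin/env bash
# Hypothetical helper: list "task.output"-style references that appear on the
# right-hand side of an "=" in a WDL file, prefixed with the line number.
list_call_references() {
    local wdl="$1"

    # -n adds line numbers, -o prints only the matched text, -E enables
    # extended regular expressions. Quoted strings such as "hits.tsv" are
    # skipped because the first character after "=" must be a letter or "_".
    grep -noE '=[[:space:]]*[A-Za-z_][A-Za-z0-9_]*\.[A-Za-z_][A-Za-z0-9_]*' "$wdl" |
        sed -E 's/=[[:space:]]*//'
}
```

Running ``list_call_references blast.wdl`` prints one ``line:task.output`` pair per reference. Any pair whose output name is not declared in the corresponding task is a likely source of the error.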
.. dropdown:: "Cromwell ran the task but my output file is empty / missing" :ref:`🔗 `
   :color: info
   :name: wdl-empty-output
   :animate: fade-in

   The command ran but didn't produce the file the ``output`` block declared. Look at ``cromwell-executions///call-/execution/``:

   - ``stderr`` shows what the command printed; this is where a missing tool or "file not found" error will surface.
   - ``script`` is the actual shell script Cromwell generated and ran. If a WDL variable interpolation went wrong, you'll see it here as the wrong literal value.
   - ``stdout`` is whatever the command printed to stdout. Often empty unless your task echoes things.

   A common cause: the command writes to a *different* filename than the ``output { File foo = "actualname.txt" }`` block declares. Cromwell looks for the exact filename you wrote; case and spelling have to match.

.. dropdown:: ``runtime { docker: ... }`` image fails to pull during a JAWS run :ref:`🔗 `
   :color: info
   :name: wdl-docker-pull-failed
   :animate: fade-in

   A few things to check:

   - The image is **public** on Docker Hub (or in a registry JAWS has credentials for). Private Docker Hub images need to be in an organization JAWS has been given access to; ask in ``#jaws`` if you need that set up.
   - The image **tag still exists**. ``:latest`` and other moving tags can disappear. If reproducibility matters (it usually does), pin the SHA256 digest as described in Lesson 2's "Prefer SHA256 Digests Over Tags" section.
   - For JGI GitLab registry images (``library.jgi.doe.gov:5050/...``), the image pulls fine but **call-caching is disabled** for them; see Lesson 2 Step 4.

What's Next
===========

You now have a working WDL workflow that runs end-to-end inside containers. Next:

- :doc:`Lesson 4: Defining the Input Data `, which covers the ``inputs.json`` file in detail, file paths vs. URLs, lists and maps, when to use ``/fast_scratch``, and how JAWS caches your input files between runs.
See also
========

- `OpenWDL 1.0 Specification `_, the authoritative reference for the language.
- `wdl-tasks `_, a JGI library of reusable subworkflow tasks you can ``import`` rather than re-write.
- :doc:`../Resources/best_practices`, recommended patterns for WDL workflows that scale on JAWS.