================================
Best Practices for Creating WDLs
================================

.. role:: listsize
.. role:: textborder
.. role:: bash(code)

There are opportunities to participate in code reviews with other WDL developers. :doc:`Contact us `

----------------------

.. dropdown:: `set -euo pipefail`
    :color: info
    :animate: fade-in

    The :bash:`set -euo pipefail` command combines three shell options. A fourth option, :bash:`set -x`, is often added while debugging:

    - use :bash:`set -e` to make your script exit when a command fails.
    - use :bash:`set -u` to make your script exit when it tries to use an undeclared variable.
    - use :bash:`set -o pipefail` to catch failures in earlier stages of a pipeline, such as "cat myfile" in "cat myfile | grep id". Instead of only the exit code of "grep id" being returned, the pipeline returns a non-zero exit code when "cat myfile" fails.
    - use :bash:`set -x` to trace what gets executed. This is useful for debugging.

    `set -euo pipefail` can be useful as the first line within the `command <<< >>>` section of a WDL task. It helps capture errors at the point where they occur in your shell code, rather than letting commands run past the failure, which makes debugging more difficult. Put another way, without :bash:`set -e` the WDL task uses the exit code of the last command even if an earlier command failed.

    However, :bash:`set -euo pipefail` can cause the task to exit without any error printed to stderr, so it is not always appropriate to use.

.. dropdown:: Watch out for silent failures with `grep`
    :color: info
    :animate: fade-in

    **Silent Killers in WDL Tasks**

    Certain commands like :bash:`grep` can cause silent failures that are extremely difficult to troubleshoot. The :bash:`grep` command returns exit code 1 when it finds no matches, which, when using `set -e`, will cause your task to fail without any helpful error message.

    **The Problem:**

    .. code-block:: bash

        echo "Hello world" > file.txt
        grep bye file.txt
        echo $?
        # Returns 1, but with no error message in stderr, making debugging very difficult.

    If `grep` finds no matches, the task will fail with :bash:`rc=1` and no error message in `stderr`, making debugging very challenging.

    **The Simple Fix:**

    Add :bash:`|| true` to force exit code 0 even when no matches are found. Note that this also hides genuine errors, such as a missing input file:

    .. code-block:: bash

        echo "Hello world" > file.txt
        grep bye file.txt || true
        echo $?

    **The Better Fix:**

    To distinguish between "no matches found" (`rc=1`) and actual errors (`rc > 1`, like a missing input file), use:

    .. code-block:: bash

        echo "Hello world" > file.txt
        grep bye file.txt || if [ $? -gt 1 ]; then exit 1; fi

    This approach allows `rc=1` (no matches) but still exits on real errors (`rc > 1`).

    **Why This Happens:**

    - :bash:`grep` returns exit code `0` if matches are found
    - :bash:`grep` returns exit code `1` if no matches are found (this is standard behavior)
    - :bash:`grep` returns exit code `> 1` for actual errors (e.g., file not found)

    **Stay safe out there!** This pattern applies to other commands that use exit codes to indicate "not found" vs "error" conditions.

.. dropdown:: Use Docker containers with SHA256 instead of tags
    :color: info
    :animate: fade-in

    - The running environment and required scripts should be encapsulated in a docker image.
    - The image should be pushed to `Docker Hub `_ and have a versioned Dockerfile. JAWS will pull images from there by default.
    - We recommend that a docker container be specified for every task; if not, the default container is Ubuntu.
    - It is recommended to reference containers by their SHA256 instead of a tag (e.g. `doejgi/bbtools@sha256:64088..` instead of `doejgi/bbtools:latest`) for reproducibility, because a container can change and still have the same tag.

    .. dropdown:: SHA Example
        :color: info
        :animate: fade-in

        .. code-block:: text

            # call-caching will not work
            runtime {
                docker: "ubuntu:20.04"
            }

            # call-caching will work
            runtime {
                docker: "ubuntu@sha256:47f14534bda344d9fe6ffd6effb95eefe579f4be0d508b7445cf77f61a0e5724"
            }

            # find the sha
            docker pull ubuntu:20.04
            Digest: sha256:47f14534bda344d9fe6ffd6effb95eefe579f4be0d508b7445cf77f61a0e5724

            # or
            docker inspect --format='{{.RepoDigests}}' ubuntu:20.04
            ubuntu@sha256:47f14534bda344d9fe6ffd6effb95eefe579f4be0d508b7445cf77f61a0e5724

.. dropdown:: Avoid hard-coding paths in the WDL
    :color: info
    :animate: fade-in

    Paths to files or directories should be put into the `inputs.json` file, not the WDL. The exception to this rule is docker images, which *should* be hard-coded so that the WDL records the version of the docker container.

.. dropdown:: WDL tasks should be self-sufficient
    :color: info
    :animate: fade-in

    1. Think of the WDL task as a wrapper script that should be able to run independently of the pipeline. This means that a script should explicitly list all required input files as arguments and not assume that some input files already exist in the current working directory.
    2. Scripts should also take output files as arguments and shouldn't write them anywhere other than the current working directory if they will be needed by the next task. These rules make writing the WDL trivial.
    3. The WDL should handle only minimal logic. Use wrapper scripts to deal with logic if need be.
    4. Scripts should return an exit code of :bash:`0` on success and should write nothing but errors to :bash:`stderr`. Cromwell depends on seeing a return code of :bash:`0` on success, and JAWS depends on seeing errors written to :bash:`stderr`. Sometimes scripts write errors to stdout, and these will be missed if you look for errors in the :bash:`errors.json` supplementary files created by JAWS.

    .. dropdown:: Example
        :color: info
        :animate: fade-in

        .. code-block:: text

            # This explicitly lists all input files and the output file.
            filterFastq.py in=${fastq} ref=${refdata} huseq=${hu_fasta} out=myout

            # This script expects the files to exist implicitly
            filterFastq.py ref=${refdata}

.. dropdown:: Use subworkflows
    :color: info
    :animate: fade-in

    Consider using subworkflows if organizing tasks that way makes the main workflow more understandable, reusable, and maintainable. Even a single task can be its own workflow.

    Subworkflows are imported and used as if they were normal tasks. See the example below, which was copied from the `Cromwell documentation `_.

    .. dropdown:: Example
        :color: info
        :animate: fade-in

        .. code-block:: text

            # main.wdl
            import "sub_wdl.wdl" as sub

            workflow main_workflow {

                call sub.hello_and_goodbye { input: hello_and_goodbye_input = "sub world" }

                # call myTask { input: hello_and_goodbye.hello_output }

                output {
                    String main_output = hello_and_goodbye.hello_output
                }
            }

        .. code-block:: text

            # sub_wdl.wdl
            workflow hello_and_goodbye {
                String hello_and_goodbye_input

                call hello { input: addressee = hello_and_goodbye_input }
                call goodbye { input: addressee = hello_and_goodbye_input }

                output {
                    String hello_output = hello.salutation
                    String goodbye_output = goodbye.salutation
                }
            }

            task hello {
                String addressee
                command <<<
                    echo "Hello ${addressee}!"
                >>>
                output {
                    String salutation = read_string(stdout())
                }
            }

            task goodbye {
                String addressee
                command <<<
                    echo "Goodbye ${addressee}!"
                >>>
                output {
                    String salutation = read_string(stdout())
                }
            }

.. dropdown:: Documenting your WDLs
    :color: info
    :animate: fade-in

    The best way to document your WDLs is with a README.md in the same repository as the WDL. Adding "metadata" sections to the WDL is also best practice, because it keeps relevant information such as author and contact info inside the WDL itself. See the WDL template as an example.

.. dropdown:: Build Docker Images Through CI/CD
    :color: info
    :animate: fade-in

    Do you want docker images to be rebuilt every time you push to a GitLab repository?
    Here is an example of how to set up a pipeline so docker images can be automatically built and pushed to `hub.docker.com `_ every time you make a change to the repo code. Details can be found in the `readme `_.

.. dropdown:: Install `bash` to Docker Image
    :color: info
    :animate: fade-in

    `Bash `_ is required in your Docker image for running JAWS. If your base Docker image is Ubuntu, Bash is already available. However, the Alpine Docker image does not have bash installed by default. You will need to add the following command to get `bash`:

    .. code-block:: text

        RUN apk update && apk add bash

.. dropdown:: Requirements for call-caching
    :color: info
    :animate: fade-in

    Here are some reasons why call-caching may not have worked...

    * Call caching requires consistency in the inputs of the task. Make sure there were no changes to a task's :bash:`input{}` section, including any variable values.
    * Call caching may have failed if your files are passed in as :bash:`String` rather than :bash:`File` inputs. The hashes of two identical files stored in different locations are the same, but the hashes of the :bash:`String` values for the two locations differ, even though the contents of the files are the same.
    * Call caching also requires consistency in the :bash:`output{}` section of the task. If the content of an output file changed but its name stayed the same (i.e. when outputs are non-deterministic), call-caching will still happen.
    * Call caching also requires consistency in the :bash:`command{}` section of the task.
    * Changing hard-coded :bash:`runtime{}` values will not prevent call-caching, except for docker (and ContinueOnReturnCode and FailOnStderr, but these are not accepted in JAWS runtimes). Remember that changing runtime variables such as memory or cpu through task inputs will break call caching, since this is registered as a change to the inputs of the task.
    * Don't use conditional statements in a task except within the :bash:`command{}` section. It's not the conditional itself, but the cromwell variables used within the conditional statement that prevent call-caching. Interestingly, miniwdl will cache task_two fine.

    .. dropdown:: Example of bad conditional (see task_two)
        :color: info
        :animate: fade-in

        .. code-block:: text

            version 1.0

            workflow test_call_cache {
                call task_one { }
                call task_two {
                    input: single = task_one.single
                }
            }

            task task_one {
                command <<<
                    echo single reads > single.txt
                >>>
                output {
                    File single = "single.txt"
                }
            }

            task task_two {
                input {
                    File single
                    Boolean isSingleEnd = true
                }

                # this is the offending line. The solution is to put any conditionals in the command{} section.
                String reads_input_flag = if(isSingleEnd) then "-U ~{single}" else "-1 ~{single}"

                command <<<
                    # this commented code is the fix. Compare against the string "true",
                    # because [[ false ]] is also true in bash (any non-empty string is).
                    # With the fix, echo the bash variable $reads_input_flag instead.
                    # if [[ "~{isSingleEnd}" == "true" ]]; then
                    #     reads_input_flag="-U ~{single}"
                    # else
                    #     reads_input_flag="-1 ~{single}"
                    # fi
                    echo ~{reads_input_flag}
                >>>
            }

    References

    * `cromwell docs on call-caching `_

Templates
----------

.. dropdown:: WDL Best Practices Template
    :color: info
    :animate: fade-in

    .. code-block:: text

        # By versioning your WDL, you specify which specification cromwell uses to decipher the WDL.
        # New features come with new versions.
        version 1.0

        # import any subworkflows
        import "subworkflow.wdl" as firstStep

        workflow bbtools {
            meta {
                authors: [
                    {
                        name: "Jackson Brown",
                        email: "jbrown@my-inst",
                        organization: "JGI"
                    }
                ]
                version: "2222.2.0"
                notes: "this is the official release version"
            }

            # you must have this input section within the "workflow" stanza if you are using version 1
            input {
                File reads
                File ref
                String docker_image = "jfroula/bbtools@sha256:cf560d21149237feff9210b0cd831dcc532ebdccaaa3f5ede52369f45a23e768"
            }

            call firstStep {
                input: fastq=reads,
                       docker_image=docker_image
            }

            call alignment {
                input: fastq=reads,
                       fasta=ref,
                       docker_image=docker_image
            }

            call samtools {
                input: sam=alignment.sam
            }
        }

        #
        # below are task definitions
        #
        task alignment {
            # Metadata is good for helping the next person understand your code.
            # This meta section can also be used for documentation generated by wdl-aid.
            # You can run "wdl-aid" if it is installed (see https://wdl-aid.readthedocs.io/en/latest/usage.html)
            meta {
                metaParameter1: "Some meta Data I"
                metaParameter2: "Some meta Data II"
                description: "Add a brief description of what this task does in this optional block. One can add as much text as one wants in this section to help an outsider understand the mechanics of this task."
            }

            input {
                File fastq
                File fasta
                String docker_image
            }

            command <<<
                # Use this command to help debug your bash code (i.e. prevents hidden bugs).
                # For a description, see https://gist.github.com/mohanpedala/1e2ff5661761d3abd0385e8223e16425
                set -euo pipefail

                # Note that ~{} is preferred over the old ${} syntax
                bbmap.sh in=~{fastq} ref=~{fasta} out=test.sam
            >>>

            runtime {
                docker: docker_image
                cpu: 8
                memory: "5GB"
                runtime_minutes: 120
            }

            output {
                File sam = "test.sam"
            }

            # This section is optional and used to create documentation using the wdl-aid tool.
            # see https://wdl-aid.readthedocs.io/en/latest/usage.html
            # You can run "wdl-aid" if it is installed.
            parameter_meta {
                fastq: {description: "file containing reads", category: "required"}
                fasta: {description: "file containing reference sequences", category: "required"}
                docker_image: {description: "docker image containing BBTools", category: "required"}
            }
        }

.. dropdown:: Dockerfile template
    :color: info
    :animate: fade-in

    .. code-block:: text

        FROM ubuntu:22.04

        # install stuff with apt-get
        RUN apt-get update && apt-get install -y wget bzip2 \
            && rm -rf /var/lib/apt/lists/*

        # install miniconda
        # There is a good reason to install miniconda in a path other than its default.
        # The default installation directory is /root/miniconda3 but this path will not be
        # accessible by shifter or singularity so we'll install under /usr/local/bin/miniconda3.
        ENV CONDAPATH=/usr/local/bin/miniconda3
        RUN wget https://repo.continuum.io/miniconda/Miniconda3-4.5.11-Linux-x86_64.sh \
            && bash ./Miniconda3*.sh -b -p $CONDAPATH \
            && rm Miniconda3*.sh

        # point to all the future conda installations you are going to do
        ENV PATH=$CONDAPATH/bin:$PATH

        # Install stuff with conda
        # Remember to pin versions of everything you install with conda as shown in this example.
        # The -y flag answers the confirmation prompt, which cannot be answered during a build.
        RUN conda install -y -c bioconda bbmap==38.84 samtools==1.11 && conda clean -afy

        # copy bash/python scripts specific to your pipeline
        COPY scripts/* /usr/local/bin/

Additional helpful notes when building Docker images:
-----------------------------------------------------

* The Dockerfile template uses the strategy of installing miniconda so you can use :bash:`conda install` for most of your tools. However, :bash:`pip install` and :bash:`apt-get install` work in addition to, or instead of, miniconda.
* Remember to pin versions of everything you install with conda, as shown in the Dockerfile template above.
* There is a good reason to install miniconda in a path other than its default. The default installation directory is :bash:`/root/miniconda3` but this path will not be accessible by shifter or singularity.
* When you build your docker image (e.g. :bash:`docker build --tag <image> -f ./Dockerfile3 .`), all files in the current directory (and sub-directories) are transferred to the local docker daemon. This transfer step can be time consuming. Therefore, docker builds should be performed in a directory without extraneous files.
* One helpful thing you can do when developing docker images is to create a bare-essentials image with your favorite editor installed (e.g. vim). Then you can enter the container interactively with :bash:`docker run -it <image>`, check that you can install things manually, and copy those same commands into the final Dockerfile.

For more, see the official docker docs on `best practices `_.
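
As a rough illustration of the advice above to build in a directory without extraneous files, the following shell sketch stages only the files an image needs into a fresh temporary directory, which can then serve as the build context. The function name :bash:`stage_build_context` and the assumed project layout (a :bash:`Dockerfile` plus an optional :bash:`scripts` directory, as in the Dockerfile template) are hypothetical choices made for this example; they are not part of JAWS or Docker.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: copy only the Dockerfile and the scripts/ directory
# from a project directory into a new, otherwise empty build context.
# Prints the path of the staged context directory on stdout.
stage_build_context() {
    local project_dir="$1"
    local context_dir
    context_dir="$(mktemp -d)"

    # The Dockerfile is required; cp fails (and set -e aborts) if it is missing.
    cp "${project_dir}/Dockerfile" "${context_dir}/"

    # The scripts directory is optional in this sketch.
    if [ -d "${project_dir}/scripts" ]; then
        cp -r "${project_dir}/scripts" "${context_dir}/"
    fi

    echo "${context_dir}"
}
```

A build could then be run as :bash:`docker build --tag <image> "$(stage_build_context .)"`, so that large data files sitting next to the Dockerfile are never sent to the docker daemon. A :bash:`.dockerignore` file is another common way to keep extraneous files out of the build context.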