Using Reference Data in Your WDLs

JAWS provides a dedicated location to store large, reusable files (e.g., NR databases) that won’t be copied every time a WDL is submitted. This saves time and resources while ensuring data availability across all JAWS sites.

Creating Your Reference Data Folder

On Perlmutter, you can create a reference data folder based on your project. Ensure the appropriate Linux group and permissions are set for your files.

Users    Refdata Path                                  Sync Site Locations    Linux Group
JGI      /global/dna/shared/databases/jaws/refdata/    tahoma, dori, jgi      genome
NMDC     /global/cfs/cdirs/m3408/refdata/              nmdc_tahoma            m3408
KBase    /global/dna/kbase/reference/jaws/             jgi                    kbase

  • Files must belong to the correct Linux group with minimum permissions of 440 (read-only for group and owner).

  • Create a folder under the appropriate project path, e.g., <PROJECT REFDATA>/<your-group-name>. To do this, log into a DTN node (e.g., ssh dtn04), because these paths are read-only on the compute nodes.

  • When files are added to these folders, they will automatically sync to all associated JAWS sites.

  • You can access these files within your WDL using the path: /refdata/<your-group-name>.
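The group and permission rule above can be checked before relying on the sync. The following is a minimal sketch, not part of JAWS: the helper names `has_min_permissions` and `belongs_to_group` are hypothetical, and the demo runs on a throwaway temporary file rather than a real refdata path.

```python
import grp
import os
import stat
import tempfile

MIN_MODE = 0o440  # owner-read and group-read, the minimum stated above


def has_min_permissions(path, required=MIN_MODE):
    """Return True if every permission bit in `required` is set on `path`."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    return (mode & required) == required


def belongs_to_group(path, group):
    """Return True if the Linux group that owns `path` is named `group`."""
    try:
        return grp.getgrgid(os.stat(path).st_gid).gr_name == group
    except KeyError:  # the gid has no entry in the group database
        return False


if __name__ == "__main__":
    # Demonstrate on a temporary file instead of a real refdata file.
    with tempfile.TemporaryDirectory() as tmp:
        sample = os.path.join(tmp, "tiny.fastq")
        open(sample, "w").close()
        os.chmod(sample, 0o440)
        print(has_min_permissions(sample))  # True
        os.chmod(sample, 0o400)
        print(has_min_permissions(sample))  # False: the group cannot read
```

In practice you would call both helpers on each file in your group folder, passing the Linux group listed for your project.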

Adding Data to refdata Directory

1. Accessing the Directory: Log into a DTN node (e.g., ssh dtn04) to create and manage your folders and files, as compute nodes are read-only.

2. Setting Permissions: Ensure your folders and files are readable by the jaws service account user. Recommended folder permissions: drwxrwsr-x+. Check the group and permissions with ls -l, and set them with chgrp and chmod.

3. Copying Data: You can copy data from:

  • Global home

  • Global common

  • Community File System (CFS)

Note: /pscratch is not accessible directly unless using Globus.

4. Using Globus for Fast Transfers: Globus is the fastest and most reliable method for copying large data. Use the NERSC Perlmutter ==> NERSC DTN endpoints. Globus can read data from /pscratch.

5. NO Symlinks: Symlinks (e.g., latest -> v10.4) are not supported and will not be maintained across sites. Always copy the actual files.
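Because symlinks are not carried across sites, it can help to scan a folder for them before listing it in a manifest. This is a minimal sketch; `find_symlinks` is a hypothetical helper, not a JAWS tool, and the demo builds its own temporary directory that mirrors the latest -> v10.4 example.

```python
import os
import tempfile


def find_symlinks(root):
    """Return the sorted paths of all symlinks (to files or folders) under `root`."""
    links = []
    # followlinks=False: linked folders are reported but never descended into.
    for dirpath, dirnames, filenames in os.walk(root, followlinks=False):
        for name in dirnames + filenames:
            full = os.path.join(dirpath, name)
            if os.path.islink(full):
                links.append(full)
    return sorted(links)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        os.mkdir(os.path.join(tmp, "v10.4"))
        open(os.path.join(tmp, "v10.4", "db.fasta"), "w").close()
        os.symlink(os.path.join(tmp, "v10.4"), os.path.join(tmp, "latest"))
        for link in find_symlinks(tmp):
            print("symlink found, copy the real files instead:", link)
```

An empty result means the folder contains only real files and directories.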

6. Updating the Manifest File: Besides adding your files to your group folder, you need to copy and paste their full paths into a “manifest file” saved in the refdata root. Create your own manifest file and name it <USER>_changes.txt. Modifying this file triggers Globus to copy your files to all the other JAWS sites. For example:

  • If /<REFDATA>/ekirton/tiny.fastq is added to jfroula_changes.txt, Globus will initiate a copy to all sites within 20 minutes.

  • You can also add folders, e.g., /<REFDATA>/ekirton would copy everything under that folder.

  • If you add new file paths to an existing manifest that still contains previously copied paths, all of them, including the old ones, will be included in the list for copying. However, Globus won’t re-copy a file that already exists at the destination and hasn’t changed. The best practice is to clear the manifest before adding new file paths, to avoid confusion.
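The manifest update can also be scripted. The sketch below assumes the manifest holds one full path per line, which the text above does not state explicitly; `write_manifest` is a hypothetical helper. It overwrites the file each time, following the "clear the manifest first" practice, and uses a temporary directory as a stand-in for the refdata root.

```python
import os
import tempfile


def write_manifest(refdata_root, user, paths):
    """Overwrite <user>_changes.txt in `refdata_root` with the given full paths.

    Assumes one path per line (an assumption, not a documented format).
    """
    for path in paths:
        if not os.path.isabs(path):
            raise ValueError(f"manifest entries must be full paths: {path!r}")
    manifest = os.path.join(refdata_root, f"{user}_changes.txt")
    # Mode "w" truncates the file, so previously copied paths are dropped.
    with open(manifest, "w") as handle:
        for path in paths:
            handle.write(path + "\n")
    return manifest


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as fake_root:  # stand-in for the refdata root
        entry = os.path.join(fake_root, "ekirton", "tiny.fastq")
        manifest = write_manifest(fake_root, "jfroula", [entry])
        print(open(manifest).read(), end="")
```

Rejecting relative paths here catches mistakes before the manifest is modified.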

How it Works

A daemon running in the background checks every 20 minutes for modifications to any <USER>_changes.txt file. If a new file is created or the contents of an existing one change, the daemon validates that the listed paths are:

  1. full paths

  2. existing paths

  3. readable by the jaws service account user.
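The three checks listed above can be approximated locally before you edit a manifest. This is only a sketch, not the daemon's actual code: `check_path` is a hypothetical helper, and readability is tested for the current user with `os.access`, which is merely a stand-in for the jaws service account.

```python
import os
import tempfile


def check_path(path):
    """Return a list of problems for one manifest entry (empty list = looks fine)."""
    problems = []
    if not os.path.isabs(path):
        problems.append("not a full path")
    if not os.path.exists(path):
        problems.append("does not exist")
    elif not os.access(path, os.R_OK):
        # Assumption: checks the *current* user, not the jaws service account.
        problems.append("not readable")
    return problems


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        good = os.path.join(tmp, "tiny.fastq")
        open(good, "w").close()
        for entry in (good, os.path.join(tmp, "missing.fastq"), "no_such_dir/tiny.fastq"):
            print(entry, "->", check_path(entry) or "ok")
```

A path that passes here can still fail for the service account, so the group and permission settings described earlier remain the real safeguard.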

The daemon gathers all the paths from every <USER>_changes.txt file that has been modified and sends them to the Globus API for transfer. A second daemon monitors the transfer via the Globus API and periodically writes its status to a log. This log is monitored by our monitoring system (the ElasticSearch Stack), so you’ll be able to see the status on the JAWS dashboard (not available yet). The JAWS team is also alerted if a transfer fails.

Removing Files

To remove files, just delete them from your group folder. You don’t need to do anything else: once a month, Globus copies all folders and, in doing so, deletes anything on the destination sites that is no longer on the source site.

How to use refdata in your WDLs

Use /refdata in your WDLs as the root. For example, if you wanted to run a blast command in your WDL, you would point to the database like: blastn -db /refdata/nt_test/nt where nt_test is where you saved all the blast index files.
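To work out which path to put in a WDL, strip the project's refdata root from the location where you saved the file and prepend /refdata. The sketch below illustrates that mapping; `to_container_path` is a hypothetical helper, and the example root is the JGI refdata path.

```python
import posixpath


def to_container_path(host_path, refdata_root, mount_point="/refdata"):
    """Translate a path under the refdata root into the path seen inside the WDL task."""
    root = refdata_root.rstrip("/")
    if host_path != root and not host_path.startswith(root + "/"):
        raise ValueError(f"{host_path!r} is not under {refdata_root!r}")
    relative = host_path[len(root):].lstrip("/")
    return posixpath.join(mount_point, relative) if relative else mount_point


if __name__ == "__main__":
    jgi_root = "/global/dna/shared/databases/jaws/refdata/"
    print(to_container_path(jgi_root + "nt_test/nt", jgi_root))  # /refdata/nt_test/nt
```

The printed value is the string you would pass to a command such as blastn -db inside the task.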

Hint

In your WDL, the input type for refdata files should be specified as String and not File. Variables specified with File are copied into Cromwell’s working directory, and since /refdata doesn’t exist outside the container, JAWS will fail to validate the path and you’ll get an error.

Example

WDL Example

version 1.0
workflow refdata_wf {
    call task1 { }
}

task task1 {

    command <<<
      # How to access reference data. The command is being run in a
      # docker container and the path to refdata outside the
      # container is mounted as "/refdata" inside the container. The
      # mounting of which happens in the cromwell config file.

      ls /refdata/nt_test
    >>>

    runtime {
      docker: "ubuntu:latest"
      cpu: 1
      memory: "1G"
    }

    output { String outfile = stdout() }
}