Using Reference Data in Your WDLs

Overview

JAWS provides a centralized reference data system that allows you to store and access large, reusable datasets (such as BLAST databases, genome references, or annotation files) without copying them into every workflow submission. This approach:

  • Saves time: No need to upload or stage large files repeatedly

  • Reduces storage: Single copy shared across teams and users

  • Ensures consistency: All JAWS compute sites access the same synchronized data

  • Improves reproducibility: Reference data versions are stable and accessible across sites

Reference data is stored on NERSC Perlmutter and automatically synchronized to all JAWS execution sites. Inside your WDL workflows, reference data is accessible via the /refdata mount point.

Before You Start

Understanding Project Paths and Groups

Your project determines which base path you use, the compute sites where your data will sync, and the required Linux group ownership. Identify your project below:

Project    Refdata Base Path (<REFDATA>)                Syncs to Sites                     Linux Group
JGI        /global/dna/shared/databases/jaws/refdata    tahoma, dori, jgi, crux, defiant   genome
NMDC       /global/cfs/cdirs/m3408/refdata              nmdc_tahoma                        m3408
KBase      /global/dna/kbase/reference/jaws             jgi                                kbase
SuperBio   /global/cfs/cdirs/m4731/refdata              NA                                 slacjaw
RNome      /global/cfs/cdirs/m5243/refdata              NA                                 m5243

Note

Throughout this guide, <REFDATA> refers to your project’s base path from the table above.

Prerequisites

Before adding reference data, ensure you have:

  • An account on NERSC Perlmutter

  • Membership in the appropriate Linux group (see table above)

  • SSH access to NERSC DTN (Data Transfer Nodes)
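The group-membership prerequisite can be checked from any NERSC shell before you start. The following is a minimal sketch; the function name in_group is illustrative and not part of JAWS.

```shell
#!/usr/bin/env bash
# Sketch: check whether the current user belongs to a given Linux group.
# in_group is a hypothetical helper name, not a JAWS command.

in_group() {
    local group="$1"
    # id -nG prints the names of every group the current user belongs to
    id -nG | tr ' ' '\n' | grep -qxF -- "$group"
}

# Example (replace m3408 with the Linux group for your project):
# in_group m3408 && echo "membership OK" || echo "not a member yet"
```

If the check fails, request membership in the group before attempting the setup steps, since chgrp only succeeds for groups you belong to.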

Initial Setup: Creating Your Reference Data Folder

This is a one-time setup to establish your group’s folder within the project refdata directory.

Step 1: Log into a DTN Node

Refdata directories are read-only on compute nodes. You must use a Data Transfer Node (DTN) to create folders and copy files.

ssh <USERNAME>@dtn01.nersc.gov

Any of the hosts dtn01 through dtn04 can be used.

More details on DTN access: NERSC DTN Access

Step 2: Create Your Group Folder

Create a subdirectory under your project’s refdata path:

mkdir <REFDATA>/<your_group_name>

Example for JGI users:

mkdir /global/dna/shared/databases/jaws/refdata/myteam_references_dataset

Step 3: Set Correct Group Ownership

Ensure your folder belongs to the correct Linux group:

chgrp <GROUP> <REFDATA>/<your_group_name>

Example:

chgrp genome /global/dna/shared/databases/jaws/refdata/myteam_references_dataset

Step 4: Set Permissions

Set directory permissions to allow group read/write access:

chmod 770 <REFDATA>/<your_group_name>

This ensures:

  • Owner and group can read, write, and traverse the directory

  • Others have no access

Verify permissions:

ls -ld <REFDATA>/<your_group_name>

You should see something like: drwxrws--- (the s appears when the setgid bit is set on the directory; drwxrwx--- is also fine)
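Steps 2 through 4 can be combined into one small helper. This is a sketch only: setup_group_dir is a hypothetical name, and mode 770 is an assumption chosen to match the stated goal of owner and group read/write access with no access for others.

```shell
#!/usr/bin/env bash
# Sketch: create a group folder, set its group ownership and permissions, then show the result.
# setup_group_dir is an illustrative helper, not a JAWS command.

setup_group_dir() {
    local refdata="$1" name="$2" group="$3"
    local dir="$refdata/$name"

    mkdir -p "$dir"      || return 1   # -p makes the step safe to repeat
    chgrp "$group" "$dir" || return 1   # requires membership in that group
    chmod 770 "$dir"      || return 1   # assumed mode: rwx for owner and group, none for others
    ls -ld "$dir"                       # print the resulting permissions for a visual check
}

# Example, run on a DTN:
# setup_group_dir /global/cfs/cdirs/m3408/refdata myteam_references_dataset m3408
```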

Warning

Files and directories must be readable by the JAWS service account. At minimum, ensure group read permissions (440 for files, 550 for directories).

Adding and Managing Data

Step 1: Copy Your Data to the DTN

From a DTN, copy your data into <REFDATA>/<your_group_name>/. You can copy data from several locations on NERSC or from external sources. NERSC locations include:

  • $HOME (your home directory)

  • /global/cfs (Community File System)

  • /global/pscratch (Perlmutter scratch)
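One way to perform the copy is sketched below. The helper name copy_to_refdata is hypothetical; cp -rL is used so that any symbolic links in the source are replaced by the files they point to, which matters because symlinks are not synced to remote sites.

```shell
#!/usr/bin/env bash
# Sketch: copy a file or directory into a refdata group folder, dereferencing symlinks.
# copy_to_refdata is an illustrative helper, not a JAWS command.

copy_to_refdata() {
    local src="$1" dest_dir="$2"

    [ -e "$src" ]      || { echo "Source not found: $src" >&2; return 1; }
    [ -d "$dest_dir" ] || { echo "Destination folder not found: $dest_dir" >&2; return 1; }

    # -r copies directories recursively; -L follows symlinks so real files are copied
    cp -rL -- "$src" "$dest_dir"/
}

# Example, run on a DTN (paths are placeholders):
# copy_to_refdata /global/pscratch/path/to/blast_dbs "<REFDATA>/<your_group_name>"
```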

Step 2: Verify File Permissions

Ensure all files have correct group ownership and are readable by the JAWS service account:

ls -l <REFDATA>/<your_group_name>/

Expected file permissions: -r--r----- (440) or better.

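If needed, recursively fix ownership and permissions:

chgrp -R <GROUP> <REFDATA>/<your_group_name>/
chmod -R u+rX,g+rX,o-rwx <REFDATA>/<your_group_name>/

The capital X adds execute (traverse) permission to directories, and to files that are already executable, so subdirectories stay accessible. Do not apply chmod -R 440 to a tree that contains subdirectories: it removes their execute bit and makes their contents unreachable.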

Warning

Symlinks are not supported. Do not use symbolic links (e.g., latest -> v10.4). Symlinks will not sync to remote sites. Always copy the actual files.
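Before writing the manifest, it can help to scan the folder for the two problems described above: symbolic links and entries that lack group read access. The sketch below uses find with octal permission tests; find_sync_problems and its output labels are illustrative, not part of JAWS.

```shell
#!/usr/bin/env bash
# Sketch: list entries under a directory that are likely to cause sync problems.
# find_sync_problems and the labels it prints are hypothetical.

find_sync_problems() {
    local dir="$1"

    # Symbolic links are not synced to remote sites
    find "$dir" -type l -exec printf 'SYMLINK %s\n' {} \;

    # Regular files missing the group-read bit (octal 040)
    find "$dir" -type f ! -perm -040 -exec printf 'NOT_GROUP_READABLE %s\n' {} \;

    # Directories missing group read or traverse bits (octal 050)
    find "$dir" -type d ! -perm -050 -exec printf 'DIR_NOT_GROUP_ACCESSIBLE %s\n' {} \;
}

# Example:
# find_sync_problems "<REFDATA>/<your_group_name>"
```

No output means none of these checks found a problem. The scan inspects permission bits only; it does not confirm which group owns the files.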

Step 3: Create or Update the Manifest File

To trigger synchronization to all JAWS sites, you must create a manifest file that lists the data to sync.

Important

Manifest file naming convention: <USERNAME>_changes.txt

Example: dcassol_changes.txt

Create your manifest file in the refdata root:

vim <REFDATA>/<USERNAME>_changes.txt

Add full paths to your data (one path per line):

/global/dna/shared/databases/jaws/refdata/myteam_references_dataset/hg38.fa
/global/dna/shared/databases/jaws/refdata/myteam_references_dataset/hg38.fa.fai
/global/dna/shared/databases/jaws/refdata/myteam_references_dataset/annotations/

Note

  • You can specify individual files or entire directories

  • All paths must be absolute paths

  • Paths must exist and be readable by the JAWS service account

  • Adding a directory path will sync all contents recursively
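The rules in the note above can be checked locally before the daemon picks up the manifest. This is a minimal sketch with a hypothetical helper name, validate_manifest. It tests readability for the current user only, so it cannot prove that the JAWS service account can read the path.

```shell
#!/usr/bin/env bash
# Sketch: check each manifest line against the rules: absolute, existing, readable.
# validate_manifest is an illustrative helper; readability is tested for the current
# user, which is only an approximation of the JAWS service account's access.

validate_manifest() {
    local manifest="$1" path status=0

    while IFS= read -r path || [ -n "$path" ]; do
        [ -z "$path" ] && continue                 # skip blank lines
        case "$path" in
            /*) ;;                                   # absolute path, keep checking
            *)  echo "NOT ABSOLUTE: $path"; status=1; continue ;;
        esac
        if [ ! -e "$path" ]; then
            echo "MISSING: $path"; status=1
        elif [ ! -r "$path" ]; then
            echo "NOT READABLE: $path"; status=1
        fi
    done < "$manifest"

    return "$status"
}

# Example:
# validate_manifest "<REFDATA>/<USERNAME>_changes.txt" && echo "manifest looks valid"
```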

Best Practice: Clear the manifest after each sync

If you add new paths to an existing manifest, Globus will re-validate all listed paths (old and new). Although Globus won’t re-copy unchanged files, listing them causes unnecessary processing. To avoid confusion, clear the manifest after successful sync, then add only new paths for future syncs.

Step 4: Wait for Sync to Complete

A background daemon checks for manifest changes every 20 minutes. When your manifest is detected:

  1. The daemon validates all paths (must be absolute, existing, and readable)

  2. Paths are submitted to the Globus API for transfer

  3. Globus copies files to all remote JAWS sites

  4. Status logs are written to <REFDATA>/log/<USERNAME>_changes.txt_<SITE>.complete (or .failed if the transfer fails)

Typical sync time: Transfers usually start within 20 minutes of a manifest change. Completion time depends on data size and network speed.

Step 5: Check Transfer Status

After transfer completion, look for status files in the log directory:

Success:

ls <REFDATA>/log/<USERNAME>_changes.txt_<SITE>.complete

Failure:

ls <REFDATA>/log/<USERNAME>_changes.txt_<SITE>.failed

Hint

If a transfer fails, check the corresponding log file for error details. Common issues include incorrect permissions, missing files, or invalid paths. The JAWS support team is automatically alerted to failures. However, please contact JAWS support via Slack with details of the failure to expedite troubleshooting.
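When data syncs to several sites, checking each status file by hand is tedious. The sketch below summarizes every status file for one user. The helper name sync_status is hypothetical, and the glob deliberately matches any separator between the manifest name and the site name, so it does not depend on the exact file naming.

```shell
#!/usr/bin/env bash
# Sketch: summarize .complete and .failed status files for one user's manifest.
# sync_status is an illustrative helper, not a JAWS command.

sync_status() {
    local log_dir="$1" user="$2" f found=1

    for f in "$log_dir/${user}_changes.txt"*; do
        case "$f" in
            *.complete) echo "COMPLETE: ${f##*/}"; found=0 ;;
            *.failed)   echo "FAILED: ${f##*/}";   found=0 ;;
        esac
    done

    return "$found"    # non-zero when no status file exists yet
}

# Example:
# sync_status "<REFDATA>/log" "$USER" || echo "no status files yet; the transfer may still be running"
```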

Removing Files

To remove reference data from all JAWS sites:

Step 1: Delete from Source

On the DTN, delete files from your group folder:

rm <REFDATA>/<your_group_name>/old_database/*

Step 2: Wait for Monthly Sync

File deletions propagate automatically during the monthly full sync. You do not need to update your manifest file or take any additional action.

Warning

Deleted files will remain accessible on remote sites until the next monthly sync completes.

How Synchronization Works

Understanding the automation behind refdata sync can help you troubleshoot issues and plan your data management.

Background Daemons

  1. Manifest Monitor Daemon: Runs every 20 minutes

    • Detects new or modified <USERNAME>_changes.txt files

    • Validates that all listed paths are:

      • Absolute paths (start with /)

      • Existing on the filesystem

      • Readable by the JAWS service account

    • Submits validated paths to the Globus API for transfer

  2. Transfer Monitor Daemon: Runs continuously

    • Monitors active Globus transfers via API

    • Writes progress to log files

    • Updates ElasticSearch for dashboard visibility (future feature)

  3. Alerts:

    • JAWS support team is notified of all transfer failures

Monthly Full Sync

Once per month, Globus performs a full directory sync from Perlmutter to all remote sites. This sync:

  • Ensures all sites have identical copies

  • Deletes files at destination sites that no longer exist at the source

This is how file removals propagate (see previous section).

Using Reference Data in WDLs

Accessing Reference Data Paths

Inside your WDL workflows, reference data is mounted at /refdata. The structure mirrors your group folder on Perlmutter:

  • On Perlmutter: <REFDATA>/<your_group_name>/database.fa

  • In WDL container: /refdata/<your_group_name>/database.fa
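The mapping between the two paths is a simple prefix swap, which the following sketch makes explicit. The helper name to_container_path is hypothetical; it is meant for preparing inputs JSON values from paths you see on Perlmutter.

```shell
#!/usr/bin/env bash
# Sketch: translate a Perlmutter refdata path into the path seen inside the WDL container.
# to_container_path is an illustrative helper, not a JAWS command.

to_container_path() {
    local refdata="${1%/}" path="$2"    # strip one trailing slash from the base path

    case "$path" in
        "$refdata"/*)
            # Replace the project base path with the /refdata mount point
            printf '/refdata/%s\n' "${path#"$refdata"/}"
            ;;
        *)
            echo "Path is not under $refdata: $path" >&2
            return 1
            ;;
    esac
}

# Example:
# to_container_path /global/cfs/cdirs/m3408/refdata \
#     /global/cfs/cdirs/m3408/refdata/myteam/database.fa
# prints /refdata/myteam/database.fa
```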

Critical: Use String Type, Not File

Warning

Always declare reference data paths as String, not File.

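WDL variables declared as File are staged into Cromwell’s execution directory before the task runs. Since /refdata does not exist outside the container, Cromwell will fail to validate the path during input processing. A String is passed through unchanged and is only resolved inside the container, where /refdata is mounted.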

Correct:

String reference_db = "/refdata/<your_group_name>/hg38.fa"

Incorrect:

File reference_db = "/refdata/<your_group_name>/hg38.fa"  # Will fail!

Example WDL: BLAST Against Reference Database

This example demonstrates a realistic use case: running BLAST against a reference database stored in /refdata.

WDL (blast_workflow.wdl):

version 1.0

workflow blast_search {
  input {
    File query_sequences
    String blast_db_path
    String blast_db_name
    Int num_threads = 4
  }

  call run_blastn {
    input:
      query = query_sequences,
      db_path = blast_db_path,
      db_name = blast_db_name,
      threads = num_threads
  }

  output {
    File blast_results = run_blastn.results
    File blast_log = run_blastn.log
  }
}

task run_blastn {
  input {
    File query
    String db_path      # Path to refdata directory (e.g., /refdata/blast_dbs)
    String db_name      # Database name (e.g., nt)
    Int threads
  }

  command <<<
    set -euo pipefail

    # The database path is mounted at /refdata inside the container
    # Construct full database path: /refdata/blast_dbs/nt
    DB_FULL_PATH="~{db_path}/~{db_name}"

    echo "Running BLASTN against ${DB_FULL_PATH}" > blast.log
    echo "Query file: ~{query}" >> blast.log

    blastn \
      -query "~{query}" \
      -db "${DB_FULL_PATH}" \
      -out results.txt \
      -outfmt 6 \
      -num_threads ~{threads} \
      -max_target_seqs 10

    echo "BLASTN completed successfully" >> blast.log
  >>>

  runtime {
    docker: "ncbi/blast:latest"
    cpu: threads
    memory: "8 GB"
    runtime_minutes: 120
  }

  output {
    File results = "results.txt"
    File log = "blast.log"
  }
}

Inputs JSON (blast_inputs.json):

{
  "blast_search.query_sequences": "/path/to/my_sequences.fasta",
  "blast_search.blast_db_path": "/refdata/blast_dbs",
  "blast_search.blast_db_name": "nt",
  "blast_search.num_threads": 8
}

Key points:

  • blast_db_path is declared as String (not File)

  • The path /refdata/blast_dbs/nt refers to the BLAST database index files

  • Inside the container, /refdata is automatically mounted by Cromwell

  • The actual files on Perlmutter are in <REFDATA>/blast_dbs/nt.*
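Because a String path is not validated before the task starts, a missing or unsynced database only shows up when the tool itself fails. A guard at the top of the command section can give a clearer error. The sketch below is one possible approach; require_refdata_prefix is a hypothetical helper, and it assumes the database consists of files named <prefix>.<something>, as in the nt.* example above.

```shell
#!/usr/bin/env bash
# Sketch: fail early if no file matches <prefix>.* (for example /refdata/blast_dbs/nt.*).
# require_refdata_prefix is an illustrative helper, not part of JAWS or BLAST.

require_refdata_prefix() {
    local prefix="$1"

    # Expand the glob into the positional parameters of this function.
    # If nothing matches, the pattern stays literal and the -e test below fails.
    set -- "$prefix".*
    if [ ! -e "$1" ]; then
        echo "No reference files found matching ${prefix}.*" >&2
        return 1
    fi
}

# Example inside a WDL command section, before calling blastn:
# require_refdata_prefix "${DB_FULL_PATH}" || exit 1
```

If the guard fails at a site where the data should exist, check the transfer status files in the log directory before rerunning.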

Troubleshooting

Common Issues

Issue: “Path does not exist” error during manifest validation

  • Cause: Path listed in manifest file doesn’t exist or has a typo

  • Solution: Verify all paths with ls and ensure absolute paths are used

Issue: “Permission denied” error during sync

  • Cause: Files not readable by JAWS service account

  • Solution: Fix permissions with chmod -R u+rX,g+rX,o-rwx <REFDATA>/<your_group_name>/ (the capital X keeps subdirectories traversable) and ensure the correct group with chgrp -R <GROUP> <REFDATA>/<your_group_name>/

Issue: Transfer never starts

  • Cause: Manifest file not detected or malformed

  • Solution: Ensure manifest file is named <USERNAME>_changes.txt and is in the <REFDATA> root directory. Wait at least 20 minutes after editing.

Issue: Files not appearing in WDL at runtime

  • Cause: Files may not have synced to the execution site yet

  • Solution: Check <REFDATA>/log/ for transfer status. Wait for sync completion before running workflows.

Issue: Symlinks not working in WDL

  • Cause: Symlinks are not supported

  • Solution: Copy actual files instead of creating symlinks

Getting Help

If you encounter issues:

  1. Check transfer logs in <REFDATA>/log/

  2. Verify permissions with ls -l

  3. Contact JAWS support via Slack with details of the issue and any relevant log file contents for faster troubleshooting.