Using Reference Data in Your WDLs

Overview

JAWS provides a centralized reference data system that allows you to store and access large, reusable datasets (such as BLAST databases, genome references, or annotation files) without copying them into every workflow submission. This approach:

Saves time: No need to upload or stage large files repeatedly
Reduces storage: Single copy shared across teams and users
Ensures consistency: All JAWS compute sites access the same synchronized data
Improves reproducibility: Reference data versions are stable and accessible across sites

Reference data is stored on NERSC Perlmutter and automatically synchronized to all JAWS execution sites. Inside your WDL workflows, reference data is accessible via the /refdata mount point.

Before You Start

Understanding Project Paths and Groups

Your project determines which base path you use, the compute sites where your data will sync, and the required Linux group ownership. Identify your project below:

Project	Refdata Base Path (`<REFDATA>`)	Syncs to Sites	Linux Group
JGI	`/global/dna/shared/databases/jaws/refdata`	tahoma, dori, jgi, crux, defiant	`gemone`
NMDC	`/global/cfs/cdirs/m3408/refdata`	nmdc_tahoma	`m3408`
KBase	`/global/dna/kbase/reference/jaws`	jgi	`kbase`
SuperBio	`/global/cfs/cdirs/m4731/refdata`	NA	`slacjaw`
RNome	`/global/cfs/cdirs/m5243/refdata`	NA	`m5243`

Note

Throughout this guide, <REFDATA> refers to your project’s base path from the table above.

Prerequisites

Before adding reference data, ensure you have:

An account on NERSC Perlmutter
Membership in the appropriate Linux group (see table above)
SSH access to NERSC DTN (Data Transfer Nodes)

Initial Setup: Creating Your Reference Data Folder

This is a one-time setup to establish your group’s folder within the project refdata directory.

Step 1: Log into a DTN Node

Refdata directories are read-only on compute nodes. You must use a Data Transfer Node (DTN) to create folders and copy files.

ssh <USERNAME>@dtn01.nersc.gov    # or dtn02, dtn03, dtn04

More details on DTN access: NERSC DTN Access

Step 2: Create Your Group Folder

Create a subdirectory under your project’s refdata path:

mkdir <REFDATA>/<your_group_name>

Example for JGI users:

mkdir /global/dna/shared/databases/jaws/refdata/myteam_references_dataset

Step 3: Set Correct Group Ownership

Ensure your folder belongs to the correct Linux group:

chgrp <GROUP> <REFDATA>/<your_group_name>

Example:

chgrp gemone /global/dna/shared/databases/jaws/refdata/myteam_references_dataset

Step 4: Set Permissions

Set directory permissions to allow group read/write access and to make new files inherit the folder’s group (setgid bit):

chmod 2770 <REFDATA>/<your_group_name>

This ensures:

Owner and group can read and write files
New files and subdirectories inherit the folder’s group
Others have no access

Verify permissions:

ls -ld <REFDATA>/<your_group_name>

You should see something like: drwxrws---

Warning

Files and directories must be readable by the JAWS service account. At minimum, ensure group read permissions (440 for files, 550 for directories).
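The minimum permissions in this warning can be checked before a sync is requested. The following is a minimal sketch, assuming a POSIX shell with find available; the function name audit_refdata_perms is hypothetical and not part of JAWS. It inspects permission bits only, so group ownership still needs to be confirmed separately with ls -l.

```shell
# Hypothetical helper (not part of JAWS): print every entry under a folder
# that falls below the minimum group permissions
# (files need group read; directories need group read and execute).
audit_refdata_perms() {
    dir=$1
    find "$dir" \( -type f ! -perm -040 \) -o \( -type d ! -perm -050 \)
}

# Usage (substitute your own path):
#   audit_refdata_perms "<REFDATA>/<your_group_name>"
# No output means every file and directory meets the minimum.
```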

Adding and Managing Data

Step 1: Copy Your Data to the DTN

From the DTN, copy your data into your group folder under <REFDATA>. Data can come from other locations on NERSC or from external sources. NERSC locations include:

$HOME (your home directory)
/global/cfs (Community File System)
/global/pscratch (Perlmutter scratch)
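As an illustration of the copy step, the sketch below copies a file or directory into an existing group folder and resolves symbolic links into real files along the way (cp -L), because symlinks do not sync to remote sites. The function name copy_into_refdata is hypothetical; for very large datasets a dedicated transfer tool may be a better fit than cp.

```shell
# Hypothetical helper (not part of JAWS): copy SRC into an existing
# destination folder, following symbolic links so that real files,
# not links, land in the refdata folder.
copy_into_refdata() {
    src=$1
    dest_dir=$2
    if [ ! -d "$dest_dir" ]; then
        echo "missing destination: $dest_dir" >&2
        return 1
    fi
    cp -R -L "$src" "$dest_dir"/
}

# Usage (substitute your own paths):
#   copy_into_refdata "$HOME/blast_dbs" "<REFDATA>/<your_group_name>"
```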

Step 2: Verify File Permissions

Ensure all files have correct group ownership and are readable by the JAWS service account:

ls -l <REFDATA>/<your_group_name>/

Expected file permissions: -r--r----- (440) at minimum. Directories need at least dr-xr-x--- (550).

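If needed, recursively fix ownership and permissions. The capital X adds execute (traverse) permission only to directories and to files that are already executable; a plain chmod -R 440 would make subdirectories unreadable:

chgrp -R <GROUP> <REFDATA>/<your_group_name>/
chmod -R g+rX,o-rwx <REFDATA>/<your_group_name>/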

Warning

Symlinks are not supported. Do not use symbolic links (e.g., latest -> v10.4). Symlinks will not sync to remote sites. Always copy the actual files.
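One way to catch stray symlinks before syncing is to list them with find. This is a small sketch; the function name list_symlinks is hypothetical and not part of JAWS.

```shell
# Hypothetical helper (not part of JAWS): print every symbolic link
# found under the given folder.
list_symlinks() {
    find "$1" -type l
}

# Usage (substitute your own path):
#   list_symlinks "<REFDATA>/<your_group_name>"
# No output means the folder contains no symlinks.
```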

Step 3: Create or Update the Manifest File

To trigger synchronization to all JAWS sites, you must create a manifest file that lists the data to sync.

Important

Manifest file naming convention: <USERNAME>_changes.txt

Example: dcassol_changes.txt

Create your manifest file in the refdata root:

vim <REFDATA>/<USERNAME>_changes.txt

Add full paths to your data (one path per line):

/global/dna/shared/databases/jaws/refdata/myteam_references_dataset/hg38.fa
/global/dna/shared/databases/jaws/refdata/myteam_references_dataset/hg38.fa.fai
/global/dna/shared/databases/jaws/refdata/myteam_references_dataset/annotations/

Note

You can specify individual files or entire directories
All paths must be absolute paths
Paths must exist and be readable by the JAWS service account
Adding a directory path will sync all contents recursively
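The rules in this note can be checked locally before the manifest is saved. The sketch below is not the JAWS daemon’s own validation; check_manifest is a hypothetical helper, and it tests readability for the current user, which is only an approximation of what the JAWS service account can read.

```shell
# Hypothetical pre-check (not the JAWS daemon itself): report manifest lines
# that are not absolute, do not exist, or are not readable by the current user.
# Returns 0 when every line passes, 1 otherwise.
check_manifest() {
    manifest=$1
    status=0
    while IFS= read -r path || [ -n "$path" ]; do
        [ -z "$path" ] && continue          # skip blank lines
        case $path in
            /*) ;;                          # absolute path, keep checking
            *) echo "not absolute: $path"; status=1; continue ;;
        esac
        if [ ! -e "$path" ]; then
            echo "missing: $path"; status=1
        elif [ ! -r "$path" ]; then
            echo "not readable: $path"; status=1
        fi
    done < "$manifest"
    return $status
}

# Usage (substitute your own path):
#   check_manifest "<REFDATA>/<USERNAME>_changes.txt"
```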

Best Practice: Clear the manifest after each sync

If you add new paths to an existing manifest, Globus will re-validate all listed paths (old and new). Although Globus won’t re-copy unchanged files, listing them causes unnecessary processing. To avoid confusion, clear the manifest after successful sync, then add only new paths for future syncs.

Step 4: Wait for Sync to Complete

A background daemon checks for manifest changes every 20 minutes. When your manifest is detected:

The daemon validates all paths (must be absolute, existing, and readable)
Paths are submitted to the Globus API for transfer
Globus copies files to all remote JAWS sites
Status files are written to <REFDATA>/log/<USERNAME>_changes.txt_<SITE>.complete on success, or to the same name ending in .failed on failure

Typical sync time: Transfers usually start within 20 minutes of a manifest change. Completion time depends on data size and network speed.

Step 5: Check Transfer Status

After transfer completion, look for status files in the log directory:

Success:

ls <REFDATA>/log/<USERNAME>_changes.txt_<SITE>.complete

Failure:

ls <REFDATA>/log/<USERNAME>_changes.txt_<SITE>.failed
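The two checks above can be combined into a single status lookup per site. This is a sketch under assumptions: sync_status is a hypothetical helper, the marker names follow the <USERNAME>_changes.txt_<SITE> pattern shown in this step, and "pending" is simply a label used here for the case where neither marker exists yet.

```shell
# Hypothetical helper (not part of JAWS): report the sync state of one
# manifest at one site, based on the marker files in the log directory.
sync_status() {
    log_dir=$1      # the log folder under your refdata base path
    user=$2         # your NERSC username
    site=$3         # a site name from the project table
    base="$log_dir/${user}_changes.txt_${site}"
    if [ -e "$base.failed" ]; then
        echo failed
    elif [ -e "$base.complete" ]; then
        echo complete
    else
        echo pending    # assumed label: no marker file written yet
    fi
}

# Usage (substitute your own values):
#   sync_status "<REFDATA>/log" "<USERNAME>" "<SITE>"
```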

Hint

If a transfer fails, check the corresponding log file for error details. Common issues include incorrect permissions, missing files, or invalid paths. The JAWS support team is automatically alerted to failures. However, please contact JAWS support via Slack with details of the failure to expedite troubleshooting.

Removing Files

To remove reference data from all JAWS sites:

Step 1: Delete from Source

On the DTN, delete files from your group folder:

rm <REFDATA>/<your_group_name>/old_database/*

Step 2: Wait for Monthly Sync

File deletions propagate automatically during the monthly full sync. You do not need to update your manifest file or take any additional action.

Warning

Deleted files will remain accessible on remote sites until the next monthly sync completes.

How Synchronization Works

Understanding the automation behind refdata sync can help you troubleshoot issues and plan your data management.

Background Daemons

Manifest Monitor Daemon: Runs every 20 minutes
- Detects new or modified <USERNAME>_changes.txt files
- Validates that all listed paths are:
  - Absolute paths (start with /)
  - Existing on the filesystem
  - Readable by the JAWS service account
- Submits validated paths to the Globus API for transfer
Transfer Monitor Daemon: Runs continuously
- Monitors active Globus transfers via API
- Writes progress to log files
- Updates ElasticSearch for dashboard visibility (future feature)
Alerts:
- JAWS support team is notified of all transfer failures

Monthly Full Sync

Once per month, Globus performs a full directory sync from Perlmutter to all remote sites. This sync:

Ensures all sites have identical copies
Deletes files at destination sites that no longer exist at the source

This is how file removals propagate (see Removing Files).

Using Reference Data in WDLs

Accessing Reference Data Paths

Inside your WDL workflows, reference data is mounted at /refdata. The structure mirrors your group folder on Perlmutter:

On Perlmutter: <REFDATA>/<your_group_name>/database.fa
In WDL container: /refdata/<your_group_name>/database.fa
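The mapping amounts to replacing the project base path with /refdata. A minimal sketch, assuming a POSIX shell; the function name to_container_path is hypothetical and not part of JAWS.

```shell
# Hypothetical helper (not part of JAWS): translate a Perlmutter refdata path
# into the path the same file has inside a task container.
to_container_path() {
    refdata_base=${1%/}     # project base path, trailing slash removed
    host_path=$2
    case $host_path in
        "$refdata_base"/*)
            rel=${host_path#"$refdata_base"/}   # strip the base path prefix
            printf '/refdata/%s\n' "$rel"
            ;;
        *)
            echo "not under $refdata_base: $host_path" >&2
            return 1
            ;;
    esac
}

# Usage (substitute your own paths):
#   to_container_path "<REFDATA>" "<REFDATA>/<your_group_name>/database.fa"
```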

Critical: Use `String` Type, Not `File`

Warning

Always declare reference data paths as String, not File.

WDL variables declared as File are staged into Cromwell’s execution directory. Since /refdata does not exist outside the container, Cromwell will fail to validate the path during input processing.

Correct:

String reference_db = "/refdata/<your_group_name>/hg38.fa"

Incorrect:

File reference_db = "/refdata/<your_group_name>/hg38.fa"  # Will fail!

Example WDL: BLAST Against Reference Database

This example demonstrates a realistic use case: running BLAST against a reference database stored in /refdata.

Key points:

blast_db_path is declared as String (not File)
The path /refdata/blast_dbs/nt refers to the BLAST database index files
Inside the container, /refdata is automatically mounted by Cromwell
The actual files on Perlmutter are in <REFDATA>/blast_dbs/nt.*
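A workflow matching the key points above might look like the sketch below, written to a file from the shell. Only blast_db_path and the /refdata/blast_dbs/nt path come from this guide; the workflow and task names, the query_fasta input, the container image, and the runtime values are assumptions for illustration and should be adapted to your site.

```shell
# Write a hypothetical example workflow to blast_refdata.wdl. The quoted
# heredoc delimiter keeps the shell from expanding anything inside the WDL.
cat > blast_refdata.wdl <<'EOF'
version 1.0

workflow blast_refdata {
  input {
    File query_fasta
    # Declared as String so Cromwell passes the path through without staging it
    String blast_db_path = "/refdata/blast_dbs/nt"
  }

  call run_blastn {
    input:
      query = query_fasta,
      db = blast_db_path
  }

  output {
    File hits = run_blastn.hits
  }
}

task run_blastn {
  input {
    File query
    String db
  }

  command <<<
    blastn -query ~{query} -db ~{db} -outfmt 6 -out hits.tsv
  >>>

  output {
    File hits = "hits.tsv"
  }

  runtime {
    docker: "ncbi/blast:latest"   # assumed image; use the one your team maintains
    cpu: 4                        # assumed resource values
    memory: "8G"
  }
}
EOF
```

The query file is a File because it is a normal workflow input that Cromwell stages, while the database path is a String because it only resolves inside the container.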

Troubleshooting

Common Issues

Issue: “Path does not exist” error during manifest validation

Cause: Path listed in manifest file doesn’t exist or has a typo
Solution: Verify all paths with ls and ensure absolute paths are used

Issue: “Permission denied” error during sync

Cause: Files not readable by JAWS service account
Solution: Fix permissions with chmod -R g+rX,o-rwx <REFDATA>/<your_group_name>/ (the capital X keeps subdirectories traversable) and ensure the correct group with chgrp -R <GROUP> <REFDATA>/<your_group_name>/

Issue: Transfer never starts

Cause: Manifest file not detected or malformed
Solution: Ensure manifest file is named <USERNAME>_changes.txt and is in the <REFDATA> root directory. Wait at least 20 minutes after editing.

Issue: Files not appearing in WDL at runtime

Cause: Files may not have synced to the execution site yet
Solution: Check <REFDATA>/log/ for transfer status. Wait for sync completion before running workflows.

Issue: Symlinks not working in WDL

Cause: Symlinks are not supported
Solution: Copy actual files instead of creating symlinks

Getting Help

If you encounter issues:

Check transfer logs in <REFDATA>/log/
Verify permissions with ls -l
Contact JAWS support via Slack with details of the issue and any relevant log file contents for faster troubleshooting.
