Using Reference Data in Your WDLs
Overview
JAWS provides a centralized reference data system that allows you to store and access large, reusable datasets (such as BLAST databases, genome references, or annotation files) without copying them into every workflow submission. This approach:
Saves time: No need to upload or stage large files repeatedly
Reduces storage: Single copy shared across teams and users
Ensures consistency: All JAWS compute sites access the same synchronized data
Improves reproducibility: Reference data versions are stable and accessible across sites
Reference data is stored on NERSC Perlmutter and automatically synchronized to all JAWS execution sites. Inside your WDL workflows, reference data is accessible via the /refdata mount point.
Before You Start
Understanding Project Paths and Groups
Your project determines which base path you use, the compute sites where your data will sync, and the required Linux group ownership. Identify your project below:
| Project | Refdata Base Path (<REFDATA>) | Syncs to Sites | Linux Group |
|---|---|---|---|
| JGI | | tahoma, dori, jgi, crux, defiant | |
| NMDC | | nmdc_tahoma | |
| KBase | | jgi | |
| SuperBio | | NA | |
| RNome | | NA | |
Note
Throughout this guide, <REFDATA> refers to your project’s base path from the table above.
Prerequisites
Before adding reference data, ensure you have:
An account on NERSC Perlmutter
Membership in the appropriate Linux group (see table above)
SSH access to NERSC DTN (Data Transfer Nodes)
Initial Setup: Creating Your Reference Data Folder
This is a one-time setup to establish your group’s folder within the project refdata directory.
Step 1: Log into a DTN Node
Refdata directories are read-only on compute nodes. You must use a Data Transfer Node (DTN) to create folders and copy files.
ssh dtn01.nersc.gov   # any of dtn01 through dtn04 works
More details on DTN access: NERSC DTN Access
Step 2: Create Your Group Folder
Create a subdirectory under your project’s refdata path:
mkdir <REFDATA>/<your_group_name>
Example for JGI users:
mkdir /global/dna/shared/databases/jaws/refdata/myteam_references_dataset
Step 3: Set Correct Group Ownership
Ensure your folder belongs to the correct Linux group:
chgrp <GROUP> <REFDATA>/<your_group_name>
Example:
chgrp genome /global/dna/shared/databases/jaws/refdata/myteam_references_dataset
Step 4: Set Permissions
Set directory permissions to allow group read/write access:
chmod 2770 <REFDATA>/<your_group_name>
This ensures:
Owner and group can read, write, and traverse the directory
New files inherit the directory's group (setgid bit)
Others have no access
Verify permissions:
ls -ld <REFDATA>/<your_group_name>
You should see something like: drwxrws---
Warning
Files and directories must be readable by the JAWS service account. At minimum, ensure group read permissions (440 for files, 550 for directories).
Adding and Managing Data
Step 1: Copy Your Data to the DTN
You can copy data from several locations on NERSC or from external sources. NERSC locations include:
$HOME (your home directory)
/global/cfs (Community File System)
/global/pscratch (Perlmutter scratch)
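The copy itself can be sketched as follows, assuming a plain recursive cp is adequate for your data size. The function name copy_to_refdata and both of its arguments are hypothetical. The -L flag makes cp follow symbolic links and copy the files they point to, which matters because symlinks do not sync to remote sites.

```shell
# Hypothetical helper: copy a source directory into a refdata group folder,
# replacing any symbolic links with copies of the files they point to.
copy_to_refdata() {
    local src="$1"    # e.g. a directory under /global/cfs or $HOME
    local dest="$2"   # e.g. <REFDATA>/<your_group_name>/my_dataset

    if [ ! -d "$src" ]; then
        echo "source directory not found: $src" >&2
        return 1
    fi

    mkdir -p "$dest" || return 1
    # -R: recurse into directories; -L: follow symlinks and copy their targets
    cp -RL "$src"/. "$dest"/ || return 1

    # Report any symlinks that survived the copy (there should be none)
    if [ -n "$(find "$dest" -type l | head -n 1)" ]; then
        echo "warning: symlinks remain under $dest" >&2
        return 1
    fi
}

# Usage (run on a DTN; paths are placeholders):
#   copy_to_refdata /global/cfs/<your_source_dir> <REFDATA>/<your_group_name>/my_dataset
```

For very large datasets a resumable transfer tool may be a better fit than cp; this sketch only illustrates the dereferencing behavior.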
Step 2: Verify File Permissions
Ensure all files have correct group ownership and are readable by the JAWS service account:
ls -l <REFDATA>/<your_group_name>/
Expected file permissions: -r--r----- (440) or better.
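If needed, recursively fix ownership and permissions. Use symbolic modes rather than chmod -R 440, which would strip execute (traverse) permission from subdirectories and make their contents unreachable. The capital X adds execute only to directories and to files that are already executable:
chgrp -R <GROUP> <REFDATA>/<your_group_name>/
chmod -R u+rX,g+rX <REFDATA>/<your_group_name>/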
Warning
Symlinks are not supported. Do not use symbolic links (e.g., latest -> v10.4). Symlinks will not sync to remote sites. Always copy the actual files.
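A minimal sketch of an audit you could run before syncing, based on the minimums stated in this guide (440 for files, 550 for directories) and the no-symlink rule. The function name audit_refdata is hypothetical, and the check only inspects permission bits; it cannot confirm that the JAWS service account is actually a member of the owning group.

```shell
# Hypothetical helper: list entries under a group folder that fall below the
# documented minimums (files 440, directories 550) or that are symlinks.
audit_refdata() {
    local dir="$1"
    local problems

    problems=$(
        # Regular files missing owner-read or group-read (less than 440)
        find "$dir" -type f ! -perm -440 | sed 's/^/file below 440: /'
        # Directories missing read or execute for owner or group (less than 550)
        find "$dir" -type d ! -perm -550 | sed 's/^/directory below 550: /'
        # Symbolic links, which do not sync to remote sites
        find "$dir" -type l | sed 's/^/symlink: /'
    )

    if [ -n "$problems" ]; then
        printf '%s\n' "$problems"
        return 1
    fi
    echo "OK: $dir"
}

# Usage (path is a placeholder):
#   audit_refdata <REFDATA>/<your_group_name>
```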
Step 3: Create or Update the Manifest File
To trigger synchronization to all JAWS sites, you must create a manifest file that lists the data to sync.
Important
Manifest file naming convention: <USERNAME>_changes.txt
Example: dcassol_changes.txt
Create your manifest file in the refdata root:
vim <REFDATA>/<USERNAME>_changes.txt
Add full paths to your data (one path per line):
/global/dna/shared/databases/jaws/refdata/myteam_references_dataset/hg38.fa
/global/dna/shared/databases/jaws/refdata/myteam_references_dataset/hg38.fa.fai
/global/dna/shared/databases/jaws/refdata/myteam_references_dataset/annotations/
Note
You can specify individual files or entire directories
All paths must be absolute paths
Paths must exist and be readable by the JAWS service account
Adding a directory path will sync all contents recursively
Best Practice: Clear the manifest after each sync
If you add new paths to an existing manifest, Globus will re-validate all listed paths (old and new). Although Globus won’t re-copy unchanged files, listing them causes unnecessary processing. To avoid confusion, clear the manifest after successful sync, then add only new paths for future syncs.
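Before saving a manifest, it can help to run the same checks the sync applies to each listed path. The sketch below is an assumption about how such a pre-flight check might look, not the daemon's actual code. The function name validate_manifest is hypothetical, and readability is tested for the current user only, not for the JAWS service account.

```shell
# Hypothetical pre-flight check for a manifest file. It mirrors the documented
# rules (absolute, existing, readable) and also flags symlinks.
validate_manifest() {
    local manifest="$1"
    local status=0
    local path

    while IFS= read -r path || [ -n "$path" ]; do
        [ -z "$path" ] && continue            # skip blank lines
        case "$path" in
            /*) ;;                            # absolute path: keep checking
            *)  echo "not absolute: $path" >&2; status=1; continue ;;
        esac
        if [ -L "$path" ]; then
            echo "symlink (will not sync): $path" >&2; status=1
        elif [ ! -e "$path" ]; then
            echo "does not exist: $path" >&2; status=1
        elif [ ! -r "$path" ]; then
            echo "not readable: $path" >&2; status=1
        fi
    done < "$manifest"

    return "$status"
}

# Usage (path is a placeholder):
#   validate_manifest <REFDATA>/<USERNAME>_changes.txt && echo "manifest looks valid"
```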
Step 4: Wait for Sync to Complete
A background daemon checks for manifest changes every 20 minutes. When your manifest is detected:
The daemon validates all paths (must be absolute, existing, and readable)
Paths are submitted to the Globus API for transfer
Globus copies files to all remote JAWS sites
Status files are written to <REFDATA>/log/<USERNAME>_changes.txt_<SITE>.complete (or .failed if the transfer did not succeed)
Typical sync time: Transfers usually start within 20 minutes of a manifest change. Completion time depends on data size and network speed.
Step 5: Check Transfer Status
After transfer completion, look for status files in the log directory:
Success:
ls <REFDATA>/log/<USERNAME>_changes.txt_<SITE>.complete
Failure:
ls <REFDATA>/log/<USERNAME>_changes.txt_<SITE>.failed
Hint
If a transfer fails, check the corresponding log file for error details. Common issues include incorrect permissions, missing files, or invalid paths. The JAWS support team is automatically alerted to failures. However, please also contact JAWS support via Slack with details of the failure to expedite troubleshooting.
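Checking each site by hand gets tedious when data syncs to several sites. The sketch below summarizes all status files for one user. It assumes only that status files start with <USERNAME>_changes.txt and end in .complete or .failed; the function name check_sync_status and its return codes are hypothetical.

```shell
# Hypothetical helper: summarize sync status files for one user.
# Returns 0 if every status file is .complete, 1 if any is .failed,
# and 2 if no status files exist yet.
check_sync_status() {
    local log_dir="$1"    # e.g. <REFDATA>/log
    local user="$2"
    local found=0 failed=0
    local f

    for f in "$log_dir/${user}_changes.txt"*; do
        [ -e "$f" ] || continue               # glob matched nothing
        case "$f" in
            *.complete) echo "complete: ${f##*/}"; found=1 ;;
            *.failed)   echo "FAILED:   ${f##*/}"; found=1; failed=1 ;;
        esac
    done

    if [ "$found" -eq 0 ]; then
        echo "no status files yet for $user"
        return 2
    fi
    return "$failed"
}

# Usage (path is a placeholder):
#   check_sync_status <REFDATA>/log "$USER"
```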
Removing Files
To remove reference data from all JAWS sites:
Step 1: Delete from Source
On the DTN, delete files from your group folder:
rm <REFDATA>/<your_group_name>/old_database/*
Step 2: Wait for Monthly Sync
File deletions propagate automatically during the monthly full sync. You do not need to update your manifest file or take any additional action.
Warning
Deleted files will remain accessible on remote sites until the next monthly sync completes.
How Synchronization Works
Understanding the automation behind refdata sync can help you troubleshoot issues and plan your data management.
Background Daemons
Manifest Monitor Daemon: Runs every 20 minutes
Detects new or modified <USERNAME>_changes.txt files
Validates that all listed paths are:
Absolute paths (start with /)
Existing on the filesystem
Readable by the JAWS service account
Submits validated paths to the Globus API for transfer
Transfer Monitor Daemon: Runs continuously
Monitors active Globus transfers via API
Writes progress to log files
Updates ElasticSearch for dashboard visibility (future feature)
Alerts:
JAWS support team is notified of all transfer failures
Monthly Full Sync
Once per month, Globus performs a full directory sync from Perlmutter to all remote sites. This sync:
Ensures all sites have identical copies
Deletes files at destination sites that no longer exist at the source
This is how file removals propagate (see previous section).
Using Reference Data in WDLs
Accessing Reference Data Paths
Inside your WDL workflows, reference data is mounted at /refdata. The structure mirrors your group folder on Perlmutter:
On Perlmutter:
<REFDATA>/<your_group_name>/database.fa
In WDL container:
/refdata/<your_group_name>/database.fa
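The mapping is a simple prefix swap: the project base path is replaced by /refdata. A minimal sketch, with the hypothetical function name to_container_path, that turns a Perlmutter path into the value you would put in an inputs JSON:

```shell
# Hypothetical helper: translate a Perlmutter refdata path into the path
# seen inside the WDL container, where the base path is mounted at /refdata.
to_container_path() {
    local base="${1%/}"   # project base path (<REFDATA>), trailing slash removed
    local path="$2"

    case "$path" in
        "$base"/*) printf '/refdata/%s\n' "${path#"$base"/}" ;;
        *) echo "not under $base: $path" >&2; return 1 ;;
    esac
}

# Usage with the JGI example path from this guide:
#   to_container_path /global/dna/shared/databases/jaws/refdata \
#       /global/dna/shared/databases/jaws/refdata/myteam_references_dataset/hg38.fa
```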
Critical: Use String Type, Not File
Warning
Always declare reference data paths as String, not File.
WDL variables declared as File are staged into Cromwell’s execution directory.
Since /refdata does not exist outside the container, Cromwell will fail to validate the path during input processing.
Correct:
String reference_db = "/refdata/<your_group_name>/hg38.fa"
Incorrect:
File reference_db = "/refdata/<your_group_name>/hg38.fa" # Will fail!
Example WDL: BLAST Against Reference Database
This example demonstrates a realistic use case: running BLAST against a reference database stored in /refdata.
WDL (blast_workflow.wdl):
version 1.0

workflow blast_search {
    input {
        File query_sequences
        String blast_db_path
        String blast_db_name
        Int num_threads = 4
    }

    call run_blastn {
        input:
            query = query_sequences,
            db_path = blast_db_path,
            db_name = blast_db_name,
            threads = num_threads
    }

    output {
        File blast_results = run_blastn.results
        File blast_log = run_blastn.log
    }
}

task run_blastn {
    input {
        File query
        String db_path  # Path to refdata directory (e.g., /refdata/blast_dbs)
        String db_name  # Database name (e.g., nt)
        Int threads
    }

    command <<<
        set -euo pipefail
        # The database path is mounted at /refdata inside the container
        # Construct full database path: /refdata/blast_dbs/nt
        DB_FULL_PATH="~{db_path}/~{db_name}"
        echo "Running BLASTN against ${DB_FULL_PATH}" > blast.log
        echo "Query file: ~{query}" >> blast.log
        blastn \
            -query ~{query} \
            -db ${DB_FULL_PATH} \
            -out results.txt \
            -outfmt 6 \
            -num_threads ~{threads} \
            -max_target_seqs 10
        echo "BLASTN completed successfully" >> blast.log
    >>>

    runtime {
        docker: "ncbi/blast:latest"
        cpu: threads
        memory: "8 GB"
        runtime_minutes: 120
    }

    output {
        File results = "results.txt"
        File log = "blast.log"
    }
}
Inputs JSON (blast_inputs.json):
{
    "blast_search.query_sequences": "/path/to/my_sequences.fasta",
    "blast_search.blast_db_path": "/refdata/blast_dbs",
    "blast_search.blast_db_name": "nt",
    "blast_search.num_threads": 8
}
Key points:
blast_db_path is declared as String (not File)
The path /refdata/blast_dbs/nt refers to the BLAST database index files
Inside the container, /refdata is automatically mounted by Cromwell
The actual files on Perlmutter are in <REFDATA>/blast_dbs/nt.*
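Because a String input is not validated by Cromwell, a mistyped or not-yet-synced database path only surfaces when blastn runs. One option is a small guard at the top of the task's command block that fails early with a clear message. This is a sketch under that assumption; the function name require_blast_db is hypothetical and it checks only that some readable file matches the database prefix, not that the index is complete.

```shell
# Hypothetical guard for the top of a task's command block: fail early if no
# files exist for the given BLAST database prefix (e.g. /refdata/blast_dbs/nt).
require_blast_db() {
    local prefix="$1"
    local f

    for f in "$prefix".*; do
        if [ -r "$f" ]; then
            return 0                          # at least one readable index file
        fi
    done
    echo "no readable files matching ${prefix}.*" >&2
    return 1
}

# Usage inside the command block, before calling blastn:
#   require_blast_db "${DB_FULL_PATH}"
```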
Troubleshooting
Common Issues
Issue: “Path does not exist” error during manifest validation
Cause: Path listed in manifest file doesn’t exist or has a typo
Solution: Verify all paths with ls and ensure absolute paths are used
Issue: “Permission denied” error during sync
Cause: Files not readable by JAWS service account
Solution: Fix permissions with chmod -R u+rX,g+rX <REFDATA>/<your_group_name>/ (symbolic modes keep subdirectories traversable, unlike a recursive chmod 440) and ensure the correct group with chgrp -R <GROUP> <REFDATA>/<your_group_name>/
Issue: Transfer never starts
Cause: Manifest file not detected or malformed
Solution: Ensure the manifest file is named <USERNAME>_changes.txt and is in the <REFDATA> root directory. Wait at least 20 minutes after editing.
Issue: Files not appearing in WDL at runtime
Cause: Files may not have synced to the execution site yet
Solution: Check <REFDATA>/log/ for transfer status. Wait for sync completion before running workflows.
Issue: Symlinks not working in WDL
Cause: Symlinks are not supported
Solution: Copy actual files instead of creating symlinks
Getting Help
If you encounter issues:
Check transfer logs in <REFDATA>/log/
Verify permissions with ls -l
Contact JAWS support via Slack with details of the issue and any relevant log file contents for faster troubleshooting.