=================================
Using Reference Data in Your WDLs
=================================

.. role:: bash(code)
   :language: bash

Overview
========

JAWS provides a centralized reference data system that allows you to store and access large, reusable datasets (such as BLAST databases, genome references, or annotation files) 
without copying them into every workflow submission. This approach:

- **Saves time**: No need to upload or stage large files repeatedly
- **Reduces storage**: Single copy shared across teams and users
- **Ensures consistency**: All JAWS compute sites access the same synchronized data
- **Improves reproducibility**: Reference data versions are stable and accessible across sites

Reference data is stored on NERSC Perlmutter and automatically synchronized to all JAWS execution sites. Inside your WDL workflows, reference data is accessible via the ``/refdata`` mount point.


Before You Start
================

Understanding Project Paths and Groups
--------------------------------------

Your project determines which base path you use, the compute sites where your data will sync, and the required Linux group ownership. Identify your project below:

+------------+----------------------------------------------+--------------------------------------+---------------+
|  Project   |  Refdata Base Path (``<REFDATA>``)           |  Syncs to Sites                      |  Linux Group  |
+============+==============================================+======================================+===============+
|  JGI       | ``/global/dna/shared/databases/jaws/refdata``|  tahoma, dori, jgi, crux, defiant    |  ``gemone``   |
+------------+----------------------------------------------+--------------------------------------+---------------+
|  NMDC      | ``/global/cfs/cdirs/m3408/refdata``          |  nmdc_tahoma                         |  ``m3408``    |
+------------+----------------------------------------------+--------------------------------------+---------------+
|  KBase     | ``/global/dna/kbase/reference/jaws``         |  jgi                                 |  ``kbase``    |
+------------+----------------------------------------------+--------------------------------------+---------------+
|  SuperBio  | ``/global/cfs/cdirs/m4731/refdata``          |  NA                                  |  ``slacjaw``  |
+------------+----------------------------------------------+--------------------------------------+---------------+
|  RNome     | ``/global/cfs/cdirs/m5243/refdata``          |  NA                                  |  ``m5243``    |
+------------+----------------------------------------------+--------------------------------------+---------------+

.. note::
   
   Throughout this guide, ``<REFDATA>`` refers to your project's base path from the table above.

Prerequisites
-------------

Before adding reference data, ensure you have:

- An account on NERSC Perlmutter
- Membership in the appropriate Linux group (see table above)
- SSH access to NERSC DTN (Data Transfer Nodes)
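
One way to confirm the group prerequisite is to compare the output of ``id`` with the Linux group from the table above. This is a minimal sketch; ``in_group`` is a hypothetical helper name used only for illustration, not a JAWS or NERSC command.

.. code-block:: bash

   # Hypothetical helper: succeed if the current user belongs to the given Linux group.
   in_group() {
     id -nG | tr ' ' '\n' | grep -qx -- "$1"
   }

   # Example: replace m3408 with the Linux group for your project
   #   in_group m3408 && echo "member" || echo "not a member"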


Initial Setup: Creating Your Reference Data Folder
===================================================

This is a **one-time setup** to establish your group's folder within the project refdata directory.

Step 1: Log into a DTN Node
----------------------------

Refdata directories are **read-only on compute nodes**. You must use a Data Transfer Node (DTN) to create folders and copy files.

.. code-block:: bash

   ssh dtn01.nersc.gov   # or dtn02, dtn03, dtn04

More details on DTN access: `NERSC DTN Access <https://docs.nersc.gov/systems/dtn/#access>`_

Step 2: Create Your Group Folder
---------------------------------

Create a subdirectory under your project's refdata path:

.. code-block:: bash

   mkdir <REFDATA>/<your_group_name>

**Example for JGI users:**

.. code-block:: bash

   mkdir /global/dna/shared/databases/jaws/refdata/myteam_references_dataset

Step 3: Set Correct Group Ownership
------------------------------------

Ensure your folder belongs to the correct Linux group:

.. code-block:: bash

   chgrp <GROUP> <REFDATA>/<your_group_name>

**Example:**

.. code-block:: bash

   chgrp gemone /global/dna/shared/databases/jaws/refdata/myteam_references_dataset

Step 4: Set Permissions
------------------------

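Set directory permissions so that the owner has full access, group members can read and enter the folder, and new files inherit the folder's group (the setgid bit):

.. code-block:: bash

   chmod 2750 <REFDATA>/<your_group_name>

This ensures:

- The owner can read, write, and enter the directory
- Group members can read and enter the directory
- New files and subdirectories inherit the directory's group
- Others have no access

If other members of your group also need to add files, use ``2770`` instead (shown by ``ls`` as ``drwxrws---``).

**Verify permissions:**

.. code-block:: bash

   ls -ld <REFDATA>/<your_group_name>

You should see something like: ``drwxr-s---``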

.. warning::
   
   Files and directories **must be readable by the JAWS service account**. At minimum, ensure group read permissions (``440`` for files, ``550`` for directories). 
   

Adding and Managing Data
=========================

Step 1: Copy Your Data to the DTN
----------------------------------

From a DTN, copy your data into your group folder (``<REFDATA>/<your_group_name>``). You can copy data from several locations on NERSC or from external sources. NERSC locations include:

- ``$HOME`` (your home directory)
- ``/global/cfs`` (Community File System)
- ``/global/pscratch`` (Perlmutter scratch)
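
The copy itself can be sketched as follows. ``copy_to_refdata`` is a hypothetical helper, and the source path in the usage comment is a placeholder. It uses ``cp -RL`` so that any symbolic links in the source are replaced by the files they point to.

.. code-block:: bash

   # Hypothetical helper: copy a file or directory into your group folder.
   # -R recurses into directories; -L follows symlinks so the real files
   # are copied rather than the links themselves.
   copy_to_refdata() {
     local src=$1 dest_dir=$2
     cp -RL -- "$src" "$dest_dir"/
   }

   # Example (run on a DTN):
   #   copy_to_refdata "$HOME/blast_dbs" "<REFDATA>/<your_group_name>"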


Step 2: Verify File Permissions
--------------------------------

Ensure all files have correct group ownership and are readable by the JAWS service account:

.. code-block:: bash

   ls -l <REFDATA>/<your_group_name>/

Expected file permissions: ``-r--r-----`` (440) or better.
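
For a large tree, scanning ``ls -l`` output by eye is error-prone. The sketch below lists only the entries that fall short of the minimum (group read on files, group read and execute on directories). ``find_group_unreadable`` is a hypothetical helper name.

.. code-block:: bash

   # Hypothetical helper: list entries under a directory that the group cannot use.
   # Prints files lacking group read (mode bit 040) and directories lacking
   # group read+execute (mode bits 050). No output means the minimum is met.
   find_group_unreadable() {
     find "$1" \( -type f ! -perm -040 \) -o \( -type d ! -perm -050 \)
   }

   # Example:
   #   find_group_unreadable "<REFDATA>/<your_group_name>"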

If needed, recursively fix permissions:

.. code-block:: bash

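   chgrp -R <GROUP> <REFDATA>/<your_group_name>/
   # Files: read-only for owner and group
   find <REFDATA>/<your_group_name>/ -type f -exec chmod 440 {} +
   # Subdirectories: keep the execute bit so they stay traversable (550 is the minimum)
   find <REFDATA>/<your_group_name>/ -mindepth 1 -type d -exec chmod 750 {} +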

.. warning::
   
   **Symlinks are not supported.** Do not use symbolic links (e.g., ``latest -> v10.4``). Symlinks will not sync to remote sites. Always copy the actual files.
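
To check a folder for symbolic links before listing it in a manifest, something like the following can be used. ``list_symlinks`` is a hypothetical helper name.

.. code-block:: bash

   # Hypothetical helper: print every symbolic link under a directory.
   # An empty result means the tree contains only real files and directories.
   list_symlinks() {
     find "$1" -type l
   }

   # Example:
   #   list_symlinks "<REFDATA>/<your_group_name>"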

Step 3: Create or Update the Manifest File
-------------------------------------------

To trigger synchronization to all JAWS sites, you must create a **manifest file** that lists the data to sync.

.. important::
   
   **Manifest file naming convention:** ``<USERNAME>_changes.txt``
   
   Example: ``dcassol_changes.txt``

**Create your manifest file in the refdata root:**

.. code-block:: bash

   vim <REFDATA>/<USERNAME>_changes.txt

**Add full paths to your data** (one path per line):

.. code-block:: text

   /global/dna/shared/databases/jaws/refdata/myteam_references_dataset/hg38.fa
   /global/dna/shared/databases/jaws/refdata/myteam_references_dataset/hg38.fa.fai
   /global/dna/shared/databases/jaws/refdata/myteam_references_dataset/annotations/

.. note::
   
   - You can specify individual files or entire directories
   - All paths must be **absolute paths**
   - Paths must **exist** and be **readable by the JAWS service account**
   - Adding a directory path will sync all contents recursively
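
Before saving a manifest, the rules above can be checked locally. The sketch below is an approximation, not the daemon's actual validation: ``check_manifest`` is a hypothetical helper, it assumes blank lines may be skipped, and it tests readability as the current user rather than as the JAWS service account.

.. code-block:: bash

   # Hypothetical helper: check each line of a manifest against the rules in
   # this guide (absolute path, exists, readable). Prints problems to stderr
   # and returns non-zero if any line fails.
   check_manifest() {
     local manifest=$1 status=0 path
     while IFS= read -r path || [ -n "$path" ]; do
       [ -z "$path" ] && continue          # assumption: blank lines are ignored
       case $path in
         /*) ;;                            # absolute path: OK
         *) echo "not absolute: $path" >&2; status=1; continue ;;
       esac
       if [ ! -e "$path" ]; then
         echo "does not exist: $path" >&2; status=1
       elif [ ! -r "$path" ]; then
         echo "not readable: $path" >&2; status=1
       fi
     done < "$manifest"
     return "$status"
   }

   # Example:
   #   check_manifest "<REFDATA>/<USERNAME>_changes.txt" && echo "manifest looks valid"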

**Best Practice: Clear the manifest after each sync**

If you add new paths to an existing manifest, Globus will re-validate all listed paths (old and new). 
Although Globus won't re-copy unchanged files, listing them causes unnecessary processing. 
To avoid confusion, clear the manifest after successful sync, then add only new paths for future syncs.
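
A minimal sketch of this practice: the hypothetical helper ``reset_manifest`` empties the manifest and then writes only the paths passed to it, one per line. Called with no paths, it simply clears the file.

.. code-block:: bash

   # Hypothetical helper: replace the manifest contents with only the new paths.
   # Usage: reset_manifest <manifest_file> [<path> ...]
   reset_manifest() {
     local manifest=$1
     shift
     : > "$manifest"                        # truncate (or create) the manifest
     if [ "$#" -gt 0 ]; then
       printf '%s\n' "$@" >> "$manifest"    # one path per line
     fi
   }

   # Example: clear after a successful sync
   #   reset_manifest "<REFDATA>/<USERNAME>_changes.txt"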

Step 4: Wait for Sync to Complete
----------------------------------

A background daemon checks for manifest changes **every 20 minutes**. When your manifest is detected:

1. The daemon validates all paths (must be absolute, existing, and readable)
2. Paths are submitted to the Globus API for transfer
3. Globus copies files to all remote JAWS sites
4. Status logs are written to ``<REFDATA>/log/<USERNAME>_changes.txt_<SITE>.complete`` (or ``.failed`` if the transfer fails)

**Typical sync time:** Transfers usually start within 20 minutes. How long they take to complete depends on data size and network speed.

Step 5: Check Transfer Status
------------------------------

After transfer completion, look for status files in the log directory:

**Success:**

.. code-block:: bash

   ls <REFDATA>/log/<USERNAME>_changes.txt_<SITE>.complete

**Failure:**

.. code-block:: bash

   ls <REFDATA>/log/<USERNAME>_changes.txt_<SITE>.failed

.. hint::
   
   If a transfer fails, check the corresponding log file for error details.
   Common issues include incorrect permissions, missing files, or invalid paths. The JAWS support team is automatically alerted to failures.
   Even so, please contact JAWS support via Slack with details of the failure to expedite troubleshooting.
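
The two checks can be combined into one small function. This is a sketch that assumes the status file names shown in the ``ls`` commands above. ``sync_status`` is a hypothetical helper, and ``pending`` is just the label it prints when neither status file is present.

.. code-block:: bash

   # Hypothetical helper: report the sync status of one manifest for one site.
   # Arguments: <log directory> <username> <site>
   sync_status() {
     local log_dir=$1 user=$2 site=$3
     local base="$log_dir/${user}_changes.txt_${site}"
     if [ -e "$base.failed" ]; then
       echo failed
     elif [ -e "$base.complete" ]; then
       echo complete
     else
       echo pending
     fi
   }

   # Example:
   #   sync_status "<REFDATA>/log" "<USERNAME>" "<SITE>"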


Removing Files
==============

To remove reference data from all JAWS sites:

**Step 1: Delete from Source**

On the DTN, delete files from your group folder:

.. code-block:: bash

   rm <REFDATA>/<your_group_name>/old_database/*

**Step 2: Wait for Monthly Sync**

File deletions propagate automatically during the **monthly full sync**. You do not need to update your manifest file or take any additional action.

.. warning::
   
   Deleted files will remain accessible on remote sites **until the next monthly sync** completes.


How Synchronization Works
==========================

Understanding the automation behind refdata sync can help you troubleshoot issues and plan your data management.

Background Daemons
------------------

1. **Manifest Monitor Daemon**: Runs every 20 minutes
   
   - Detects new or modified ``<USERNAME>_changes.txt`` files
   - Validates that all listed paths are:
     
     - Absolute paths (start with ``/``)
     - Existing on the filesystem
     - Readable by the JAWS service account
   
   - Submits validated paths to the Globus API for transfer

2. **Transfer Monitor Daemon**: Runs continuously
   
   - Monitors active Globus transfers via API
   - Writes progress to log files
   - Updates Elasticsearch for dashboard visibility (future feature)

3. **Alerts**: 
   
   - JAWS support team is notified of all transfer failures

Monthly Full Sync
-----------------

Once per month, Globus performs a **full directory sync** from Perlmutter to all remote sites. This sync:

- Ensures all sites have identical copies
- **Deletes files at destination sites that no longer exist at the source**

This is how file removals propagate (see previous section).


Using Reference Data in WDLs
=============================

Accessing Reference Data Paths
-------------------------------

Inside your WDL workflows, reference data is mounted at ``/refdata``. The structure mirrors your group folder on Perlmutter:

- **On Perlmutter**: ``<REFDATA>/<your_group_name>/database.fa``
- **In WDL container**: ``/refdata/<your_group_name>/database.fa``
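
The mapping is a simple prefix swap, which the sketch below illustrates. ``to_wdl_path`` is a hypothetical helper: it takes your project's base path and a full Perlmutter path, and prints the path to use inside a WDL task.

.. code-block:: bash

   # Hypothetical helper: translate a Perlmutter refdata path into the path
   # seen inside a WDL task. Arguments: <REFDATA base path> <full path>.
   to_wdl_path() {
     local base=${1%/} full=$2
     case $full in
       "$base"/*) printf '/refdata/%s\n' "${full#"$base"/}" ;;
       *) echo "not under $base: $full" >&2; return 1 ;;
     esac
   }

   # Example:
   #   to_wdl_path /global/cfs/cdirs/m3408/refdata \
   #               /global/cfs/cdirs/m3408/refdata/myteam/hg38.fa
   #   prints /refdata/myteam/hg38.fa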

Critical: Use ``String`` Type, Not ``File``
--------------------------------------------

.. warning::
   
   **Always declare reference data paths as** ``String``, **not** ``File``.
   
   WDL variables declared as ``File`` are staged into Cromwell's execution directory. 
   Since ``/refdata`` does not exist outside the container, Cromwell will fail to validate the path during input processing.

**Correct:**

.. code-block:: text

   String reference_db = "/refdata/<your_group_name>/hg38.fa"

**Incorrect:**

.. code-block:: text

   File reference_db = "/refdata/<your_group_name>/hg38.fa"  # Will fail!

Example WDL: BLAST Against Reference Database
----------------------------------------------

This example demonstrates a realistic use case: running BLAST against a reference database stored in ``/refdata``.

.. dropdown:: **WDL (blast_workflow.wdl):**
    :color: info
    :animate: fade-in
    
    WDL Example

    .. code-block:: text

      version 1.0
      
      workflow blast_search {
        input {
          File query_sequences
          String blast_db_path
          String blast_db_name
          Int num_threads = 4
        }
        
        call run_blastn {
          input:
            query = query_sequences,
            db_path = blast_db_path,
            db_name = blast_db_name,
            threads = num_threads
        }
        
        output {
          File blast_results = run_blastn.results
          File blast_log = run_blastn.log
        }
      }
      
      task run_blastn {
        input {
          File query
          String db_path      # Path to refdata directory (e.g., /refdata/blast_dbs)
          String db_name      # Database name (e.g., nt)
          Int threads
        }
        
        command <<<
          set -euo pipefail
          
          # The database path is mounted at /refdata inside the container
          # Construct full database path: /refdata/blast_dbs/nt
          DB_FULL_PATH="~{db_path}/~{db_name}"
          
          echo "Running BLASTN against ${DB_FULL_PATH}" > blast.log
          echo "Query file: ~{query}" >> blast.log
          
          blastn \
            -query "~{query}" \
            -db "${DB_FULL_PATH}" \
            -out results.txt \
            -outfmt 6 \
            -num_threads ~{threads} \
            -max_target_seqs 10
          
          echo "BLASTN completed successfully" >> blast.log
        >>>
        
        runtime {
          docker: "ncbi/blast:latest"
          cpu: threads
          memory: "8 GB"
          runtime_minutes: 120
        }
        
        output {
          File results = "results.txt"
          File log = "blast.log"
        }
      }

    **Inputs JSON (blast_inputs.json):**

    .. code-block:: json

      {
        "blast_search.query_sequences": "/path/to/my_sequences.fasta",
        "blast_search.blast_db_path": "/refdata/blast_dbs",
        "blast_search.blast_db_name": "nt",
        "blast_search.num_threads": 8
      }

**Key points:**

- ``blast_db_path`` is declared as ``String`` (not ``File``)
- The path ``/refdata/blast_dbs/nt`` refers to the BLAST database index files
- Inside the container, ``/refdata`` is automatically mounted by Cromwell
- The actual files on Perlmutter are in ``<REFDATA>/blast_dbs/nt.*``


Troubleshooting
===============

Common Issues
-------------

**Issue: "Path does not exist" error during manifest validation**

- **Cause**: Path listed in manifest file doesn't exist or has a typo
- **Solution**: Verify all paths with ``ls`` and ensure absolute paths are used

**Issue: "Permission denied" error during sync**

- **Cause**: Files not readable by JAWS service account
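- **Solution**: Fix permissions so that files are ``440`` and directories are at least ``550`` (a blanket ``chmod -R 440`` also strips the execute bit from subdirectories, making them untraversable): ``find <REFDATA>/<your_group_name> -type f -exec chmod 440 {} +`` and ``find <REFDATA>/<your_group_name> -mindepth 1 -type d -exec chmod 750 {} +``. Then ensure the correct group: ``chgrp -R <GROUP> <REFDATA>/<your_group_name>``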

**Issue: Transfer never starts**

- **Cause**: Manifest file not detected or malformed
- **Solution**: Ensure manifest file is named ``<USERNAME>_changes.txt`` and is in the ``<REFDATA>`` root directory. Wait at least 20 minutes after editing.

**Issue: Files not appearing in WDL at runtime**

- **Cause**: Files may not have synced to the execution site yet
- **Solution**: Check ``<REFDATA>/log/`` for transfer status. Wait for sync completion before running workflows.

**Issue: Symlinks not working in WDL**

- **Cause**: Symlinks are not supported
- **Solution**: Copy actual files instead of creating symlinks

Getting Help
------------

If you encounter issues:

1. Check transfer logs in ``<REFDATA>/log/``
2. Verify permissions with ``ls -l``
3. Contact JAWS support via Slack with details of the issue and any relevant log file contents for faster troubleshooting.