=================================
Using Reference Data in Your WDLs
=================================

.. role:: bash(code)
   :language: bash

JAWS provides a dedicated location to store large, reusable files (e.g., NR databases) that won't be copied every time a WDL is submitted. This saves time and resources while ensuring data availability across all JAWS sites.

Creating Your Reference Data Folder
===================================

On Perlmutter, you can create a reference data folder based on your project. Ensure the appropriate Linux group and permissions are set for your files.

+---------+----------------------------------------------+---------------------------+---------------+
| USERS   | Refdata Path                                 | Sync Sites Locations      | Linux Group   |
+=========+==============================================+===========================+===============+
| JGI     | `/global/dna/shared/databases/jaws/refdata`  | `tahoma`, `dori`, `jgi`   | `gemone`      |
+---------+----------------------------------------------+---------------------------+---------------+
| NMDC    | `/global/cfs/cdirs/m3408/refdata`            | `nmdc_tahoma`             | `m3408`       |
+---------+----------------------------------------------+---------------------------+---------------+
| KBase   | `/global/dna/kbase/reference/jaws`           | `jgi`                     | `kbase`       |
+---------+----------------------------------------------+---------------------------+---------------+

- Files must belong to the correct Linux group with minimum permissions of `440` (read-only for group and owner).
- Create a folder under the appropriate project path, e.g., :bash:`/`. You must log into a DTN node to do this (e.g., ssh `dtn`), because the directory is read-only on compute nodes.
- Files added to these folders and listed in your manifest file (described below) are synced automatically to all associated JAWS sites.
- You can access these files within your WDL using the path: :bash:`/refdata/`.

Adding Data to `refdata` Directory
==================================

**1. Accessing the Directory**: Log into a `DTN` node (e.g., ssh `dtn`) to create and manage your folders and files, as compute nodes are read-only.

**2. Setting Permissions**: Ensure your folders and files are readable by the `jaws` service account user. Recommended permissions: `drwxrwsr-x+`. Check group ownership and permissions with `ls -l`, and set them with `chgrp` and `chmod`.

**3. Copying Data**: You can copy data from:

- Global home
- Global common
- Community File System (CFS)

Note: `/pscratch` is not accessible directly unless using Globus.

**4. Using Globus for Fast Transfers**: Globus is the fastest and most reliable method for copying large data. Use the `NERSC Perlmutter` ==> `NERSC DTN` endpoints. Globus can read data from `/pscratch`.

**5. NO Symlinks**: Symlinks (e.g., `latest -> v10.4`) are not supported and will not be maintained across sites. Always copy the actual files.

**6. Updating the Manifest File**: Besides adding your files to your group folder, you need to copy and paste their full paths into a "manifest file" that is saved in the refdata root. You create your own manifest file, and it should be named like `_changes.txt`. Modifying this file triggers Globus to copy your files to all the other JAWS sites. For example:

- If you add `//foobar/tiny.fastq` to `_changes.txt`, Globus will initiate a copy to all sites within 20 minutes.
- You can also add folders, e.g., `//foobar` would copy everything inside it.
- If you add new file paths to an existing manifest that already contains previously copied paths, all of them (including the old ones) will be included in the list for copying. However, Globus won't re-copy a file if it already exists at the destination and hasn't changed. The best practice is to clear the manifest before adding new file paths, to avoid confusion.

**7. Transfer Status Check**: After the refdata Globus transfer completes successfully, a `_changes.txt_.json..submitted.SUCCEEDED` file will be created under `/log`. If the transfer fails, a `_changes.txt_.json..submitted.FAILED` file will be created instead.

How it Works
============

A daemon runs in the background and checks every 20 minutes for modifications to any `_changes.txt` file. If a new manifest is created, or the contents of an existing one change, the daemon validates that the listed paths:

1. are full paths,
2. exist, and
3. are readable by the `jaws` service account user.

The daemon gathers all the paths from every `_changes.txt` file that has been modified and sends them to the Globus API for transfer. Another daemon monitors the transfer via the Globus API and periodically writes to a log. This log is tracked by our monitoring system (the Elasticsearch stack) so that you'll be able to see the status on the JAWS dashboard (not available yet). Finally, the JAWS team is alerted if a transfer fails.

Removing Files
==============

To remove files, just delete them from your group folder. You don't need to do anything else: once a month, Globus copies all folders and, in doing so, deletes anything on the destination site that is not on the source site.

How to use `refdata` in your WDLs
=================================

Use `/refdata` as the root in your WDLs. For example, to run a blast command in your WDL, you would point to the database like this: `blastn -db /refdata/nt_test/nt`, where `nt_test` is the folder where you saved all the blast index files.

.. hint::
   In your WDL, the input type for `refdata` files should be specified as `String` and not `File`. Variables specified with `File` are copied into Cromwell's working directory, and since `/refdata` doesn't exist outside the container, JAWS will fail to validate the path and you'll get an error.

.. dropdown:: Example
   :color: info
   :animate: fade-in

   WDL Example

   .. code-block:: text

      version 1.0

      workflow refdata_wf {
        call task1 { }
      }

      task task1 {
        command <<<
          # How to access reference data. The command runs in a
          # docker container, and the path to refdata outside the
          # container is mounted as "/refdata" inside the container.
          # The mount is configured in the Cromwell config file.
          ls /refdata/nt_test
        >>>

        runtime {
          docker: "ubuntu:latest"
          cpu: 1
          memory: "1G"
        }

        output {
          String outfile = stdout()
        }
      }
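
The `String`-versus-`File` advice can be illustrated with a slightly fuller workflow. The following is a minimal sketch, not a definitive implementation: the workflow and task names, the `query_fasta` input, the `ncbi/blast:latest` image, and the resource values are assumptions chosen for illustration, and the database path reuses the `nt_test` example from above. The refdata path is declared as `String` so that Cromwell does not try to localize it, while the query file is an ordinary `File` input.

.. code-block:: text

   version 1.0

   workflow blast_refdata_wf {
     input {
       # Declared as String, not File, so the path is passed through
       # unchanged instead of being copied into the working directory.
       String blast_db = "/refdata/nt_test/nt"
       # Hypothetical query input; a regular File is fine here.
       File query_fasta
     }

     call run_blastn {
       input:
         blast_db = blast_db,
         query_fasta = query_fasta
     }

     output {
       File hits = run_blastn.hits
     }
   }

   task run_blastn {
     input {
       String blast_db
       File query_fasta
     }

     command <<<
       # /refdata only exists inside the container, so the database
       # path is resolved here at run time.
       blastn -db ~{blast_db} -query ~{query_fasta} -outfmt 6 -out hits.tsv
     >>>

     runtime {
       # Assumed image and resources; adjust to your own container.
       docker: "ncbi/blast:latest"
       cpu: 1
       memory: "1G"
     }

     output {
       File hits = "hits.tsv"
     }
   }

Keeping the database location as a workflow-level `String` with a default also makes it easy to point the same WDL at a different folder under `/refdata` from the inputs JSON.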