============================
JAWS Troubleshooting RoadMap
============================

.. role:: bash(code)
    :language: bash

.. role:: red

.. role:: darkorange

If you receive a notification that a JAWS run has :red:`failed`, this step-by-step guide explains what to do next.

Inspect the `jaws log`
======================

The first step in debugging a failed JAWS run is to inspect the output of the :bash:`jaws log` command. This command shows which stage of the run encountered an issue.

.. dropdown:: Run the `jaws log` command :ref:`🔗 `
    :color: info
    :animate: fade-in
    :name: jaws log

    Run the following command to inspect the log output for a specific run:

    .. code-block:: bash

        jaws log
        #STATUS_FROM        STATUS_TO           TIMESTAMP            COMMENT
        created             upload queued       2024-04-08 11:56:44
        upload queued       upload complete     2024-04-08 11:56:55
        upload complete     ready               2024-04-08 11:57:06
        ready               submitted           2024-04-09 14:09:47
        submitted           queued              2024-04-09 14:09:58
        queued              running             2024-04-09 14:10:42
        running             succeeded           2024-04-16 19:02:43
        succeeded           complete            2024-04-16 19:03:00
        complete            finished            2024-04-16 19:03:10
        finished            download succeeded  2024-04-16 19:03:22
        download succeeded  fix perms queued    2024-04-16 19:03:33
        fix perms queued    fix perms complete  2024-04-16 19:03:43
        fix perms complete  sync complete       2024-04-16 19:03:53
        sync complete       slack succeeded     2024-04-16 19:04:04
        slack succeeded     done                2024-04-16 19:04:14

    The log displays the sequence of transitions between the stages of the run, with timestamps and comments that help you identify where the issue occurred.

.. dropdown:: Understanding the JAWS Run Stages :ref:`🔗 `
    :color: info
    :animate: fade-in
    :name: Understanding the JAWS Run Stages

    Below is a breakdown of the typical stages of a successful JAWS run:

    1. **created → upload queued**: The run is created and input data is queued for transfer to the compute site using Globus.
    2. **upload queued → upload complete**: Data transfer is completed successfully.
    3. **upload complete → ready**: The compute site has received the input data and is preparing to start the run.
    4. **ready → submitted**: The run has been successfully submitted to Cromwell for task execution.
    5. **submitted → queued**: Tasks are submitted to the compute cluster (HTCondor) and queued for execution.
    6. **queued → running**: Tasks are actively being processed on the compute site.
    7. **running → succeeded**: The tasks have completed successfully on Cromwell.
    8. **succeeded → complete**: The run is finished.
    9. **complete → finished**: The run is fully complete, and JAWS is updating `jaws tasks`.
    10. **finished → download queued**: Output data is queued for transfer to the `JAWS Teams Directory`.
    11. **download queued → download complete**: Output data transfer is complete.
    12. **download complete → fix perms complete**: File permissions are verified and adjusted if necessary.
    13. **fix perms complete → sync complete**: The sync step has completed.
    14. **sync complete → slack succeeded**: The Slack notification has been sent.
    15. **slack succeeded → done**: The run is fully complete.

:darkorange:`Non-Cromwell Common Failure Scenarios`
===================================================

.. dropdown:: Scenario 1: Failed at Globus Transfer :ref:`🔗 `
    :color: info
    :animate: fade-in
    :name: Failed at Globus Transfer

    After your run is created, JAWS transfers the input data from the input site to the compute site using Globus. If an error occurs in this stage, you might see something like the following (other Globus issues can also occur and produce different error messages):

    .. code-block:: text

        jaws log
        #STATUS_FROM        STATUS_TO           TIMESTAMP            COMMENT
        created             upload failed       2024-08-13 16:08:15  No transfer method known for Transfer 3258
        upload failed       slack succeeded     2024-08-13 16:08:25
        slack succeeded     done                2024-08-13 16:08:35

    **Explanation:** This error message indicates that the job failed during the transfer of input data from the input site to the compute site.
    **Relevant Error Files:** In this case, no `cromwell-executions` folder was created, so the only error message is the one shown by the `jaws log` command.

    **Action:** This could be caused by issues with the Globus endpoint or by network instability. Contact the JAWS team on Slack (#jaws channel) and provide the log details and the RUN_ID so that they can help you resolve it.

.. dropdown:: Scenario 2: Failed at submitting to Cromwell :ref:`🔗 `
    :color: info
    :animate: fade-in
    :name: Failed at submitting to Cromwell

    Once the input data is transferred, JAWS Site submits the run to Cromwell. If an error occurs, you might see something like:

    .. code-block:: text

        jaws log
        #STATUS_FROM        STATUS_TO           TIMESTAMP            COMMENT
        created             upload queued       2024-07-01 15:23:10
        upload queued       upload complete     2024-07-01 15:23:10
        upload complete     submission failed   2024-07-01 15:23:43  Server timeout: The service is unable to respond at this time; please try again later.
        submission failed   slack succeeded     2024-07-01 15:24:03
        slack succeeded     done                2024-07-01 15:24:19

    **Explanation:** This log shows that the job failed during the submission to Cromwell, which indicates a temporary issue with the server, possibly due to high load or network problems.

    **Relevant Error Files:** In this case, no `cromwell-executions` folder was created, so the only error message is the one shown by the `jaws log` command.

    **Action:** Contact the JAWS team for further assistance. You can reach them on Slack in the #jaws channel.

.. dropdown:: Scenario 3: Failed at Transferring the Outputs :ref:`🔗 `
    :color: info
    :animate: fade-in
    :name: Failed at Transfer the Ouputs

    Another error can occur while JAWS is transferring the output data to the JAWS Teams Directory (find more information :doc:`here `). JAWS also uses Globus for this transfer, and you might find a message like the following:

    .. code-block:: text

        jaws log
        #STATUS_FROM        STATUS_TO           TIMESTAMP            COMMENT
        created             upload queued       2024-04-08 11:56:44
        upload queued       upload complete     2024-04-08 11:56:55
        upload complete     ready               2024-04-08 11:57:06
        ready               submitted           2024-04-09 14:09:47
        submitted           queued              2024-04-09 14:09:58
        queued              running             2024-04-09 14:10:42
        running             succeeded           2024-04-16 19:02:43
        succeeded           complete            2024-04-16 19:03:00
        complete            finished            2024-04-16 19:03:10
        finished            download failed     2024-04-16 19:03:22  ('POST', 'https://transfer.api.globus.org/v0.10/transfer', 'Bearer', 502, 'ExternalError', "Error validating login to endpoint 'NERSC Perlmutter jaws Collab (5b869795-a6f8-4a87-9272-7c1851c25033)', Error (connect)\nEndpoint: NERSC Perlmutter jaws Collab (5b869795-a6f8-4a87-9272-7c1851c25033)\nServer: 128.55.64.33:443\nMessage: The operation timed out\n", 'vmKIUrP6K')
        download failed     fix perms queued    2024-04-16 19:03:33
        fix perms queued    fix perms complete  2024-04-16 19:03:43
        fix perms complete  sync complete       2024-04-16 19:03:53
        sync complete       slack succeeded     2024-04-16 19:04:04
        slack succeeded     done                2024-04-16 19:04:14

    **Explanation:** The run succeeded, but the output data transfer to the JAWS Teams Directory failed.

    **Relevant Error Files:** In this case, the `cromwell-executions` folder was created. However, the error occurred after the Cromwell execution had finished, so the error message is available only through the `jaws log` command.

    **Action:** This is likely a network or Globus issue. Run `jaws download` to attempt the transfer again. If the problem persists, contact the JAWS team.

:darkorange:`Cromwell Common Failure Scenarios`
===============================================

.. dropdown:: Scenario 4: Failed at Cromwell Execution :ref:`🔗 `
    :color: info
    :animate: fade-in
    :name: Failed at Cromwell Execution

    **Explanation:** The Cromwell execution failed, which means one or more tasks did not complete successfully. Example:

    .. code-block:: text

        jaws log
        #STATUS_FROM        STATUS_TO           TIMESTAMP            COMMENT
        created             upload queued       2024-08-13 15:56:48
        upload queued       upload complete     2024-08-13 15:58:30
        upload complete     ready               2024-08-13 15:58:43
        ready               submitted           2024-08-13 15:58:46
        submitted           queued              2024-08-13 15:59:02
        queued              running             2024-08-13 15:59:47
        running             failed              2024-08-13 16:02:15  Cromwell execution failed
        failed              complete            2024-08-13 16:02:47
        complete            finished            2024-08-13 16:03:00
        finished            download skipped    2024-08-13 16:03:05
        download skipped    slack succeeded     2024-08-13 16:03:15
        slack succeeded     done                2024-08-13 16:03:25

    **Action:** This case needs further investigation. :ref:`Let's explore the other JAWS Commands `.

.. _cromwell-execution-failed-what-to-do-next:

Cromwell Execution Failed: What to do next?
+++++++++++++++++++++++++++++++++++++++++++

If the :bash:`jaws log` indicates that the **Cromwell execution failed**, the next step is to investigate the specific tasks that failed. Use the :bash:`jaws tasks` command, which reports the status of each task within the workflow.

Inspect the `jaws tasks`
++++++++++++++++++++++++

The `jaws tasks` command lists the tasks that were executed, including their names, statuses, and return codes.

.. dropdown:: Run the `jaws tasks` command :ref:`🔗 `
    :color: info
    :animate: fade-in
    :name: jaws tasks

    .. code-block:: bash

        jaws tasks
        #TASK_DIR   JOB_ID  STATUS  QUEUE_START          RUN_START            RUN_END              QUEUE_MIN  RUN_MIN  CACHED  TASK_NAME               REQ_CPU  REQ_GB  REQ_MIN  CPU_HRS  RETURN_CODE
        call-task1  532741  failed  2024-09-13 12:03:43  2024-09-13 12:10:43  2024-09-13 12:10:46  7          0        False   runblastplus_sub.task1  1        1       20       0.0      2

**Key steps to troubleshoot jaws status:**

.. dropdown:: Scenario 1: All tasks succeeded, but the run status reports failed :ref:`🔗 `
    :color: info
    :animate: fade-in
    :name: Scenario 1

    **Explanation:** In some cases, `jaws tasks` may report that all tasks succeeded, but the overall run still failed.
    **Possible Cause:** This can happen because of external issues unrelated to the task execution, for example when Cromwell was unable to find an expected output file.

    **Relevant Error Files:** Inspect the `error.json` file for more details.

    .. dropdown:: How to find the `error.json` file? :ref:`🔗 `
        :color: info
        :animate: fade-in
        :name: error json

        - Run the `jaws status ` command:

          - First, you need to find the output directory for your run. You can do this by running the `jaws status ` command:

            .. code-block:: bash

                jaws status 86698
                {
                    "compute_site_id": "perlmutter",
                    "cpu_hours": 0.0,
                    "cromwell_run_id": "e2a3b977-0d73-4478-8613-56e601d166ce",
                    "id": 86698,
                    "input_site_id": "perlmutter",
                    "json_file": "/global/u1/d/dcassol/JAWS/jaws-tutorial-examples/5min_example/inputs.json",
                    "output_dir": "/pscratch/sd/j/jaws/perlmutter-prod/dsi-aa/dcassol/86698/e2a3b977-0d73-4478-8613-56e601d166ce",
                    "result": "failed",
                    "status": "done",
                    "status_detail": "The run is complete.",
                    "submitted": "2024-09-19 14:54:58",
                    "tag": null,
                    "team_id": "dsi-aa",
                    "updated": "2024-09-19 15:06:13",
                    "user_id": "dcassol",
                    "wdl_file": "/global/u1/d/dcassol/JAWS/jaws-tutorial-examples/5min_example/align_final.wdl",
                    "workflow_name": "bbtools",
                    "workflow_root": "/pscratch/sd/j/jaws/perlmutter-prod/cromwell-executions/bbtools/e2a3b977-0d73-4478-8613-56e601d166ce"
                }

          - Look for the `output_dir` field in the command output. This field gives the path to the directory where the output and error files are stored.
          - Navigate to the `output_dir` directory to locate the `error.json` file. This file contains detailed information about the errors that occurred during execution, extracted from the Cromwell Metadata logs.

    **Action:** If the `error.json` file reports missing output files, there are two common reasons:

    1. Filesystem Instability: The task generated the output file, but Cromwell couldn't find it due to a temporary filesystem issue.

       Solution: Use the `jaws resubmit` command. This will leverage task caching, meaning the previous outputs will be reused, and the resubmission should succeed. If the issue persists, contact the JAWS team for further assistance.

    2. Output File Not Generated: The task didn't create the output file, even though the task itself returned a success code (return code `0`).

       Solution: Inspect the `stderr` file to understand why the file wasn't created despite the successful execution. This may indicate an issue with the `command stanza` not catching an exception, a problem with the input data, or both.

.. dropdown:: Scenario 2: `jaws tasks` shows a failed task :ref:`🔗 `
    :color: info
    :animate: fade-in
    :name: Scenario 2

    Start by checking which return code the task returned. Go to the next topic to understand the :ref:`Return Codes `.

    .. dropdown:: How to Get to the `cromwell-execution` Folder for the Failed Task? :ref:`🔗 `
        :color: info
        :animate: fade-in
        :name: cromwell execution folder

        If the input site is the same as the compute site, you can access the `workflow_root` folder directly.

        If the input site is different from the compute site, you need to download the failed cromwell-executions folders to your input site using the `jaws download` command.

        .. code-block:: bash

            jaws download 86698
            {
                "download_id": 40351,
                "id": 86698,
                "status": "download queued"
            }

        Once the download completes, you can locate the `cromwell-executions` folder inside the JAWS Teams directory (`output_dir`).

        Within the `cromwell-executions` folder, you will find important logs such as the `stderr`, `stdout`, and `rc` (return code) files. These logs provide detailed information about task failures.

.. _inspect-cromwell-return-codes:

Inspect Cromwell Return Codes
+++++++++++++++++++++++++++++

For each task or shard executed in Cromwell, a return code is assigned that indicates whether the task succeeded or failed.
These return codes help us understand the status of individual tasks within a workflow and provide insight into potential issues that require troubleshooting.

.. dropdown:: How to get to the Return code? :ref:`🔗 `
    :color: info
    :animate: fade-in
    :name: How to get to the Return code

    You can inspect return codes using the `jaws tasks` command, which provides a `RETURN_CODE` column for each task. Additionally, each Cromwell execution folder contains an `rc` file that stores the return code of the corresponding task.

    This guide focuses on understanding these return codes, with particular emphasis on `return code 79`, which is commonly used by Cromwell.

.. dropdown:: Common Return Codes and Their Meanings :ref:`🔗 `
    :color: info
    :animate: fade-in
    :name: Common Return Codes and Their Meanings

    +---------+--------------------------------------------------------------------------+
    | #       | Description                                                              |
    +=========+==========================================================================+
    | **0**   | The task completed successfully.                                         |
    +---------+--------------------------------------------------------------------------+
    | **1**   | A generic error occurred during task execution. This is often caused by  |
    |         | issues in the task script. Check the `stderr` for the root cause.        |
    +---------+--------------------------------------------------------------------------+
    | **2**   | This code is used when the task encounters a specific error related to   |
    |         | incorrect input or improper usage of the task script. Check the          |
    |         | `stderr` for the root cause.                                             |
    +---------+--------------------------------------------------------------------------+
    | **255** | Typically signifies an abnormal termination, indicating that the process |
    |         | did not exit as expected.                                                |
    +---------+--------------------------------------------------------------------------+
    | **137** | This return code occurs when the task exceeds the allocated memory and   |
    |         | is terminated. Adjust the memory request in the WDL or task              |
    |         | configuration to resolve this issue.                                     |
    +---------+--------------------------------------------------------------------------+
    | **79**  | Cromwell sets return code `79` when the task is terminated by Cromwell.  |
    |         | This can happen due to several reasons, which are detailed below.        |
    +---------+--------------------------------------------------------------------------+

.. dropdown:: Understanding Return Code `79` :ref:`🔗 `
    :color: info
    :animate: fade-in
    :name: Understanding Return Code 79

    Return code `79` is a special code used by Cromwell to indicate that the task was terminated by Cromwell itself.

    .. admonition:: According to Cromwell's code comments:

        If SIGTERM, SIGKILL or SIGINT codes are used, cromwell will assume the job has been aborted by cromwell. Since it is arbitrary which code is chosen from that range, and it has to relate with the unpleasant business of 'killing'. 79 was chosen.

    This can happen for several reasons, including file system instability, container issues, or the failure to generate expected output files.

    .. dropdown:: **Container Image Not Found (*.sif file not found)**:
        :color: info
        :animate: fade-in

        - **Symptom**: The task fails with return code `79` because the container image file (e.g., .sif file) was not found during execution.
        - **Cause**: File system instability while pulling the image. The task will not start running, and Cromwell will set return code `79`.
        - **Action**: Check the `stderr.submit` file for error messages. Use `jaws resubmit` to retry the run. If the issue persists, contact the JAWS team for further investigation.

    .. dropdown:: **File System Instability and Retry Logic in JAWS**:
        :color: info
        :animate: fade-in

        - **Symptom**: The task is terminated with return code `79` due to file system instability during the script execution.
        - **Action**: JAWS automatically triggers a retry mechanism when the first return code is `79`.
        - **Example**:

          .. code-block:: bash

              cat cromwell-executions//79272074-ae28-40cd-8715-cb109a73b6e2/call-helloWDL/execution/stderr.submit
              2024-09-15 20:48:55: ERROR: task execution failed with the return code 79 that may be caused by a system issue. Retrial: 1

              cat cromwell-executions//79272074-ae28-40cd-8715-cb109a73b6e2/call-helloWDL/execution/rc.1
              79

              cat cromwell-executions//79272074-ae28-40cd-8715-cb109a73b6e2/call-helloWDL/execution/stdout.submit
              Submitting job(s).
              1 job(s) submitted to cluster 1711957.
              2024-09-15 20:49:56: INFO: task execution successful after 2 attempt(s). Return code = 0

    .. dropdown:: **Expected Output File Not Found**:
        :color: info
        :animate: fade-in

        - **Symptom**: Cromwell cannot find the expected output file after the task execution, and the return code `79` is set in the `rc` file. You will find the error message in the `error.json` file.
        - This can happen in two scenarios:

          - **Expected files were generated**: File system instability prevented Cromwell from confirming the output files, resulting in return code `79`. Usually, the return code from the script was `0`, and because Cromwell was not able to confirm the output files, it set the return code to `79`. **Action**: In this case, a `jaws resubmit ` will usually resolve the issue. Please confirm that the expected output files were generated before resubmitting the run.
          - **Expected files were not generated**: The script failed to create the output files. Cromwell may have set return code `79` because it couldn't find the output, even though the script returned `0`. **Action**: Investigate the `stderr` and the script's logic. The cause may be an uncaught exception. Fixing it might require editing the WDL or input data, followed by a `jaws submit` to rerun the workflow.

:darkorange:`WDL Validation Issues`
===================================

.. dropdown:: Scenario: `jaws.validate()` fails with ShellCheck warning :ref:`🔗 `
    :color: info
    :animate: fade-in
    :name: shellcheck-validation

    **Explanation:** When you run :bash:`jaws.validate(shell_check=True, wdl_file="your_workflow.wdl")`, JAWS attempts to validate the shell commands inside each WDL task using `shellcheck`. If `shellcheck` is not installed on your system (and you're not using the container version of JAWS), you may see:

    .. code-block:: text

        * Suggestion: install shellcheck (www.shellcheck.net) to check task commands. (--suppress CommandShellCheck suppresses this message)

    **Solutions**

    - **Install ShellCheck**

      To resolve this, install `shellcheck` on your system. If you are using a conda environment, you can install `shellcheck` with the following command:

      .. code-block:: bash

          conda install -c bioconda shellcheck

      You can verify the installation with:

      .. code-block:: bash

          shellcheck --version

    - **Suppressing the Warning:**

      If you prefer to skip `ShellCheck` validation, simply use:

      .. code-block:: python

          jaws.validate(shell_check=False)
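
The choice between the two solutions above can also be made programmatically. The following is a minimal sketch, not part of JAWS: it uses Python's standard ``shutil.which`` to test whether a ``shellcheck`` executable is on the ``PATH`` and returns a boolean that could be passed as the ``shell_check`` argument. The helper name ``shellcheck_available`` is hypothetical, and the commented-out call assumes the same ``jaws`` object used in the examples above.

.. code-block:: python

    import shutil


    def shellcheck_available(which=shutil.which):
        """Return True if a `shellcheck` executable is found on the PATH.

        `which` is injectable so the check can be exercised without
        depending on what is installed locally (hypothetical helper).
        """
        return which("shellcheck") is not None


    if __name__ == "__main__":
        use_shellcheck = shellcheck_available()
        print(f"shell_check={use_shellcheck}")
        # Assumption: `jaws` is the object used in the examples above.
        # jaws.validate(shell_check=use_shellcheck, wdl_file="your_workflow.wdl")

With this approach the ShellCheck validation runs only where the tool is actually installed, and the suggestion message is avoided elsewhere.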