JAWS Performance Metrics

JAWS tracks per-task performance metrics by analyzing HTCondor logs collected across all compute sites. These logs are forwarded to a centralized Elasticsearch backend, where a dedicated service processes them and computes various metrics. The results are visualized on the JAWS performance dashboard.

Stored Metrics

The dashboard captures key metrics including memory usage, CPU efficiency, and wallclock time for each executed task.

Example view of stored metrics

Memory Metrics

Formula:

The memory efficiency of each task is calculated as:

Memory Efficiency = (Memory Usage / Requested Memory) × 100
Memory metrics panel
  • Memory Usage: The peak memory usage of the job, reported by HTCondor as an integer in megabytes (MB).

  • Requested Memory: The amount of memory requested for the task in the WDL runtime block.
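
The memory efficiency formula above can be sketched in a few lines of Python. This is a minimal illustration, not the code the JAWS metrics service runs: the function name `memory_efficiency` and its parameter names are hypothetical, both values are assumed to be in MB, and returning `None` when the requested memory is missing or zero is an assumption made here.

```python
from typing import Optional


def memory_efficiency(memory_usage_mb: float,
                      requested_memory_mb: float) -> Optional[float]:
    """Return memory efficiency as a percentage.

    Both arguments are assumed to be in MB. Returns None when the
    requested memory is zero or negative (an assumption of this sketch),
    because the ratio is undefined in that case.
    """
    if requested_memory_mb <= 0:
        return None
    return memory_usage_mb / requested_memory_mb * 100


# A task that peaked at 2048 MB after requesting 8192 MB
print(memory_efficiency(2048, 8192))  # 25.0
```

A value well below 100 suggests the task requested more memory than it needed; a value above 100 means the peak usage exceeded the request.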

CPU Metrics

CPU efficiency is calculated as:

CPU Efficiency = (Effective CpusUsage / Requested CPUs) × 100
CPU metrics panel
  • CpusUsage: HTCondor defines CpusUsage as follows:

“A cpu-bound, single-threaded job will have a CpusUsage of 1.0. A job that is blocked on I/O for half of its life and is cpu bound for the other have will have a CpusUsage of 0.5. A job that uses two cores fully will have a CpusUsage of 2.0. Jobs with unexpectedly low CpusUsage may be showing lowered throughput due to blocking on network or disk.”

  • However, we found that this value is missing for some tasks and is undefined for very short jobs.

  • We therefore calculate Effective CpusUsage, a metric that closely approximates CpusUsage:

Effective CpusUsage = (RemoteSysCpu + RemoteUserCpu) / CommittedTime
  • RemoteSysCpu is the total number of seconds of system CPU time (the time spent in system calls) the job used.

  • RemoteUserCpu is the total number of seconds of user CPU time the job used.

  • CommittedTime is the number of seconds of wall clock time that the job has been allocated a machine.

  • These values are sourced from the OS’s rusage data and are more reliable than CpusUsage.
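
The two CPU formulas above can be combined as in the following sketch. It is illustrative only and is not taken from the JAWS metrics service: the function names `effective_cpus_usage` and `cpu_efficiency` are hypothetical, the inputs are assumed to be the HTCondor attributes described above (all times in seconds), and returning `None` for a zero CommittedTime or a zero CPU request is an assumption made here to avoid dividing by zero.

```python
from typing import Optional


def effective_cpus_usage(remote_sys_cpu: float,
                         remote_user_cpu: float,
                         committed_time: float) -> Optional[float]:
    """Average number of cores kept busy over the job's committed time.

    All arguments are in seconds. Returns None when committed_time is
    zero or negative (an assumption of this sketch).
    """
    if committed_time <= 0:
        return None
    return (remote_sys_cpu + remote_user_cpu) / committed_time


def cpu_efficiency(effective_usage: Optional[float],
                   requested_cpus: float) -> Optional[float]:
    """CPU efficiency as a percentage of the requested CPUs."""
    if effective_usage is None or requested_cpus <= 0:
        return None
    return effective_usage / requested_cpus * 100


# 120 s of system time plus 3480 s of user time over 1800 s of
# committed time is two fully used cores on average.
usage = effective_cpus_usage(120, 3480, 1800)
print(usage)                     # 2.0
print(cpu_efficiency(usage, 4))  # 50.0
```

In this example the task requested 4 CPUs but kept only 2 busy on average, giving a CPU efficiency of 50%.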

Wallclock Time

Wallclock time is the cumulative number of seconds the job has been allocated a machine.