========================
JAWS Performance Metrics
========================

.. role:: bash(code)
   :language: bash

JAWS tracks performance metrics by analyzing HTCondor history logs collected across all compute sites. These logs are shipped via Filebeat to a centralized Elasticsearch backend, where they are indexed and made searchable.

This document explains the key metrics shown on the JAWS Dashboard and how they can be used to assess compute and memory resource utilization.

Units
-----

1. ``Cores``: Logical compute cores (or threads).
2. ``Seconds``: Wall-clock time that elapses while the job runs.
3. ``Core-seconds``: Total time that all cores combined spent actively working on the job.

   **Example:** A job that keeps 2 cores busy for 60 seconds accumulates 120 core-seconds.

4. ``GB``: Gigabytes of memory used.

Metrics
-------

1. ``RequestCores``

   - **Unit:** Cores (Integer)
   - **Definition:** Logical cores (or threads) requested by the user.

2. ``CommittedSec``

   - **Unit:** Seconds (Integer)
   - **Definition:** The total wall-clock time the job spent on its successful run on a machine (i.e., the run that finished without failing).

     - Includes active core time, idle time, I/O waits, etc.
     - Excludes time spent on failed attempts, on retries, and in the queue.

3. ``ActiveComputeSec``

   - **Unit:** Core-seconds (Integer)
   - **Definition:** Total time that cores spent actively computing for the job.

4. ``AvgComputeCores``

   - **Unit:** Cores (Float)
   - **Definition:** An estimate of how many cores were utilized on average during a job's final, successful run.
   - **Formula:** ``AvgComputeCores = ActiveComputeSec / CommittedSec``

   **Example 1:** Mixed Concurrency Over Time

   * A job used **1 core for 60 seconds**, then **2 cores for 60 seconds**.
   * ``ActiveComputeSec`` = (1 core × 60 s) + (2 cores × 60 s) = **180 core-seconds**
   * ``CommittedSec`` = 120 seconds
   * ``AvgComputeCores`` = 180 / 120 = **1.5 cores**
   * *Accurate representation of concurrency*

   **Example 2:** Consistent High Utilization

   * A job used **4 cores continuously for 90 seconds**.
   * ``ActiveComputeSec`` = 4 × 90 = **360 core-seconds**
   * ``CommittedSec`` = 90 seconds
   * ``AvgComputeCores`` = 360 / 90 = **4.0 cores**
   * *Accurate representation of concurrency*

   **Example 3:** Burst Followed by Idle

   * A job used **4 cores for 10 seconds**, then was idle for **590 seconds**.
   * ``ActiveComputeSec`` = 4 × 10 = **40 core-seconds**
   * ``CommittedSec`` = 600 seconds
   * ``AvgComputeCores`` = 40 / 600 ≈ **0.067 cores**
   * *Despite briefly using 4 cores concurrently, the* ``AvgComputeCores`` *was very low due to the extended idle time*
   * *This does not reflect peak concurrency*

   **Example 4:** Declining Core Usage

   * A job used **4 cores for 30 seconds**, then **1 core for 90 seconds**.
   * ``ActiveComputeSec`` = (4 × 30) + (1 × 90) = 120 + 90 = **210 core-seconds**
   * ``CommittedSec`` = 120 seconds
   * ``AvgComputeCores`` = 210 / 120 = **1.75 cores**
   * *Concurrency declined over time; the average does not show when the decline happened*

   **Example 5:** Underutilization Despite High Request

   * A job **requested 4 cores**, but consistently **used only 2 cores for 300 seconds**.
   * ``ActiveComputeSec`` = 2 × 300 = **600 core-seconds**
   * ``CommittedSec`` = 300 seconds
   * ``AvgComputeCores`` = 600 / 300 = **2.0 cores**
   * *Half of the requested cores went unused, even though* ``AvgComputeCores`` *is greater than 1*

5. ``ComputeUseFactor``

   - **Unit:** Unitless
   - **Definition:** The fraction of requested cores that were actively used for computing (on average) during the job's successful run.
   - **Formula:** ``ComputeUseFactor = AvgComputeCores / RequestCores``

6. ``NonComputeSec``

   - **Unit:** Seconds (Integer)
   - **Definition:** The portion of the job's runtime not actively spent on computation, including time spent in I/O waits, sleeping, blocking, or other non-CPU-bound activities.
   - **Formula:** ``NonComputeSec = CommittedSec − ActiveComputeSec``

   .. note::
      This metric is not currently calculated but may be added in a future update.

7. A low ``ComputeUseFactor`` (i.e., low average core usage relative to ``RequestCores``) does not necessarily imply that the application code is inefficient. However, if it is consistently low, it may indicate one of the following:

   - The workload is **I/O-bound** or **memory-bound**.
   - The job is **not parallelized efficiently**.
   - ``RequestCores`` is over-provisioned, and fewer cores should be requested.

8. ``PeakMemoryGB``

   - **Unit:** GB (Float)
   - **Definition:** The peak memory used by the job during its successful run.

9. ``RequestMemoryGB``

   - **Unit:** GB (Float)
   - **Definition:** The amount of memory requested by the job.

10. ``MemoryUseFactor``

    - **Unit:** Unitless
    - **Definition:** The ratio of peak memory used to memory requested during the job's successful run.
    - **Formula:** ``MemoryUseFactor = PeakMemoryGB / RequestMemoryGB``

11. A low ``MemoryUseFactor`` suggests memory over-allocation. In such cases, users are encouraged to reduce their memory requests.

HTCondor Attribute Mapping (Optional)
-------------------------------------

.. note::
   This section is intended for users interested in understanding which HTCondor ClassAd attributes the JAWS metrics are based on. Most users can safely ignore this.

.. list-table:: HTCondor Attribute Mapping
   :header-rows: 1

   * - **JAWS Metric**
     - **HTCondor ClassAd Field**
   * - ``RequestCores``
     - ``RequestCpus``
   * - ``CommittedSec``
     - ``CommittedTime``
   * - ``ActiveComputeSec``
     - ``RemoteUserCpu`` + ``RemoteSysCpu``
   * - ``PeakMemoryGB``
     - ``MemoryUsage`` (converted to GB)
   * - ``RequestMemoryGB``
     - ``RequestMemory`` (converted to GB)
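The derived metrics defined in the Metrics section can be reproduced from the raw values with a few lines of Python. The following is a minimal sketch, not the JAWS implementation: the ``JobRecord`` container and the function names are hypothetical, the inputs are assumed to be already converted to the units listed above (seconds, core-seconds, GB), and returning ``None`` for a zero denominator is an assumption made here for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class JobRecord:
    """Hypothetical container for the raw per-job values described above."""
    request_cores: int         # RequestCores
    committed_sec: float       # CommittedSec, wall-clock seconds
    active_compute_sec: float  # ActiveComputeSec, core-seconds
    peak_memory_gb: float      # PeakMemoryGB
    request_memory_gb: float   # RequestMemoryGB


def avg_compute_cores(job: JobRecord) -> Optional[float]:
    """AvgComputeCores = ActiveComputeSec / CommittedSec."""
    if job.committed_sec <= 0:
        return None  # assumption: undefined when no committed run time exists
    return job.active_compute_sec / job.committed_sec


def compute_use_factor(job: JobRecord) -> Optional[float]:
    """ComputeUseFactor = AvgComputeCores / RequestCores."""
    avg = avg_compute_cores(job)
    if avg is None or job.request_cores <= 0:
        return None
    return avg / job.request_cores


def memory_use_factor(job: JobRecord) -> Optional[float]:
    """MemoryUseFactor = PeakMemoryGB / RequestMemoryGB."""
    if job.request_memory_gb <= 0:
        return None
    return job.peak_memory_gb / job.request_memory_gb


if __name__ == "__main__":
    # Example 5 from above: 4 cores requested, 2 cores busy for 300 seconds.
    # The memory figures (3 GB peak, 8 GB requested) are made up for illustration.
    job = JobRecord(
        request_cores=4,
        committed_sec=300,
        active_compute_sec=600,
        peak_memory_gb=3.0,
        request_memory_gb=8.0,
    )
    print("AvgComputeCores: ", avg_compute_cores(job))   # 2.0
    print("ComputeUseFactor:", compute_use_factor(job))  # 0.5
    print("MemoryUseFactor: ", memory_use_factor(job))   # 0.375
```

A ``ComputeUseFactor`` of 0.5 here matches the reading of Example 5: on average, half of the requested cores did useful work, so requesting 2 cores would likely have been sufficient for this job.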