02 Aug 2025

Daily review

processes

Every day, at the end of the (work) day.

15 minutes.

  • Review what was planned for the day
  • Provide feedback related to the plan
  • Track and review
    • How many times I was interrupted
    • How much time I spent on unexpected work
    • Whether I got blocked and for how long
  • Review weekly plan and align
  • Plan next day

02 Aug 2025

Weekly review

processes

Every week, either at the beginning or end of the week.

15 minutes.

  • Review what was planned for the week
  • Provide feedback related to the plan
    • Write down what was worked on that wasn't part of the plan
  • Review monthly plan and align
  • Plan next week

In this article I list metrics and alerts one should have when monitoring a GPU cluster to ensure efficient utilization of resources.

GPU cluster monitoring is critical for organizations to optimally utilize the limited capacity they have.
Without monitoring, it is easy for users to leave jobs running that do not use GPU resources, or that use them inefficiently.
Some GPU clusters rely on technologies (for example, InfiniBand) that require users to provide images with specific libraries; omitting those dependencies can result in significantly worse compute performance.

Metrics

  • Allocated GPUs
    • Used to determine who (or which project) has GPUs allocated (i.e., currently assigned to a running workload)
  • GPU utilization
    • Used to determine whether the GPU is partially or fully used, and if it is partially used, to potentially identify the causes
  • GPU memory utilization
    • Used to determine if the GPU memory is partially or fully used
    • Used to identify out-of-memory issues and potential memory leaks
  • InfiniBand receive/transmit bytes
    • Used to determine if a workload is making use of the technology
  • Job launch wait duration
    • Used to determine when jobs are queueing because compute is exhausted, and how long it takes for jobs to start
  • Job duration
    • Used to gather statistics about the type of workload running on the cluster in order to make informed decisions
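
As a minimal sketch of how the scheduler-side metrics above (allocated GPUs, job launch wait duration, job duration) could be derived, assume each job is available as a simple record with submission, start, and finish timestamps. The `JobRecord` type, its field names, and the helper functions are hypothetical and only illustrate the calculations; a real scheduler exposes its own fields.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict, Iterable, Optional


@dataclass
class JobRecord:
    """Hypothetical job record; field names are assumptions for illustration."""
    job_id: str
    project: str
    gpus_requested: int
    submitted_at: datetime
    started_at: Optional[datetime] = None   # None while the job is still queued
    finished_at: Optional[datetime] = None  # None while the job is still running


def allocated_gpus_by_project(jobs: Iterable[JobRecord], now: datetime) -> Dict[str, int]:
    """Count GPUs currently assigned to running jobs, grouped by project."""
    allocated: Counter = Counter()
    for job in jobs:
        started = job.started_at is not None and job.started_at <= now
        finished = job.finished_at is not None and job.finished_at <= now
        if started and not finished:
            allocated[job.project] += job.gpus_requested
    return dict(allocated)


def launch_wait(job: JobRecord, now: datetime) -> timedelta:
    """Time from submission to start, or to `now` if the job is still queued."""
    return (job.started_at or now) - job.submitted_at


def job_duration(job: JobRecord) -> Optional[timedelta]:
    """Run time of a finished job; None if it has not started or not finished."""
    if job.started_at is None or job.finished_at is None:
        return None
    return job.finished_at - job.started_at
```

GPU utilization, GPU memory utilization, and InfiniBand byte counters come from the nodes themselves rather than from the scheduler, so they are left out of this sketch.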

Alerts

  • Allocated GPUs are used
    • Used to detect jobs that may request multiple GPUs but end up using only one or a few of them
  • GPU utilization below threshold (<10%)
    • Used to detect workloads that do not make full use of the GPU or are allocated to an oversized GPU
  • GPU utilization above threshold (>90%)
    • Used to detect when the GPU is saturated
  • GPU utilization range above threshold (>25%)
    • Used to detect uneven distribution of GPU compute workload
  • GPU memory utilization below threshold (<10%)
    • Used to detect workloads that do not make full use of the GPU memory or are allocated to an oversized GPU
  • GPU memory utilization above threshold (>95%)
    • Used to detect when a job is about to run out of GPU memory
  • InfiniBand receive/transmit > 0 when running multi-node workloads
    • Used to identify workloads that are not properly configured to use InfiniBand
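
The alert conditions above can be sketched as a single function that evaluates one job over a time window. This is an illustration under stated assumptions, not a definitive implementation: the `JobSnapshot` type, its fields, and the alert names are hypothetical; a GPU is treated as unused when its average utilization over the window is 0%; and the "utilization range" is taken to be the maximum minus the minimum utilization across the job's GPUs. The default thresholds are the ones listed above.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class JobSnapshot:
    """Hypothetical per-job summary over a time window (names are assumptions)."""
    job_id: str
    gpu_utilization: List[float]         # percent, one value per allocated GPU
    gpu_memory_utilization: List[float]  # percent, one value per allocated GPU
    node_count: int
    infiniband_bytes: int                # receive + transmit bytes over the window


def evaluate_alerts(
    snap: JobSnapshot,
    util_low: float = 10.0,
    util_high: float = 90.0,
    util_range: float = 25.0,
    mem_low: float = 10.0,
    mem_high: float = 95.0,
) -> List[str]:
    """Return the names of the alert conditions that fire for this job."""
    alerts: List[str] = []
    util = snap.gpu_utilization
    mem = snap.gpu_memory_utilization

    if util:
        # Assumption: 0% average utilization means the GPU is allocated but unused.
        if any(u == 0 for u in util):
            alerts.append("allocated_gpu_unused")
        if any(u < util_low for u in util):
            alerts.append("gpu_utilization_low")
        if any(u > util_high for u in util):
            alerts.append("gpu_utilization_high")
        # Assumption: range = max - min across the GPUs of the same job.
        if max(util) - min(util) > util_range:
            alerts.append("gpu_utilization_uneven")

    if mem:
        if any(m < mem_low for m in mem):
            alerts.append("gpu_memory_low")
        if any(m > mem_high for m in mem):
            alerts.append("gpu_memory_high")

    # Multi-node jobs are expected to move some traffic over InfiniBand.
    if snap.node_count > 1 and snap.infiniband_bytes == 0:
        alerts.append("infiniband_idle")

    return alerts
```

In practice these conditions would more likely live as rules in the monitoring system itself; the function form simply makes each threshold and its comparison explicit.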

14 May 2025

Quotes

quotes fully-ai-generated llm=chatgpt-4o

If the behavior is chronic and unproductive, decide how much engagement is worthwhile.

Sometimes planting a seed is better than trying to change their mind in the moment.

Some people are resistant to new perspectives. If they refuse to engage, focus on managing your own reaction rather than changing theirs.

Choose your battles: Not every conversation is worth having. Consider whether it's worth investing time and energy into trying to change someone's mind.

21 Feb 2025

Learning - 2025

learning

  • Model context protocol (MCP)
  • dstack
  • Nebius
  • Multi-node training and inference
  • DeepSpeed
  • Claude Code

  • CometML
  • How to use LLMs more in my development process
  • ML model profiling and optimization
  • GCP GCS performance profiling
  • NVIDIA MPS
  • NVIDIA MIG
  • NVIDIA time-slicing
  • NVIDIA vGPU
  • NVIDIA KAI Scheduler
  • NVIDIA Run:ai