In this article I list the metrics and alerts one should have in place when monitoring a GPU cluster to ensure it is used efficiently.
GPU cluster monitoring is critical for organizations that want to make the best use of the limited capacity they have.
Without it, users can easily leave jobs running that do not use their allocated GPUs at all, or do not use them efficiently.
Some clusters also rely on technologies such as InfiniBand that require users to provide images built with specific libraries, and missing those dependencies can result in significantly worse compute performance.
The metrics worth collecting are listed below; a minimal collection sketch follows the list.
- Allocated GPUs
- Used to determine who (or which project) has GPUs allocated (i.e., currently assigned to a running workload)
- GPU utilization
- Used to determine whether the GPU is partially or fully used and, if it is only partially used, to help identify the causes
- GPU memory utilization
- Used to determine if the GPU memory is partially or fully used
- Used to identify out-of-memory issues and potential memory leaks
- InfiniBand receive/transmit bytes
- Used to determine whether a workload is actually making use of the InfiniBand interconnect
- Job launch wait duration
- Used to detect when jobs queue up because compute is exhausted, and how long they wait before starting
- Job duration
- Used to gather statistics about the types of workloads running on the cluster in order to make informed capacity and scheduling decisions
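Most of these metrics can be scraped per GPU with an exporter such as NVIDIA's DCGM exporter, or queried directly through NVML. The snippet below is a minimal sketch of the latter using the nvidia-ml-py (pynvml) package; the function name and output format are illustrative, not part of any particular monitoring stack.

```python
import pynvml

def collect_gpu_metrics():
    """Return per-GPU utilization and memory utilization samples."""
    pynvml.nvmlInit()
    try:
        samples = []
        for i in range(pynvml.nvmlDeviceGetCount()):
            handle = pynvml.nvmlDeviceGetHandleByIndex(i)
            util = pynvml.nvmlDeviceGetUtilizationRates(handle)  # percentages
            mem = pynvml.nvmlDeviceGetMemoryInfo(handle)         # bytes
            samples.append({
                "gpu_index": i,
                "gpu_utilization_pct": util.gpu,
                "gpu_memory_utilization_pct": 100.0 * mem.used / mem.total,
            })
        return samples
    finally:
        pynvml.nvmlShutdown()

if __name__ == "__main__":
    for sample in collect_gpu_metrics():
        print(sample)
```

In a real cluster these readings would be exported with labels identifying the node, the job, and the user or project, so they can be joined with the allocation and queueing metrics above.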
Based on these metrics, the following alerts help catch inefficient usage; a sketch of the threshold checks follows the list.
- Allocated GPUs left unused
- Used to detect jobs that request multiple GPUs but end up using only one or a few of them
- GPU utilization below threshold (<10%)
- Used to detect workloads that do not make full use of the GPU, or that run on a GPU that is oversized for them
- GPU utilization above threshold (>90%)
- Used to detect when the GPU is saturated
- GPU utilization range above threshold (>25%)
- Used to detect an uneven distribution of compute across the GPUs allocated to a job, where the range is the gap between the most and least utilized GPU
- GPU memory utilization below threshold (<10%)
- Used to detect workloads that do not make full use of the GPU memory, or that run on a GPU that is oversized for them
- GPU memory utilization above threshold (>95%)
- Used to detect when a job is about to run out of GPU memory
- InfiniBand receive/transmit bytes stay at 0 while running multi-node workloads
- Used to identify workloads that are not properly configured to use InfiniBand
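In practice these alerts would typically be expressed as alerting rules in the monitoring system (for example over Prometheus metrics exported by DCGM), but the threshold logic is simple enough to sketch directly. The following is an illustrative example only; the function name, input shape, and message texts are made up, and the thresholds are the ones listed above.

```python
from statistics import mean

def evaluate_job_alerts(gpu_util_pct, gpu_mem_pct):
    """Evaluate the alert thresholds for one job.

    gpu_util_pct / gpu_mem_pct: dict mapping GPU index to a list of
    percentage samples collected over the evaluation window.
    """
    alerts = []
    avg_util = {g: mean(v) for g, v in gpu_util_pct.items()}
    avg_mem = {g: mean(v) for g, v in gpu_mem_pct.items()}

    if any(u < 10 for u in avg_util.values()):
        alerts.append("GPU utilization below 10%: GPU underused or oversized for the job")
    if any(u > 90 for u in avg_util.values()):
        alerts.append("GPU utilization above 90%: GPU saturated")
    if max(avg_util.values()) - min(avg_util.values()) > 25:
        alerts.append("GPU utilization range above 25%: uneven load across allocated GPUs")
    if any(m < 10 for m in avg_mem.values()):
        alerts.append("GPU memory utilization below 10%: GPU memory largely unused")
    if any(m > 95 for m in avg_mem.values()):
        alerts.append("GPU memory utilization above 95%: job may be about to run out of memory")
    return alerts

# Example: a 2-GPU job where only one GPU does real work.
print(evaluate_job_alerts(
    gpu_util_pct={0: [95, 92, 97], 1: [3, 1, 2]},
    gpu_mem_pct={0: [80, 82, 81], 1: [4, 4, 5]},
))
```

The alerts for unused allocated GPUs and for missing InfiniBand traffic follow the same pattern, except that they compare the per-GPU readings against the job's allocation and, for multi-node jobs, against the InfiniBand receive/transmit counters.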