Ouroboros is an autonomous AI agent that works on tasks provided to it while continuously improving itself.
Named after the ancient symbol of a serpent eating its own tail - representing infinity and cyclic renewal - Ouroboros implements a continuous loop of Do → Learn → Improve → Retry.

Unlike traditional AI assistants that wait for commands and forget context between sessions, Ouroboros:

  • Runs indefinitely without human intervention
  • Maintains persistent memory of everything it has done
  • Reflects on its performance regularly
  • Modifies its own code to improve over time
  • Can incorporate human feedback when provided

Previously I wrote about GlobaLLM, an AI agent that autonomously contributes to open source projects.
While GlobaLLM's primary objective is project and task prioritization at scale, Ouroboros focuses on task implementation and self-improvement.
Ouroboros is thus a component of GlobaLLM's solution.

Ouroboros follows a structured nine-step cycle that repeats continuously:

  1. Read goals – Fetches tasks from agent/goals/active.md
  2. Select goal – Picks one to work on (or defaults to self-improvement)
  3. Plan – Uses an LLM to create a step-by-step plan
  4. Execute – Carries out the plan using available tools
  5. Journal – Writes results to a daily log
  6. Reflect – Analyzes what happened and identifies improvements (both task-related and self-related)
  7. Self-modify – Edits its own source code if improvements are found
  8. Journal again – Records reflection and modification results
  9. Repeat – Starts the cycle anew

This separation between execution and self-modification is crucial.
The agent won't modify its code while working on a task - reflections and improvements happen only during dedicated reflection cycles.
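
To make the loop concrete, here is a minimal sketch of a single iteration. The llm, tools, and journal helpers are hypothetical stand-ins, not Ouroboros's actual API:

from pathlib import Path

GOALS_FILE = Path("agent/goals/active.md")

def run_cycle(llm, tools, journal):
    # 1-2. Read goals and select one (default to self-improvement)
    goals = GOALS_FILE.read_text().splitlines() if GOALS_FILE.exists() else []
    goal = goals[0] if goals else "improve Ouroboros itself"

    plan = llm.plan(goal)                        # 3. Plan with the LLM
    result = tools.execute(plan)                 # 4. Execute using available tools
    journal.note(goal, result)                   # 5. Journal the outcome

    reflection = llm.reflect(journal.today())    # 6. Reflect on what happened
    if reflection.patch is not None:             # 7. Self-modify only here,
        tools.apply_patch(reflection.patch)      #    never mid-task
    journal.reflect(reflection)                  # 8. Journal the reflection
    # 9. The caller starts the cycle anew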

┌─────────────────────────────────────────────────────────┐
│                      Agent Core                         │
│  (coordinates the loop, handles signals, manages state) │
└─────────────────────────────────────────────────────────┘
         │                    │                    │
         ▼                    ▼                    ▼
┌──────────────┐    ┌──────────────┐    ┌──────────────┐
│   Memory     │    │  LLM Layer   │    │   Tools      │
│              │    │              │    │              │
│ • Working    │    │ • Anthropic  │    │ • run_cmd    │
│ • Journal    │───▶│   Claude     │───▶│ • read_file  │
│ • Goals      │    │ • Token      │    │ • write_file │
│ • Feedback   │    │   tracking   │    │ • search_*   │
└──────────────┘    └──────────────┘    └──────────────┘

Ouroboros uses a three-tiered memory architecture:

  Tier             Description                                      Location
  Working memory   Current goals, immediate context                 In-process
  Short-term       Daily journals (notes, reflections, feedback)    agent/journal/YYYY/MM/DD/
  Long-term        Git history with descriptive commits             Git repository

Everything is logged in human-readable markdown, making it easy to inspect what the agent has been up to.
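
As a rough illustration, appending a journal entry could look like this; the helper below is mine, not the agent's actual code, but it follows the directory layout above:

from datetime import date
from pathlib import Path

def append_journal(kind: str, text: str, root: Path = Path("agent/journal")) -> Path:
    today = date.today()
    day_dir = root / f"{today:%Y}" / f"{today:%m}" / f"{today:%d}"
    day_dir.mkdir(parents=True, exist_ok=True)
    entry = day_dir / f"{kind}.md"   # notes.md, reflections.md, user-feedback.md
    with entry.open("a", encoding="utf-8") as f:
        f.write(f"\n{today.isoformat()}: {text}\n")
    return entry

# Example: append_journal("notes", "Implemented goal X; tests passing.")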

The agent comes with built-in tools for common operations:

  • run_command – Execute shell commands
  • read_file – Read file contents
  • write_file – Write to files
  • search_files – Find files by pattern
  • search_content – Search within files

Crucially, Ouroboros can create, register, and use new tools that it writes itself.

Tools are implemented as subcommands of the ouroboros CLI that the agent can invoke during execution.
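
For illustration, invoking a tool could be as simple as shelling out to the CLI; the exact subcommand layout below is an assumption, not the actual interface:

import subprocess

def invoke_tool(tool: str, *args: str) -> str:
    # e.g. invoke_tool("read_file", "agent/goals/active.md")
    result = subprocess.run(
        ["ouroboros", tool, *args],
        capture_output=True, text=True, check=True,
    )
    return result.stdout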

The execution/reflection separation prevents runaway self-modification.
The agent can only change code during a dedicated reflection phase, and all changes are committed to git with descriptive messages explaining the "why" behind each change.

Every action is logged. Want to know what the agent did? Check the daily journal:

  • agent/journal/YYYY/MM/DD/notes.md – What it did
  • agent/journal/YYYY/MM/DD/reflections.md – What it learned
  • agent/journal/YYYY/MM/DD/user-feedback.md – Human input received

Ouroboros needs no human intervention, but welcomes it.
It will happily incorporate feedback, adjust course based on user suggestions, and explain its reasoning when asked.

  1. True self-improvement – The agent can and does modify its own implementation based on reflection
  2. Persistent memory – Git commits serve as a permanent, queryable history of everything tried
  3. Graceful degradation – Failed modifications can be reverted; the agent learns and tries again
  4. Tool extensibility – New tools can be created dynamically as needs arise
  5. Idle improvement – When no goals are active, it works on making itself better

Ouroboros represents an experiment in autonomous AI agents.
Can an agent truly improve itself over time with little or no human intervention?
By maintaining a detailed journal, reflecting on its actions, and having the freedom to modify its own code, Ouroboros aims to answer this question.

The name is fitting - the serpent eating its tail represents the continuous cycle of doing, learning, and improving that drives the agent forward.
Each reflection builds on the last; each modification makes the agent slightly more capable.

Ouroboros is open source.
Check out the repository to see the code, contribute, or run your own self-improving agent.

Consider the following dilemma: you have unlimited access to state-of-the-art LLMs, but finite compute resources.

How do you maximize positive impact on the software ecosystem?

GlobaLLM is an experiment in autonomous open source contribution which attempts to address this question.

It's a system that discovers repositories, analyzes their health, prioritizes issues, and automatically generates pull requests - all while coordinating with other instances to avoid redundant work.

The core insight isn't just that LLMs can write code; it's that strategic prioritization combined with distributed execution can multiply that capability into something genuinely impactful.

This article explains how GlobaLLM works, diving into the architecture that lets it scale from fixing a single bug to coordinating across thousands of repositories.

GlobaLLM follows a five-stage pipeline:

Discover → Analyze → Prioritize → Fix → Contribute

The system begins by finding repositories worth targeting.

Using GitHub's search API, it filters by domain, language, stars, and other criteria.

The current methodology uses domain-based discovery with predefined domains (ai_ml, web_dev, data_science, cloud_devops, mobile, security, games), each with custom search queries combining relevant keywords.

The system then applies multi-stage filtering:

  1. Language filtering: Excludes non-programming languages (Markdown, HTML, CSS, Shell, etc.)
  2. Library filtering: Uses heuristics to identify libraries vs applications (checks for package files like pyproject.toml, package.json, Cargo.toml; filters out "awesome" lists and doc repos; analyzes descriptions and topics)
  3. Quality filtering: Language-specific queries include testing indicators (pytest, jest, testing)
  4. Health filtering: Applies health scores to filter out unmaintained projects
  5. Dependent enrichment: Uses libraries.io API to fetch package dependency counts for impact scoring

Results are cached locally (24hr TTL) to avoid redundant API calls and respect rate limits.
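
As a rough sketch (not GlobaLLM's actual code), a discovery pass with a 24-hour cache might look like this; the query string and cache path are illustrative:

import json, time
from pathlib import Path
import requests

CACHE = Path(".cache/discovery.json")
TTL = 24 * 3600   # 24-hour cache to respect rate limits

def discover(domain_query: str, min_stars: int = 500) -> list[dict]:
    if CACHE.exists() and time.time() - CACHE.stat().st_mtime < TTL:
        return json.loads(CACHE.read_text())
    query = f"{domain_query} stars:>={min_stars}"
    resp = requests.get(
        "https://api.github.com/search/repositories",
        params={"q": query, "sort": "stars", "per_page": 50},
        timeout=30,
    )
    resp.raise_for_status()
    repos = resp.json()["items"]
    CACHE.parent.mkdir(parents=True, exist_ok=True)
    CACHE.write_text(json.dumps(repos))
    return repos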

The goal isn't to find every repository - it's to find libraries where a contribution would matter.

Once a repository is identified, GlobaLLM performs deep analysis to determine whether contributing is worthwhile.

This gate prevents wasting resources on abandoned projects, hostile communities, or repositories where contributions won't have impact.

It calculates a HealthScore based on multiple signals:

  • Commit velocity: Is the project actively maintained?
  • Issue resolution rate: Are bugs getting fixed?
  • CI status: Does the project have passing tests?
  • Contributor diversity: Is there a healthy community?

It also computes an impact score - how many users would benefit from a fix, based on stars, forks, and dependency analysis using NetworkX.

Repositories with low health scores or minimal impact are deprioritized or skipped entirely.
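
To give a feel for it, here is a simplified health score combining those signals; the weights and normalization below are illustrative, not the actual formula:

def health_score(commits_per_week: float, issue_resolution_rate: float,
                 ci_passing: bool, contributor_count: int) -> float:
    velocity = min(commits_per_week / 10.0, 1.0)      # commit velocity, capped at 1
    resolution = issue_resolution_rate                 # already a 0-1 rate
    ci = 1.0 if ci_passing else 0.0                    # passing tests
    diversity = min(contributor_count / 20.0, 1.0)     # contributor diversity
    return 0.3 * velocity + 0.3 * resolution + 0.2 * ci + 0.2 * diversity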

The system fetches open issues from approved repositories and ranks them with a multi-factor scoring algorithm.

Each issue is analyzed by an LLM to determine:

  • Category: bug, feature, documentation, performance, security, etc.
  • Complexity: 1-10 scale (how difficult to solve)
  • Solvability: 0-1 score (likelihood of automated fix success)
  • Requirements: affected files, breaking change risk, test needs

The prioritization then combines four dimensions:

Health (weight: 1.0): Repository health adjusted for complexity.

A healthy repository with simple issues scores higher than an unhealthy repository with complex ones.

Impact (weight: 2.0): Based on stars, dependents, and watchers.

Uses log-scale normalization (stars / 50,000, dependents / 5,000).

Solvability (weight: 1.5): LLM-assessed likelihood of successful resolution.

Documentation and style issues (~0.9) rank higher than critical security issues (~0.3) because they are far easier to automate.

Urgency (weight: 0.5): Category multiplier × age × engagement.

Critical security bugs get 10× multiplier, documentation gets 1×.

The final formula:

priority = (health × 1.0) + (impact × 2.0) + (solvability × 1.5) + (urgency × 0.5)
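
In code, the scoring itself is just a weighted sum; the component scores are assumed to be pre-normalized as described above:

def priority(health: float, impact: float, solvability: float, urgency: float) -> float:
    return health * 1.0 + impact * 2.0 + solvability * 1.5 + urgency * 0.5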

Budget constraints filter the ranked list:

  • Per-repository token limit (default: 100k)
  • Per-language issue limit (default: 50)
  • Weekly token budget (default: 5M)

Results are saved to the issue store with full breakdowns for transparency.

GlobaLLM claims the highest-priority unassigned issue and generates a solution.

This is where LLMs do the heavy lifting.

The CodeGenerator class sends a structured prompt to Claude or ChatGPT with:

  • The issue title and description
  • Repository context (code style, testing framework)
  • Language-specific conventions
  • Category-specific requirements (bug vs feature vs docs)

The LLM responds with a complete solution:

  • Explanation: Step-by-step reasoning
  • File patches: Original and new content for each modified file
  • Tests: New or modified test files

The system tracks tokens used at every step for budget management.
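
The shape of that solution can be pictured as a small set of records; the field names below are illustrative rather than the project's exact schema:

from dataclasses import dataclass, field

@dataclass
class FilePatch:
    path: str
    original: str   # content before the change
    new: str        # content after the change

@dataclass
class Solution:
    explanation: str                        # step-by-step reasoning
    patches: list[FilePatch] = field(default_factory=list)
    tests: list[FilePatch] = field(default_factory=list)
    tokens_used: int = 0                    # tracked for budget management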

The final stage uses PRAutomation to create a well-structured pull request with context, tests, and documentation.

For trivial changes (typos, version bumps), it can even auto-merge.

LLMs are the engine that powers GlobaLLM, but they're used strategically rather than indiscriminately.

Stage 3 - Prioritize: The IssueAnalyzer calls an LLM to categorize each issue.

Input: title, body, labels, comments, reactions.

Output: category, complexity (1-10), solvability (0-1), breaking_change, test_required.

This costs ~500 tokens per issue and feeds directly into the priority scoring.
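
The analyzer's output maps naturally onto a small record like this (a sketch; the real class may differ):

from dataclasses import dataclass

@dataclass
class IssueAnalysis:
    category: str        # bug, feature, documentation, performance, security, ...
    complexity: int      # 1-10
    solvability: float   # 0-1
    breaking_change: bool
    test_required: bool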

Stage 4 - Fix: The CodeGenerator uses an LLM to generate complete solutions.

Input: issue details, repository context, language style guidelines.

Output: explanation, file patches (original + new content), test files.

This costs 1k-10k tokens depending on complexity.

The key insight: LLMs are only used for tasks requiring intelligence.

Discovery, health scoring, impact calculation, and PR automation use deterministic algorithms.

The real power of GlobaLLM emerges when you run multiple instances in parallel.

Each GlobaLLM instance has a unique AgentIdentity.

When it's ready to work, it calls:

globallm assign claim

This atomically reserves the highest-priority unassigned issue.

The assignment is stored in PostgreSQL with a heartbeat timestamp.

To prevent multiple agents from working on the same issue:

  1. Issues are marked assigned with an agent ID and timestamp
  2. Heartbeats update every 5 minutes
  3. If a heartbeat expires (30 minutes), the issue is reassigned

This allows crash recovery: if an agent crashes mid-work, another will pick up the issue.

The heartbeat system is elegant in its simplicity:

# Agent side: refresh the heartbeat (every 5 minutes) while the work continues
while working:
    update_heartbeat(issue_id, agent_id)
    do_work()

# Recovery side: reclaim issues whose heartbeat expired (after 30 minutes)
expired = get_issues_with_expired_heartbeats()
for issue in expired:
    reassign(issue)

No distributed consensus needed - PostgreSQL's row-level locking handles contention.
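
For concreteness, an atomic claim can be expressed as a single UPDATE using SKIP LOCKED; the table and column names below are illustrative, not the actual schema:

import psycopg

CLAIM_SQL = """
UPDATE issues
   SET assigned_to = %(agent_id)s, heartbeat_at = now()
 WHERE id = (
       SELECT id FROM issues
        WHERE assigned_to IS NULL
        ORDER BY priority DESC
        LIMIT 1
        FOR UPDATE SKIP LOCKED
 )
RETURNING id
"""

def claim_issue(conn: psycopg.Connection, agent_id: str) -> int | None:
    # Whoever commits first wins; concurrent claimants skip the locked row.
    with conn.transaction(), conn.cursor() as cur:
        cur.execute(CLAIM_SQL, {"agent_id": agent_id})
        row = cur.fetchone()
        return row[0] if row else None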

PostgreSQL is the central state store:

  • Connection pooling: 2-10 connections per process (psycopg pool)
  • JSONB columns: Flexible schema for repository/issue metadata
  • Indexes: On frequently queried fields (stars, health_score, assigned status)
  • Migrations: Versioned schema
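
A minimal setup along those lines, assuming psycopg's pool package and an illustrative table (not the actual schema):

from psycopg_pool import ConnectionPool

pool = ConnectionPool(
    conninfo="postgresql://globallm@localhost/globallm",
    min_size=2, max_size=10,   # 2-10 connections per process
)

with pool.connection() as conn:
    conn.execute("""
        CREATE TABLE IF NOT EXISTS repositories (
            id           BIGSERIAL PRIMARY KEY,
            full_name    TEXT UNIQUE NOT NULL,
            metadata     JSONB NOT NULL DEFAULT '{}'::jsonb,  -- flexible repo metadata
            stars        INTEGER,
            health_score REAL
        )
    """)
    conn.execute("CREATE INDEX IF NOT EXISTS idx_repos_stars ON repositories (stars)")
    conn.execute("CREATE INDEX IF NOT EXISTS idx_repos_health ON repositories (health_score)")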

GlobaLLM is an evolving experiment.
A system like GlobaLLM runs into many of the same challenges developers face on a daily basis, and it will need to address them to become more effective:

  • Parallelizing work across multiple agents without conflicts or redundant effort.
  • Producing "mental models" of repositories to better understand their goals, architecture, dependencies and trajectories.
  • Building a higher-level decision-making system that can reason about which repositories to focus on based on broader trends in the open source ecosystem.
  • Making decisions such as which programming languages to focus on.
  • Working with closed source repositories, which may not have the same signals as open source ones (e.g., forks, stars, dependency count).

GlobaLLM is an experiment in what's possible when you combine LLM code generation with principled decision-making and distributed execution.

The goal isn't to replace human contributors - it's to handle the long tail of maintenance work that no one has time for, freeing up humans to focus on the interesting problems.

The system is actively developed and evolving.

Current work focuses on better prioritization heuristics, more sophisticated validation, and integration with additional LLM providers.

If you're interested in contributing or just want to run it yourself, the code is available on GitHub.

This system is far from perfect, but it's a step toward harnessing AI to make open source software healthier and more sustainable at scale.

It's also a way to explore what decision-making looks like at the scale of millions of repositories and billions of issues.

Dr. Aris Thorne stared at the shimmering temporal rift, its edges flickering like a corrupted display. He wasn't just looking at a gateway through time – he was looking at a living, breathing Git repository of reality itself.

"Status check," he muttered, fingers dancing across the holographic interface. "Current branch: timeline-main. Last commit: 'Catastrophe at Point Zero' by User:Humanity."

Three days ago, humanity had triggered the Cascade Event – a chain reaction of temporal paradoxes that threatened to unravel existence. Now Aris, the last Temporal Archivist, was attempting something never before conceived: git revert on reality itself.

"Creating new branch: 'fix-attempt-1'," he announced to the empty lab. The temporal rift stabilized, showing a parallel timeline branching off from moments before the disaster.

Aris stepped through, materializing in the control room of the Chronos Facility, right as the ill-fated experiment was about to begin. He knew the command sequence by heart – the one that would prevent the Cascade.

But as he approached the console, he froze. His younger self was there, looking determined but naive. If Aris intervened, would he create a merge conflict with his own existence?

"Branching again," he decided, retreating to the safety of the temporal nexus. "Creating 'fix-attempt-2' from an earlier commit."

This time he arrived hours earlier, when the facility was still empty. He carefully modified the experiment parameters, ensuring the Cascade could never occur. Satisfied, he returned to his present.

The lab was unchanged. The rift still showed the corrupted timeline.

"Failed merge," Aris realized with dawning horror. "Reality rejected the patch."

Days turned into weeks as Aris created dozens of branches, each attempting to fix the timeline. He tried git cherry-pick of successful moments from history, git rebase of civilization's achievements, even git bisect to isolate the exact commit that had broken everything.

Nothing worked. Each attempt was rejected by the cosmic repository, leaving him with countless abandoned branches floating in temporal limbo.

Exhausted, Aris collapsed before the interface. "git log --oneline --graph," he whispered, watching the tree of failed attempts bloom across the display. It was beautiful in its complexity – a constellation of what-ifs and could-have-beens.

That's when it hit him. He'd been trying to fix the timeline, to restore a previous commit. But what if the solution wasn't to revert, but to evolve?

"Creating new branch: 'transcendence'," he declared with renewed energy. "Not from any previous commit, but from the current corrupted state."

He stepped through into the fractured timeline, where temporal paradoxes manifested as impossible architecture and shifting landscapes. Instead of fighting the chaos, he embraced it. He worked with the anomalies, finding patterns in the madness.

Aris discovered that the Cascade wasn't an error – it was evolution. Humanity had outgrown its linear timeline, and reality was attempting to branch into a multidimensional existence.

"Merge request," he transmitted to the temporal repository. "Not to fix, but to complete the transformation."

The rift stabilized, its chaotic energy resolving into something new and coherent. Aris watched as all his abandoned branches began to merge into this new reality, each failed attempt contributing something essential to the final design.

When he returned to his lab, everything was different yet familiar. The temporal rift was gone, replaced by a window showing infinite timelines coexisting harmoniously.

Aris smiled at the new interface displaying the transformed reality. "Current branch: timeline-multiverse. Last commit: 'Embrace the Chaos' by User:Humanity."

He had learned the ultimate lesson of temporal manipulation: sometimes the best commit isn't a fix, but a feature.

I built something I've wanted for a while within an hour using Claude Code (Sonnet 4.5).
It's an LLM conversations viewer, available for everyone to use at https://tomzxcode.github.io/llm-conversations-viewer.
It is a single page application, 100% client-side.
You can use it to view conversations exported from various LLM platforms (e.g., ChatGPT, Claude, etc.) in a nice interface.
It can also be used to share conversations with others so that they can view them, using https://tomzxcode.github.io/llm-conversations-viewer/?url=url-to-json-or-zip (such as a raw Gist URL).
It can be useful if you use many platforms through their clients (web or mobile) and would like to be able to search those conversations in one place.
Play with it and let me know what you think!
Repository here

The readability index is a metadata field added to articles on this blog to help readers quickly assess whether an article is appropriate for their level of expertise and available attention.

Articles are rated on a 0-5 scale based on their readability:

  • 0 - Personal notes: Content that is only meaningful to me. These are often shorthand notes, context-dependent references, or incomplete thoughts that lack the necessary background for others to understand.

  • 1 - Cryptic: Readable but highly condensed or assumes significant context. These articles may use jargon without explanation, reference obscure concepts, or present ideas in a very terse manner. You might be able to extract value, but it requires effort and potentially external research.

  • 3 - Specialized/Expert audience: Articles written for readers with domain expertise. These assume familiarity with technical terminology, concepts, and background knowledge in a specific field (e.g., machine learning, software architecture, AGI research). The writing is clear if you have the prerequisite knowledge.

  • 5 - General audience: Articles written to be accessible to anyone with general reading ability. These explain concepts from first principles, define technical terms when used, and don't assume specialized background knowledge. They're structured for easy consumption.

The readability index serves several purposes:

  1. Reader efficiency: Helps readers quickly determine if an article matches their current context and expertise level before investing time in reading it.

  2. Content discovery: Makes it easier to filter or search for articles at the appropriate level - whether you want deep technical content or accessible introductions.

  3. Author awareness: Forces me to be conscious about my target audience when writing, which can improve clarity and focus.

  4. Archive navigation: As this blog contains a mix of polished articles and personal research notes, the index helps distinguish between content types.

This index is subjective and represents my assessment at the time of writing. Your mileage may vary - what I consider a "3" might be a "5" for experts in that field or a "1" for beginners.

You might notice the scale uses discrete values (0, 1, 3, 5) rather than every integer from 0 to 5. This is intentional:

  • 2 would fall between "cryptic" and "specialized" - a fuzzy middle ground that's hard to define
  • 4 would be between "specialized" and "general" - again, unclear distinction

The four-point scale provides enough granularity to be useful without creating artificial precision. If an article truly feels intermediate, I'd likely rate it at the lower level (1 or 3) since it's better to underpromise and overdeliver on accessibility.

The readability index is independent of the article's status field:

  • status: draft - Article is incomplete or actively being written
  • status: in progress - Article is being updated and refined
  • status: finished - Article is complete and unlikely to be revised

An article can be status: finished with readability: 0 (polished personal notes) or status: draft with readability: 5 (an in-progress accessible introduction). They measure different dimensions of the content.

Like any other metadata on this blog, readability ratings may change over time as I revisit articles or as my sense of what constitutes each level evolves. The index is a tool for navigation, not a rigid categorization system.