# Building a Personal AI Operating System — What 128 Sessions Taught Me

## A case study in longitudinal human-AI collaboration

---

## Who this is for

This isn't a tutorial. It's a field report.

If you're a developer curious about what serious daily Claude Code use actually looks like — the failures, the patterns, the things that surprised me — this is for you. If you're thinking about whether AI tools can run real professional workflows, not demos, read on.

---

## Background

I'm not a programmer. I orchestrate.

My work spans operations, AI education, and e-commerce — multiple roles, multiple clients, dozens of moving parts every day. In March 2026 I started building what I now call Command Center: a personal AI operating system running on Claude Code CLI.

The goal wasn't a side project. It was an attempt to replace chaos with coherence.

Six weeks later: 128 sessions, 15+ live integrations, and a system that runs billing, HR, sales, content production, client demos, and team communication — without me writing a single line of code.

---

## What I built

Not a chatbot. Not a script collection.

A structured layer on top of Claude that gives it continuity across sessions:

- **Persistent memory** — markdown files that carry context, decisions, and learned behavior forward. `BRAIN.md` (who I am, how we work), `LINEAGE.md` (128-session history), `feedback/` (18+ behavioral corrections), `knowledge/` (how to do specific things)
- **Integrations** — ClickUp, Gmail, Calendar, GitHub, Vercel, Figma, Playwright, Supabase and more (15+ MCP servers)
- **Slash commands** — versioned in git, project-level, covering common workflows end-to-end
- **Background jobs** — Mattermost bot, Telegram bot, daily AI news digest, procurement scraper, all running as macOS daemons
- **Stop hook** — every session end auto-logs to `activity.log`

No UI. Three input channels: terminal, Mattermost, Telegram.
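As a rough illustration of the stop hook in the list above: Claude Code hooks receive a JSON payload on standard input, so a logging script can be quite small. The sketch below is an assumption about shape, not the actual hook. The script name `log_session.py`, the log line format, and passing the log path as a command-line argument are all hypothetical, and the payload fields are read defensively in case they are absent.

```python
import json
import sys
from datetime import datetime, timezone
from pathlib import Path


def format_entry(payload: dict, now: datetime) -> str:
    """Build one tab-separated log line from a hook payload (format is hypothetical)."""
    session = payload.get("session_id", "unknown-session")
    cwd = payload.get("cwd", "unknown-dir")
    return f"{now.isoformat(timespec='seconds')}\t{session}\t{cwd}\n"


def append_entry(log_path: Path, payload: dict) -> None:
    """Append a single line to the log, creating parent folders if needed."""
    log_path.parent.mkdir(parents=True, exist_ok=True)
    with log_path.open("a", encoding="utf-8") as log:
        log.write(format_entry(payload, datetime.now(timezone.utc)))


if __name__ == "__main__" and len(sys.argv) > 1:
    # Assumed invocation from the hook: python3 log_session.py /path/to/activity.log
    raw = sys.stdin.read().strip()
    append_entry(Path(sys.argv[1]), json.loads(raw) if raw else {})
```

Appending rather than rewriting keeps the log safe even if two sessions end close together, and the script does nothing at all when no log path is given.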

---

## What actually works

### Manually maintained memory scales further than expected

The core insight: Claude in session 128 knows what happened in session 12.

Not because of native memory features. Because of a disciplined markdown structure that gets written and read every session. `LINEAGE.md` is a 128-entry log: what was worked on, key decisions, lessons, handoff notes. `BRAIN.md` is the mental model map.

The result is something that feels qualitatively different from a fresh chat. Context accumulates. Decisions have reasons attached. Mistakes don't repeat.

This works. It requires discipline to maintain — but it works better than I expected.
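To make the "written every session" part concrete, here is a minimal sketch of appending one entry to `LINEAGE.md`. The four fields come from the description above. The heading layout, the function names, and the date format are assumptions for illustration, not the actual file format.

```python
from datetime import date
from pathlib import Path


def render_entry(number: int, day: date, worked_on: str,
                 decisions: list[str], lessons: list[str], handoff: str) -> str:
    """Render one session entry as markdown (layout is hypothetical)."""
    def bullets(items: list[str]) -> str:
        return "\n".join(f"- {item}" for item in items) or "- none"

    return (
        f"\n## Session {number} ({day.isoformat()})\n\n"
        f"**Worked on:** {worked_on}\n\n"
        f"**Key decisions:**\n{bullets(decisions)}\n\n"
        f"**Lessons:**\n{bullets(lessons)}\n\n"
        f"**Handoff:** {handoff}\n"
    )


def append_entry(lineage: Path, entry: str) -> None:
    """Append to the end of the log; earlier entries are never rewritten."""
    with lineage.open("a", encoding="utf-8") as f:
        f.write(entry)
```

Keeping the log append-only is the design choice that matters here: a later session can add context to an earlier decision, but it cannot silently change it.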

### Behavioral adaptation without model changes

`memory/feedback/` holds 18+ correction files. Each one is a rule I gave after Claude did something wrong or I wanted different behavior:

- "Don't suggest this project unprompted — only when I bring it up"
- "No bold or em-dashes in messages. Direct, not corporate."
- "Map what persists before changing any UI state"
- "Don't frame a working feature as a risk"

These are applied through context, not fine-tuning: the model itself never changes. Over sessions, behavior that drifted gets pulled back in line. It is a manual stand-in for fine-tuning, slow but real.
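"Applied through context" roughly means the correction files are read back in at the start of a session. A minimal sketch of that step, assuming each correction is a separate `.md` file under `memory/feedback/` (the function name and the heading format are hypothetical):

```python
from pathlib import Path


def build_feedback_preamble(feedback_dir: Path) -> str:
    """Concatenate all correction files into one block of session context."""
    sections = []
    # Sorted so the same files always produce the same preamble.
    for path in sorted(feedback_dir.glob("*.md")):
        text = path.read_text(encoding="utf-8").strip()
        if text:  # skip empty files rather than emit blank rules
            sections.append(f"### {path.stem}\n{text}")
    return "\n\n".join(sections)
```

One file per rule keeps each correction easy to find, edit, or retire without touching the others.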

### End-to-end workflows with zero code written

The most surprising outcome: I run production pipelines I couldn't have written myself.

Client demo in under 2 hours from brief (SvelteKit → GitHub → Vercel). Invoice generation across 3 legal entities. HR candidate pipeline from application to ClickUp entry to email. Daily procurement scraper posting to a team channel.

None of this required me to write code. It required me to know what I wanted clearly enough to describe it — and to debug the *intent*, not the syntax.

---

## What breaks

### No live status layer

After 128 sessions and 15+ integrations, I can't tell at a glance what's actually running right now. The memory system tells me what *should* work. It doesn't tell me what *currently* works. This is the main gap — and a real source of friction.
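One possible shape for that missing status layer, sketched under assumptions: since the background jobs run as macOS daemons, the output of `launchctl list` (columns: PID, last exit status, label) already says which ones are alive. The job labels in the usage example are made up for illustration, and this only covers local daemons, not the health of remote integrations.

```python
import subprocess


def parse_launchctl_list(output: str) -> dict[str, bool]:
    """Map each job label to True if it currently has a PID."""
    running = {}
    for line in output.splitlines()[1:]:  # first line is the column header
        parts = line.split("\t")
        if len(parts) == 3:
            pid, _status, label = parts
            running[label] = pid != "-"
    return running


def job_status(expected: list[str], output: str) -> dict[str, str]:
    """Classify each expected label against a `launchctl list` listing."""
    seen = parse_launchctl_list(output)
    status = {}
    for label in expected:
        if label not in seen:
            status[label] = "not loaded"
        else:
            status[label] = "running" if seen[label] else "loaded, not running"
    return status


def read_launchctl() -> str:
    """Capture the current `launchctl list` output (macOS only)."""
    return subprocess.run(["launchctl", "list"], capture_output=True,
                          text=True, check=True).stdout
```

A call such as `job_status(["com.example.mattermost-bot"], read_launchctl())` would then give a one-glance answer. A scheduled job like a daily digest is expected to show as "loaded, not running" between runs, so that state is not an error by itself.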

### Speed outpaced understanding

128 sessions in about six weeks. The system grew faster than my mental model of it. The potential is visible. Full control isn't always there.

This is an honest admission: building fast is not the same as understanding what you built.

### Memory is brittle

Everything lives in markdown files. No validation, no consistency checks, no database. If files drift or disappear, accumulated value goes with them. The system has no self-repair.
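A consistency check does not need a database. As a minimal sketch, assuming the memory folder contains the files named earlier (`BRAIN.md`, `LINEAGE.md`, `feedback/`, `knowledge/`) directly under one root, a script could at least report what is missing or empty before a session starts. The function name, the folder layout, and the exact checks are hypothetical.

```python
from pathlib import Path

REQUIRED_FILES = ["BRAIN.md", "LINEAGE.md"]
REQUIRED_DIRS = ["feedback", "knowledge"]


def check_memory(root: Path) -> list[str]:
    """Return human-readable problems; an empty list means the basics are intact."""
    problems = []
    for name in REQUIRED_FILES:
        path = root / name
        if not path.is_file():
            problems.append(f"missing file: {name}")
        elif path.stat().st_size == 0:
            problems.append(f"empty file: {name}")
    for name in REQUIRED_DIRS:
        folder = root / name
        if not folder.is_dir():
            problems.append(f"missing folder: {name}")
        elif not any(folder.glob("*.md")):
            problems.append(f"no markdown files in: {name}")
    return problems
```

This catches disappearance, not drift. Detecting that a file's content has quietly gone stale would need something stronger, and git history is the obvious candidate since the slash commands are already versioned there.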

### Bot asymmetry

The Mattermost and Telegram bots run subsets of what the terminal session can do — limited by which MCP servers are available without OAuth. This gap isn't obvious until something fails mid-workflow.

---

## Real failure modes (documented)

These aren't hypotheticals. They happened.

**Apple Seatbelt sandbox + `gh` CLI (sessions 120–126):** The security sandbox blocked `gh` by denying XPC access to `com.apple.trustd.agent`. Go binaries need it for TLS. Symptom was `x509: certificate signed by unknown authority`. Workaround: replace `gh` calls with `curl` using a token from `~/.config/gh/hosts.yml`. Conclusion: the sandbox added marginal security at the cost of broken real work. Removed.
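The workaround can be sketched as follows, assuming `gh` has stored the token in plain text as an `oauth_token:` line in `hosts.yml` (newer versions of `gh` may keep it in the system keychain instead, in which case this finds nothing). The function names are hypothetical; the endpoint is the public GitHub REST API.

```python
import re
from typing import Optional


def read_gh_token(hosts_yml: str) -> Optional[str]:
    """Pull the first oauth_token value out of hosts.yml text without a YAML parser."""
    match = re.search(r"^\s*oauth_token:\s*(\S+)\s*$", hosts_yml, re.MULTILINE)
    return match.group(1) if match else None


def curl_command(token: str, api_path: str) -> list[str]:
    """Build a curl call roughly equivalent to `gh api <api_path>`."""
    return [
        "curl", "--silent", "--fail",
        "-H", f"Authorization: Bearer {token}",
        "-H", "Accept: application/vnd.github+json",
        f"https://api.github.com/{api_path.lstrip('/')}",
    ]
```

The resulting list can be passed straight to `subprocess.run`. Note that a token passed this way is visible in the process arguments while curl runs, which is a tradeoff of its own.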

**Background agent stuck on dev server (session 127):** Spawned a background agent to build a project. After 10+ minutes, no commit. The agent had started a dev server, which never exits on its own, and was waiting on it indefinitely. None of this was visible from the main session: no timeout, no signal, no detection. Fixed by manual takeover.
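The missing piece here was a deadline. A minimal sketch of the idea, assuming the build step can be launched as an ordinary subprocess (the function name is hypothetical, and this is not how Claude Code schedules background agents):

```python
import subprocess


def run_with_deadline(command: list[str], seconds: float) -> str:
    """Run a command, killing it if it outlives the deadline."""
    try:
        done = subprocess.run(command, capture_output=True, text=True,
                              timeout=seconds)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child process before raising this.
        return "timed out"
    return "ok" if done.returncode == 0 else f"failed ({done.returncode})"
```

A caveat: only the direct child is killed. A dev server started through a wrapper such as `npm run dev` may leave its own child processes behind, and those would need separate cleanup.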

**Svelte version mismatch (session 127):** Agent generated Svelte 5 event handler syntax into a Svelte 4 project. `ParseError: Unexpected token`. The cause wasn't obvious from vite build output. `npx svelte-check --output machine` gives exact file and line — vite doesn't.
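A cheap guard against this class of mistake is to read the declared Svelte version before generating any component code. The sketch below assumes the version is declared in `package.json` under `dependencies` or `devDependencies`. The function name is hypothetical, and it only reads the declared range, not what is actually installed.

```python
import json
import re
from typing import Optional


def declared_svelte_major(package_json_text: str) -> Optional[int]:
    """Return the major version from the declared svelte range, e.g. '^4.2.0' -> 4."""
    package = json.loads(package_json_text)
    for section in ("dependencies", "devDependencies"):
        spec = package.get(section, {}).get("svelte")
        if spec:
            # For simple ranges the first number is the major version.
            match = re.search(r"\d+", spec)
            return int(match.group()) if match else None
    return None
```

The result can go into the agent's brief ("this is a Svelte 4 project") so the syntax choice is stated rather than guessed.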

**Commented line in .env (session 126):** The file contained `# CLICKUP_API_TOKEN=`. A substring check for the key returned True, so the script assumed the variable was present, but the replacement regex `^CLICKUP_API_TOKEN=` did not match the commented line. The script appeared to update the value but changed nothing. Silent failure. Fix: regex `^#?\s*CLICKUP_API_TOKEN=`.
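The bug and the fix can be reproduced in a few lines. This is a reconstruction under assumptions, not the original script: the function names and the sample values are hypothetical, and the lenient pattern is the one quoted above.

```python
import re

KEY = "CLICKUP_API_TOKEN"
STRICT = re.compile(rf"^{KEY}=.*$", re.MULTILINE)        # misses commented lines
LENIENT = re.compile(rf"^#?\s*{KEY}=.*$", re.MULTILINE)  # the fix


def update_buggy(env_text: str, value: str) -> str:
    """Substring check says the key exists, but the strict regex replaces nothing."""
    if KEY in env_text:
        return STRICT.sub(lambda _m: f"{KEY}={value}", env_text)
    return env_text + f"{KEY}={value}\n"


def update_fixed(env_text: str, value: str) -> str:
    """Replace the line whether or not it is commented out; append if absent."""
    new_text, count = LENIENT.subn(lambda _m: f"{KEY}={value}", env_text)
    return new_text if count else env_text + f"{KEY}={value}\n"
```

Counting replacements with `subn` is what turns the silent failure into a visible one: zero matches now leads to an explicit append instead of an unchanged file. One refinement worth considering is `[ \t]*` in place of `\s*`, since `\s` also matches newlines and can swallow a blank line above the key.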

---

## What this shows

### The biggest value isn't capability — it's continuity

Individual Claude capabilities are well documented. What's underdocumented is what happens when you use it *consistently*, *daily*, for *real work*, over *weeks*.

Continuity compounds. Corrections stick. The system gets closer to how you actually think. This is qualitatively different from using a powerful tool occasionally — it's closer to building a working relationship.

### The non-programmer power user is a real category

I wrote zero production code across 128 sessions. I still run infrastructure most developers wouldn't build.

This category of user — let's call them orchestrators — thinks in workflows and outcomes, not syntax. They have different failure modes (silent failures, intent drift, output that looks right but isn't), different needs (observability without stack traces, rollback without git expertise), and a different relationship with AI tools than either beginners or developers.

This category is going to grow. Probably fast.

### Behavioral depth requires investment

You don't get a system that knows how you think from a few conversations. You get it from consistent use, deliberate correction, and a structure that remembers both.

The people who will get the most out of AI tools long-term are probably not the ones with the best prompts. They're the ones who treat the tool as something to build a working relationship with — and invest accordingly.

### Security is a design tradeoff, not a user choice

Every time I had to choose between security and functional capability, I chose function. Not because I'm careless — because broken workflows have real costs and marginal security has marginal value when the threat model is unclear.

Every serious user will run into this tradeoff, and it shouldn't be theirs to resolve alone. It's a design problem that needs better solutions than "add a sandbox and see what breaks."

---

## What I don't know yet

Whether this scales. Whether the memory system holds up at 500 sessions. Whether the approach generalizes to teams or stays personal. Whether the brittle parts will collapse at a bad moment.

The system works. It also has real gaps I haven't solved. Both things are true.

---

## Why I'm publishing this

Not to market a product. I'm not selling anything.

I'm publishing this because I think one account of AI adoption is still missing. Not the hype version ("AI will do everything") or the dismissive version ("it's just autocomplete"), but the honest version: what does sustained, serious, production use actually look like?

This is one data point. A messy, documented, real one.

---

*Martin Andrt — March–April 2026*

*Lecturer | Operations*

*Source: 128 Claude Code sessions, `LINEAGE.md`, `memory/feedback/`, and a lot of late nights.*