Four hours of full autonomy for Claude Code on my production site.
An agentic AI refactor methodology in four validation layers. 24 commits, 13 routes refactored, 0 regressions. Real metrics from kainext.cl. The point isn't that AI did it. It's how you validate it did it well.


On a Friday afternoon I gave Claude Code full autonomy to refactor the public frontend of kainext.cl. No permission asked between tasks. No questions to me. Four hours, as far as it could get.
The plan had 67 tasks distributed across 13 phases. The visual direction was Direction B, Modern Editorial Hybrid, a proposal chosen with UI UX Pro Max across three axes: institutional consultancy rigor, premium technical modernity, and Chilean familiar warmth. Full token migration, dark mode elimination, new typography, accessibility, performance.
The end result is 24 atomic commits in production, zero regressions, metrics audited across every dimension. But the insight isn't that. It's the methodology that made it possible to get there, and specifically the four validation layers that had to combine for the output to be truly trustworthy.
The mistake I almost made
Three hours after kickoff, Claude Code reported having completed nine of the thirteen phases. Build green. 2,189 tests passing. Coverage stable. axe-core's mathematical checks reported zero violations on some routes. The internal narrative was tempting: the agent did the work, the metrics confirm it, merge and publish.
A human visual audit of the homepage detected three things no automated test had found. A divider with gold text on a white background at 2.9 to 1 contrast, failing WCAG AA by a wide margin. The methodology section rendering five cards in a single row when it should be three above and two below. Eyebrows set in Inter instead of JetBrains Mono, breaking the typographic system the visual direction defined.
No automated test had failed. And yet, three real bugs waiting to be published. If I had merged at that moment, I would have brought them to production with the false reassurance of green metrics.
That's when I understood that agentic AI works, just not the way one assumes. AI isn't the system. AI is a layer of the system. The complete system needs layers that complement each other, because each has different blind spots.
The methodology in four layers
What ended up working is a four-layer system where each layer validates what the previous one can't see. No single layer is sufficient. Together they make the output publishable with real confidence.
Layer 1. Rigorous planning before touching code.
Before Claude Code wrote a single line, I spent two hours with Jesse Vincent's Superpowers and UI UX Pro Max producing an ultra-detailed plan of 67 tasks, each with exact code, verification commands, a pre-written commit message, and acceptance criteria. Three iterations of the plan before approving it, fixing technical errors like running Lighthouse against dev mode instead of a production build, an invalid variable-font configuration, and incorrect Tailwind v4 syntax.
Rigorous planning isn't bureaucracy. It's what spares the autonomous subagent from having to make decisions that require human context. Each task is precise enough that a fresh agent can execute it without drift. Without this, full autonomy is just garbage produced fast.
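To make that concrete, here's roughly what a single plan entry looked like. The field names and the sample task are illustrative, not the actual plan format:

```typescript
// Hypothetical shape of one plan entry. Field names, token names, and the
// sample values are illustrative, not the real 67-task plan format.
interface PlanTask {
  id: string;                   // e.g. "phase-07/task-03"
  intent: string;               // one-sentence description of the change
  exactCode: string;            // the snippet to apply, written in advance
  verification: string[];       // commands the agent must run and show passing
  commitMessage: string;        // conventional commit, pre-written
  acceptanceCriteria: string[]; // observable facts that must hold before "done"
}

const exampleTask: PlanTask = {
  id: "phase-07/task-03",
  intent: "Replace hardcoded hex values in the hero section with design tokens",
  exactCode: 'className="bg-surface text-primary"', // illustrative token names
  verification: ["npm run lint", "npm run type-check", "npx jest --silent"],
  commitMessage: "refactor(hero): migrate hardcoded hex to design tokens",
  acceptanceCriteria: [
    "No raw hex values remain in the hero component",
    "Build passes with no new warnings",
  ],
};
```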
Layer 2. Autonomous execution with explicit hard rules.
The execution prompt had hard rules written as a contract, not as suggestions. Don't lower coverage thresholds. Don't use the any type. Don't push to main. Don't expand scope to the dashboard. Don't commit without confirmation that the build passes. Confidence gates before declaring a task complete, requiring explicit evidence rather than success declarations.
Subagent-driven development, not inline. Each task goes through implementation subagent, spec-compliance review subagent, and code-quality review subagent before commit. This isn't theater. It's what prevents context saturation and quality decay in late phases.
Layer 3. Mathematical automated audit.
After autonomous execution, independent validation with tools that weren't part of the generation process. axe-core e2e on the 13 public routes for WCAG 2.1 AA. Lighthouse for performance. Bundle analyzer for JS size. Visual regression with pixelmatch comparing pre and post screenshots across three viewports.
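For reference, a minimal sketch of what the per-route axe check can look like, assuming Playwright as the e2e runner with @axe-core/playwright; the route list is an illustrative subset, not the actual 13:

```typescript
import { test, expect } from "@playwright/test";
import AxeBuilder from "@axe-core/playwright";

// Illustrative subset of the public routes; assumes baseURL is set in playwright.config.
const routes = ["/", "/servicios", "/blog"];

for (const route of routes) {
  test(`WCAG 2.1 AA: ${route}`, async ({ page }) => {
    await page.goto(route);
    const results = await new AxeBuilder({ page })
      .withTags(["wcag2a", "wcag2aa", "wcag21a", "wcag21aa"])
      .analyze();
    // Any violation fails the run and prints the offending nodes.
    expect(results.violations).toEqual([]);
  });
}
```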
This layer detects what the plan didn't anticipate: mathematical contrast, bundle regressions, layout shifts, a broken accessibility tree. It's necessary but has significant blind spots. That gold text at 2.9 to 1? axe let it pass because the algorithm misclassified the font size and weight. Mathematical metrics get edge cases wrong more often than expected.
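For context on why that ratio is a hard fail: WCAG AA requires 4.5 to 1 for normal text and 3 to 1 for large text, computed from relative luminance, which is why size and weight classification matters so much. A minimal sketch of the math, using an illustrative gold that isn't the site's actual token:

```typescript
// WCAG 2.x relative luminance and contrast ratio.
function channel(c: number): number {
  const s = c / 255;
  return s <= 0.03928 ? s / 12.92 : Math.pow((s + 0.055) / 1.055, 2.4);
}

function luminance(hex: string): number {
  const [r, g, b] = [1, 3, 5].map((i) => parseInt(hex.slice(i, i + 2), 16));
  return 0.2126 * channel(r) + 0.7152 * channel(g) + 0.0722 * channel(b);
}

function contrastRatio(fg: string, bg: string): number {
  const [lighter, darker] = [luminance(fg), luminance(bg)].sort((a, b) => b - a);
  return (lighter + 0.05) / (darker + 0.05);
}

// An illustrative warm gold on white lands around 2.9 to 1,
// below the 3:1 large-text threshold and far below 4.5:1 for normal text.
console.log(contrastRatio("#B5923A", "#FFFFFF").toFixed(2)); // ≈ 2.94
```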
Layer 4. Human visual audit with judgment.
The layer I almost skipped. Claude Chrome Extension going through the rendered site in a real browser, evaluating eight axes: perceived contrast, typographic hierarchy, spacing rhythm, layout grid, visual consistency, interactive states, animations, and emotional brand fit. Not just what axe says. What it feels like to read it.
This layer found the divider failing AA, the broken grid, the mixed eyebrows, a forgotten eyebrow in the methodology section at 1.71 to 1 contrast, and a CTA on the Diagnóstico 360 flagship service pointing to the contact form instead of the service landing page. Bugs no mathematical metric was going to catch, because they require judgment about what should happen visually, not just about what meets a threshold.
What each layer detected
The real value of the methodology shows when you observe what each layer found that the others couldn't.
Rigorous planning eliminated ambiguous decisions during execution, preventing agent drift. Without it, four hours of full autonomy would have produced architectural inconsistencies that would later be impossible to revert granularly.
Autonomous execution covered the bulk of the mechanical work in three hours: token unification, dark mode elimination, the hardcoded hex sweep, the 13-route refactor, typography configuration, axe installation and partial violation resolution. What would take a traditional team two sprints of a senior dev.
Mathematical audit confirmed WCAG AA on 13 of 13 routes after iteration, reduced first-load bundle from 813KB to 571KB gzip, validated visual integrity with zero MAJOR diffs across 39 screenshots. It also verified agent self-honesty: Claude Code's autonomous report passed independent verification with zero discrepancies, all its claims auditable.
Human visual audit found six bugs the three previous layers had let through. Two blocking P0s, four aesthetic P1s. Without this layer, the site would have shipped with genuinely compromised accessibility despite green metrics.
Each layer does something the others don't. This isn't redundancy, it's complementarity.
Auditable results in production
The metrics that follow are verifiable against the live site, the repository reports, and Vercel Speed Insights. All taken after deploy to production.
Real Experience Score over seven days of real visitors on desktop: 100 out of 100. More than 75% of sessions had a great experience by Vercel's thresholds. LCP 1.93 seconds, INP 64 milliseconds, CLS zero, FID 5 milliseconds.
First-load JS gzip went from 813KB to 571KB, a 29.8% reduction achieved with a single one-line change: ChatProvider migrated to dynamic import with server-side rendering off. TBT dropped from 70 to 50 milliseconds in production. JS bootup time from 441 to 345 milliseconds.
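The change itself is the standard next/dynamic pattern. A sketch, assuming the provider is a named export consumed from a client component; the import path is illustrative:

```typescript
import dynamic from "next/dynamic";

// Before: a static import pulled the chat widget and its dependencies into
// the shared first-load bundle on every route.
// import { ChatProvider } from "@/components/chat/ChatProvider";

// After: the provider is code-split and only loads in the browser,
// so it no longer counts against first-load JS.
const ChatProvider = dynamic(
  () => import("@/components/chat/ChatProvider").then((m) => m.ChatProvider),
  { ssr: false }
);
```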
Lighthouse SEO went from 92 to 100, eight full points without specifically targeted SEO work. Best Practices went from 74 to 78, partially confirming the hypothesis that part of the low score was local Vercel Insights noise, but also revealing real additional debt with AdSense and bf-cache.
axe-core on 13 public routes with WCAG 2.1 AA: 13 green. Color-contrast violations: zero. Repository net delta: -902 lines of code, with coverage going from 82.53% to 83.04% without adding tests, the effect of removing orphan components the codebase was carrying.
24 atomic commits on top of main, all pushed, all with conventional messages, all passing lint, format, stylelint, type-check, jest, and axe individually. The pre-direction-b-refactor tag preserved as a rollback anchor. 12 technical reports in the audits folder documenting every decision.
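The per-commit gate is simple enough to wire as a small script; the npm script names below are assumptions, not the repo's actual ones:

```typescript
// scripts/verify.ts: illustrative wiring of the per-commit gates.
import { execSync } from "node:child_process";

const gates = [
  "npm run lint",
  "npm run format:check",
  "npm run stylelint",
  "npm run type-check",
  "npx jest --silent",
  "npm run test:axe",
];

for (const cmd of gates) {
  console.log(`→ ${cmd}`);
  execSync(cmd, { stdio: "inherit" }); // a non-zero exit throws, stopping at the first failing gate
}
console.log("All gates green. Safe to commit.");
```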
What I learned about real agentic workflows
Five concrete lessons I take from this.
First. Planning is worth more than execution. Two hours of rigorous planning probably saved six hours of bugs and rollbacks. When someone asks me how to do serious work with autonomous agents, the answer isn't picking a better model. It's investing more time in planning before releasing the agent.
Second. axe-core isn't enough; neither is Lighthouse, nor tests. Mathematical metrics have predictable blind spots. WCAG AA has edge cases where the algorithm gets it wrong. Lighthouse against dev mode gives invalid numbers and nobody warns you. Human visual validation remains necessary for real production.
Third. Claude Code's self-honesty is notably high when the prompt explicitly demands it. Ask it to report what it did wrong, what decisions it made without permission, what's still pending, and what it would calibrate differently. The self-assessment report was verifiable against the facts on every claim. It didn't oversell itself. This changes how much you can trust its outputs.
Fourth. Hard rules written as contract work better than suggestions. Don't lower coverage. Don't expand scope. Don't use type escapes. The difference between writing this as clear prohibition or as recommendation is the difference between a disciplined agent and one that improvises when things get complicated.
Fifth. Human visual audit is not optional. You can feel the difference. It tells you things no test catches. If your agentic methodology doesn't have this layer, you're going to ship aesthetic and judgment bugs that erode brand credibility. For premium B2B, the difference between a site that feels premium and one that feels generic isn't something axe-core can measure.
When this approach is worth it and when it isn't
This methodology is ideal for scoped refactors on production-grade projects where you can afford two hours of upfront planning. It's not ideal for fast exploration, for prototyping where speed matters more than quality, or for projects where human validation isn't available at the end.
I also wouldn't recommend it for teams just starting with agentic workflows. The discipline it requires comes from having gotten it wrong before with poorly scoped autonomy. If you've never used Claude Code seriously, start with small supervised tasks until you calibrate where its limits are.
For technical consulting, yes. For B2B projects with real stakes, yes. For sites where an accessibility failure has legal consequences, yes. For small teams that need big-team output, yes. And always with the four layers. None alone is sufficient.
Closing
The experiment wasn't to test whether Claude Code can. We already knew that. It was to test under what conditions it produces really publishable output, with what metrics, with what discipline, with what human validation.
The short answer: with four layers that complement each other. Rigorous plan, autonomous execution with hard rules, mathematical audit, human visual audit. None alone gets there. All together they turn an interesting tool into a trustworthy methodology.
At KaiNext we combine technical consulting with comprehensive auditing, and exactly this kind of work is what adds the most value to teams that already know where they want to go but need the technical discipline to get there well.

