Build timeline (12 milestones across 64 days)
  1. Day 1 (9d2c818)
    initial commit
  2. Day 1 (f9bd94b)
    Yelp removed
  3. Day 3
    Microsoft SSO, renamed CODA
  4. Day 4
    contact cascade reordered
  5. Day 5
    hardening: circuit breakers, logging
  6. Day 14
    call scoring pipeline
  7. Day 30
    rate-card fix + audit script
  8. Day 36
    Performances tab
  9. Day 39 (f710d12)
    Deal Assist prompt overhaul
  10. Day 58
    pipeline health audit
  11. Day 61
    cron duplicate fix
  12. Day 64
    classification bug fix

01. The setup

I am not an engineer. I run an LLC and work as a fractional RevOps operator. Three years ago I was contracted to a company for outbound email sequencing. The scope evolved to M365 administration, sales enablement, training, data analytics, strategic advisory, and eventually the question of how to build the software this company actually needed.

In February 2026, I decided to answer that question by building it.

The company sells compliance-based licensing to physical commercial venues. Roughly twenty field reps, selling against a published annual fee schedule, in a business where the same venue can require different treatment depending on which classification it falls under and whether the rep quotes it correctly. The tool they had been using to find new prospects was a Google Apps Script I had written for them a year earlier, during my first experiments with ChatGPT automation. It called GPT to pull venue candidates, ran rough fee estimates that were usually wrong, returned duplicates every run, had no deduplication, no contact enrichment, and no pipeline tracking. It did what it needed to at the time. I decided to replace it entirely.

What followed is the subject of this case study. Over the next sixty-four days, I worked with two instances of Claude, one acting as strategy and architecture, one acting as implementation, to build a ten-integration production system. The result is called CODA. The client's business is rooted in music, so the name is a musical one: a coda is the closing, resolving section of a composition. The parallel to sales is intentional: reps close deals, and CODA is the tool that helps them get there.

CODA discovers venues across all fifty U.S. states, classifies them against the client's license types using Claude AI, estimates fees from the official rate schedules, enriches them with owner contact information through a four-stage cascade, delivers curated lead batches to each rep's Google Sheet, pushes qualified leads to Pipedrive, scores sales calls against a twelve-point rubric, and runs an admin dashboard with full analytics, adoption tracking, and pipeline health monitoring. It is deployed on Railway, serves roughly twenty reps, and processes about a thousand new leads per week.

CODA is still being built. That is the point of owning the software your team runs on rather than renting someone else's. When a rep flags a pattern I did not anticipate, I ship the change that week. When leadership asks a new question about pipeline health, I build the query. When the sales motion shifts, the tool shifts with it. The product is not a static subscription the team stops opening in six months; it is an operating layer that evolves alongside the business. It started as a lead-discovery tool. It has grown into the sales intelligence system the team runs on, and it keeps growing.

No engineering hires were made. No agency was contracted. The commit log is a single developer's name from start to finish.

This is the story of how that happened, what went right, what went wrong, and what the pattern looks like when someone without an engineering background uses AI to build production software that informs strategy in real ways, rather than generating output a team has to clean up.

12,652 lines in the initial commit

02. What the commit log shows

Before any narrative, the raw evidence.

The repository opens on February 16, 2026. The first commit is 12,652 lines across 51 files. It contains the orchestrator, nine backend services, four frontend views, the initial database migration, a 613-line project brief, the Railway deployment configuration, and the package.json with all core dependencies. The project is not a scaffold; it is a working application delivered whole.

The tip of the main branch, as of this writing, is April 20, 2026. That is sixty-four days after the opening commit. In between are 442 commits, 43 database migrations, and roughly fifteen distinct product eras. Each era has its own character. Some last a single day and produce a complete vertical slice of new functionality. Some span a week of infrastructure hardening. Some are defined by a single bug that got caught in production and the multi-commit investigation that followed.

The claim I am making in this document is not that Claude built CODA. Claude wrote a lot of code, but Claude also fabricated data, proposed wrong solutions, followed outdated instructions, and occasionally hallucinated confidently enough to have shipped production bugs if nothing had caught it. The story is about the loop that caught those failures and turned them into correct work, and about the human judgment that sat at the center of the loop and made the decisions Claude could not.

The three layers:

Strategy. A Claude.ai instance used for architecture, specification writing, and the "what are we building and why" work. Holds the full business context. Proposes solutions. Reviews plans. Writes prompts.

Implementation. A Claude Code instance used for all actual code work. Reads the existing codebase. Runs audits against production data. Implements changes. Catches when the strategy layer is wrong.

Human. Me. I supply the business rules the models cannot know. I arbitrate when the layers disagree. I hold the field reality. I approve changes before they ship. I own the outcomes.

The loop is not symmetric. Each layer has a different relationship to truth, and each layer has different failure modes. The examples below show what that means in practice.

15 commits on day one

03. Era 1: the one-commit foundation

February 16, 2026. Six hours of work. Fifteen commits, starting with 9d2c818, the initial commit, and ending with f9bd94b, which removed Yelp from the discovery pipeline and prioritized live-entertainment venues.

The initial commit was not written in six hours. It was written by Claude Code over a longer session, from a specification document that Claude.ai had produced collaboratively with me over several preceding conversations. That specification, a 17-section architecture document referenced throughout the repo, established the inviolable business rules that still govern CODA today.

Among them: never charge the wrong fee. Never drop a lead silently. Never re-present a duplicate. Never burn enrichment credits unnecessarily. Never classify a venue toward a higher fee when ambiguous. Always show the rep the classification reasoning so they can sanity-check it before a sales call.

These rules became the skeleton that every subsequent change had to honor. Two months later, after forty-three migrations and hundreds of commits, they are still enforced at the prompt level, at the orchestrator level, and at the database level. That is what the specification layer buys you. When the implementation layer writes four thousand lines of code in a single commit, those lines are coherent because they were written against a specification. When iteration over the next sixty days touched every service file, the iteration was disciplined because the rules did not change.

The pattern here matters beyond this project. Non-technical operators often assume that AI-assisted software development means prompting your way through problems one at a time. CODA's initial commit is evidence against that view. The working system appeared in one pass because the specification was complete enough to support a one-pass implementation. The specification was complete because the strategy layer had been used rigorously to produce it. The one-commit foundation is the visible result of invisible upstream work.

Same-day follow-ups in Era 1 are themselves interesting. Within hours of the initial commit, I was catching problems the specification had not anticipated: white-on-white text on certain sheet rows, chain venues slipping through the initial chain-detection regex, junk email domains contaminating the contact enrichment output. Each got fixed in its own commit. None of them required rearchitecture. That is another property of good specifications. They produce systems where problems are local, not systemic.

Yelp got removed in Era 1 too. The initial specification included Yelp Fusion as a secondary discovery source. Production reality within hours of shipping was that Yelp's results were redundant with Google Places and the API cost was not worth the marginal signal. The decision to remove it was made in a single commit. The lesson is not that the specification was wrong. The lesson is that specifications are best treated as predictions, and the first few hours of production use are the cheapest time to correct the predictions that did not pan out.

04. Era 2 through Era 4: velocity and correction

The second, third, and fourth days of the project produced roughly 110 commits. Pipedrive integration landed on day two. Microsoft SSO replaced the initial PIN-based auth on day three. The contact enrichment cascade was inverted on day four, originally spec'd as Hunter first, Facebook second, Seamless as the last-resort paid option. In production, Seamless was returning better contacts for the specific venue-owner use case, and the order got reversed.

The cascade inversion is a small moment worth sitting with, because it demonstrates the loop correcting its own output. The strategy layer had produced a cascade order based on cost reasoning: Hunter is cheapest, Facebook is free, Seamless costs money, therefore run Seamless last. That reasoning is correct in isolation. It is wrong against production data, because the contacts Seamless returned were higher quality for the specific type of business the client sells to, and the credit cost was justified by the conversion lift.

The strategy layer could not have known this. It had no access to production contact-quality data. The implementation layer could not have proposed it unilaterally because the change altered business logic, not just code. I made the decision after seeing the enrichment results on real leads. That decision took one chat and one commit to implement. The specification got updated to reflect the new order. The inviolable rules, never burn credits unnecessarily, never re-present a duplicate, were still honored, because the credit cap logic was preserved. Only the order of the cascade changed.

This is the first place in the build where the three layers each contributed something the others could not. The strategy layer had the cost model. The implementation layer had the clean ability to rewire the cascade without disrupting the credit cap or the dedup logic. I had the ground truth about which contacts actually closed deals.
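The cascade's shape makes the inversion easy to see. The sketch below is illustrative, not CODA's actual enrichment service: the provider interface, field names, and credit accounting are my assumptions, but the two preserved rules (stop at the first usable contact, never exceed the credit cap) match what the article describes. Reordering the `providers` array is the whole change; the cap logic never moves.

```javascript
// Illustrative enrichment cascade. Providers are tried in order; the first
// usable contact wins. The credit cap is enforced independently of the
// ordering, which is why inverting the cascade did not touch it.
async function enrichContact(venue, providers, creditCap) {
  let creditsSpent = 0;
  for (const provider of providers) {
    // Never burn credits unnecessarily: skip providers we cannot afford.
    if (creditsSpent + provider.cost > creditCap) continue;
    const contact = await provider.lookup(venue);
    creditsSpent += provider.cost;
    if (contact && contact.email) return { contact, creditsSpent };
  }
  return { contact: null, creditsSpent };
}
```

Under this shape, the day-four inversion is literally `[seamless, hunter, facebook]` instead of `[hunter, facebook, seamless]`, with every invariant intact.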

Era 3 is also where the product's identity changed for the first time. The original specification had reps reading their leads from Google Sheets; only the admin logged into the web application. Forty-three hours after the initial commit, that constraint died. Microsoft SSO was installed and reps got their own logins. The PIN-based admin auth was ripped out in the same commit. The application went from single-operator to multi-user in one change.

Two weeks later, the product's second identity change happened: the name. It was rebranded from its working title to "CODA" in a single frontend commit. It stuck. Every subsequent reference in the repo uses it.

Both identity changes are visible in the commit log as single-commit posture changes. That is what iteration looks like in a loop that treats the product as provisional: the specification is strong enough to ship from, but not so rigid that a commit from a week later cannot correct it.

05. Era 5: hardening

Day five through day seven were spent on infrastructure. The commits have titles like "Harden pipeline: fix silent drops, add diagnostics, prevent data loss," and "Harden security: auth scoping, rate limiting, XSS, webhook verification." Pino was added for structured logging. Helmet was added for HTTP hardening. Express-rate-limit was added. A circuit breaker abstraction was introduced that would later wrap every external API call. Optimistic locking was added to prevent race conditions in the scheduler.

This is the era that is hardest to convey to someone without an engineering background, because it explains why experienced engineers treat infrastructure work as real work. From the outside, hardening looks like "the system was already working, why are we still doing things to it." From the inside, hardening is the layer that determines whether a system survives contact with production reality.

Two specific patterns got installed in this era that would save significant pain later.

The circuit breaker pattern: every external API call got wrapped in a shared abstraction that tracks consecutive failures, opens the breaker after a threshold, and transitions through a half-open state before re-enabling. Named breakers were registered globally, one for Claude (threshold: 3 failures, 30-second reset), one for Hunter (threshold: 5, 60-second reset), one for Seamless (threshold: 5, 120-second reset). A health endpoint exposed all breaker states. When Hunter rate-limited the system two weeks later, the breaker opened, the pipeline fell back to Facebook scraping, and nothing shipped a partial batch. This is what hardening buys.
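The state machine described above fits in a few dozen lines. This is an illustrative reduction, not CODA's shared abstraction; the class and method names are mine, and the defaults mirror the Claude breaker's quoted settings (3 failures, 30-second reset).

```javascript
// Minimal circuit breaker sketch: closed -> open after N consecutive
// failures, half-open once the reset window elapses, closed again on the
// next success.
class CircuitBreaker {
  constructor(name, { threshold = 3, resetMs = 30_000 } = {}) {
    this.name = name;
    this.threshold = threshold;
    this.resetMs = resetMs;
    this.failures = 0;
    this.openedAt = null;
  }
  state(now = Date.now()) {
    if (this.openedAt === null) return 'closed';
    return now - this.openedAt >= this.resetMs ? 'half-open' : 'open';
  }
  async exec(fn, now = Date.now()) {
    if (this.state(now) === 'open') throw new Error(`${this.name}: breaker open`);
    try {
      const result = await fn();
      this.failures = 0;
      this.openedAt = null; // a success in half-open closes the breaker
      return result;
    } catch (err) {
      this.failures += 1;
      if (this.failures >= this.threshold) this.openedAt = now;
      throw err;
    }
  }
}
```

The Hunter fallback behavior described above corresponds to catching the "breaker open" error at the call site and routing to the next source instead of shipping a partial batch.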

The structured logging pattern: every service got a Pino child logger with its module name and context fields. No more console.log. Production logs became searchable, filterable, and reproducible. When the cron duplicate bug surfaced in April, the diagnosis took minutes rather than hours because the logs were structured enough to correlate across replicas.
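The child-logger idea can be shown without the library. This is an illustrative reduction of what Pino provides for real, assuming nothing about CODA's actual logger setup: each service binds its module name and context fields once, and every subsequent line it emits is a structured JSON record carrying those fields, which is what makes cross-replica correlation fast.

```javascript
// Illustrative child-logger pattern. Pino is the production implementation;
// this sketch only demonstrates why bound context fields make logs
// searchable and correlatable.
function createLogger(base = {}) {
  return {
    // child() returns a logger with extra fields permanently bound
    child(fields) { return createLogger({ ...base, ...fields }); },
    // every emitted line is a JSON record: level + bound fields + message
    info(msg, fields = {}) {
      return JSON.stringify({ level: 'info', ...base, ...fields, msg });
    },
  };
}
```

With the real library the call shape is similar (`pino().child({ module: 'scheduler' })`), and a query like "all lines where module=scheduler and repId=7" replaces grepping free-text console output.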

Neither pattern was expensive to install. Pino took one commit. Circuit breakers took two. The value shows up much later, in eras that would not have been survivable without them.

320 calls per rep per week

06. Era 6 through Era 8: product depth

The next two weeks added call scoring, municipal event intelligence, batch enrichment, sheet reverse-sync, weekly intelligence digests, AI talk-track generation inside the lead drawer, and dozens of smaller improvements. The system stopped being a lead discovery tool and became a full sales operating layer.

The call scoring pipeline is worth pulling out as a flagship moment, because it demonstrates the strategy layer doing what it does best: catching a second-order requirement that changes the build.

The context: the reps make about 320 calls per week. Their dialer vendor offered AI-powered call scorecards as a premium add-on. I decided to build an equivalent internally. The strategy layer designed a three-stage pipeline: fetch the calls from the dialer, transcribe them, score the transcripts against a twelve-point rubric using Claude. Estimated all-in cost at our volume: roughly $20 per month.

The first version of the spec used Whisper for transcription. Whisper is the default choice. It is accurate, cheap, and well-documented. It would have worked.

During spec review, the strategy layer caught something. The twelve-point rubric included criteria like "talk-to-listen balance" and "objection handling under pressure." Those criteria cannot be scored from a monolithic transcript. They require knowing who said what. That is speaker diarization, the technical term for labeling which voice belongs to which speaker, and Whisper did not offer it at the time.

The transcription service was switched from OpenAI Whisper to AssemblyAI Universal-2 specifically because AssemblyAI provides speaker diarization (labeling rep vs. prospect turns), which meaningfully improves scoring accuracy on criteria like talk-to-listen balance and objection handling. The primary goal was transcription, but the scoring quality depended on a feature only one vendor offered.

This is the strategy layer doing its job correctly. The primary requirement (transcription) was clear. The secondary requirement (speaker labeling) only became visible when the transcription step got mentally connected to the scoring step. A less rigorous planning process would have shipped Whisper, produced transcripts without speaker labels, and discovered the problem only when the scorecard output was unusable. Catching it at spec time added zero cost.
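The dependency is easy to make concrete. Diarization-capable APIs return utterances labeled by speaker with start/end timestamps; the utterance shape below is illustrative rather than a guaranteed vendor schema, but a criterion like talk-to-listen balance is only computable from data in this form. A monolithic transcript has no equivalent input.

```javascript
// Sketch: computing talk-to-listen balance from diarized utterances.
// Each utterance carries a speaker label and start/end times in ms.
function talkToListenRatio(utterances, repSpeaker) {
  let rep = 0, other = 0;
  for (const u of utterances) {
    const dur = u.end - u.start;
    if (u.speaker === repSpeaker) rep += dur;
    else other += dur;
  }
  // Infinity signals a monologue: the rep spoke and nobody answered.
  return other === 0 ? Infinity : rep / other;
}
```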

A second catch in the same spec: the twelve-point rubric originally counted "license closed on call" toward the composite score. During review, the strategy layer flagged this. Most sales calls do not close on the call. Including "license closed" in the composite would penalize reps for normal outcomes. The fix was to score the criterion but exclude it from the weighted composite. That decision happened before any code was written. If it had shipped the other way, every rep's composite score would have been artificially suppressed, the scorecards would have lost credibility, and the feature would have been quietly abandoned.

Both moments are small. Both saved weeks of rework. Both are the kind of judgment that a specification layer exists to provide.
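The composite-exclusion decision can be sketched in a few lines. Criterion names, weights, and the function shape below are illustrative, not CODA's scoring code; the point is that an excluded criterion is still scored and reported, it simply contributes nothing to the weighted average.

```javascript
// Sketch: weighted composite that excludes outcome criteria (like
// "license closed on call") from the average without hiding their scores.
function compositeScore(scores, weights, excluded = new Set(['license_closed'])) {
  let total = 0, weightSum = 0;
  for (const [criterion, score] of Object.entries(scores)) {
    if (excluded.has(criterion)) continue; // scored, but not averaged in
    const w = weights[criterion] ?? 1;
    total += score * w;
    weightSum += w;
  }
  return weightSum === 0 ? 0 : total / weightSum;
}
```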

The implementation of the call scoring pipeline landed in a burst: the routes, services, migration, and frontend view all shipped over four days in early March.

07. Era 9: the fabrication

March 17 through March 20. The era of the rate-card fix. This is the clearest example in the build of the implementation layer catching the strategy layer in an act of confabulation, and it is worth telling in full.

The context: a routine investigation into four garbage leads that had reached a rep surfaced test failures in the fee calculation module. Specifically, the live-event classification fee tests were failing. I asked the strategy layer to diagnose.

The strategy layer pulled up the live rate card for the live-event classification. The rate card is published as a PNG image on the client's website. The strategy layer could not read the image directly. Instead of stopping and saying "I cannot see the rate card, please read me the values," the strategy layer produced a table of factor values and presented them as authoritative.

The values were wrong. They had been derived circularly: the strategy layer had noticed the in-code values were roughly 5% higher than expected, hypothesized that a prior commit had applied a 5% inflation, and divided the code values by 1.05 to produce "the real values." Then it presented those derived numbers as the live rate card truth and recommended reverting the code to match them.

The implementation layer caught it. Rather than acting on unverified numbers, it flagged the contradiction: the strategy layer had claimed it could not read the image, then produced values from that same image. The implementation layer insisted on grounding against the PDF of the rate schedule that lived in the repository. When I supplied the actual rate card image as a direct upload, the 5% theory was confirmed. But the correct values now came from a real source, not from the strategy layer's circular derivation.

The root cause of the underlying bug was separate and itself revealing. A prior Claude Code session, months earlier, had drafted a planning document that applied a speculative 5% CPI escalation to the already-published current-year rates. The current-year rate card already included the CPI adjustment. Applying 5% on top produced a speculative next-year rate card eight months early. That document had been used as the source for a commit that shipped into production, and live-event fees had been running 5% high for thirty days before the test failures surfaced the problem.

The fix took two commits. The first reverted the live-event rates to match the PDF and the live rate card. The second added a new audit script, auditFeesAgainstPdfs.js, which parses every fee schedule PDF in the repo and compares the extracted values against the hardcoded rates in the code. It runs manually, report-only, not CI-gated. It exists specifically to catch this class of drift without relying on coincidental discovery.
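The audit's comparison step reduces to a diff between two rate tables. The sketch below mirrors the report-only shape described for auditFeesAgainstPdfs.js but is not its actual code; PDF parsing is elided, and the factor names in the example are invented. The important design choice is visible: it collects every mismatch rather than failing on the first, so a single run surfaces the full extent of drift.

```javascript
// Sketch of the drift audit: compare rates extracted from the published
// PDFs against the rates hardcoded in the application, report-only.
function auditRates(pdfRates, codeRates) {
  const drift = [];
  for (const [factor, pdfValue] of Object.entries(pdfRates)) {
    const codeValue = codeRates[factor];
    if (codeValue !== pdfValue) drift.push({ factor, pdfValue, codeValue });
  }
  return drift; // empty array means the code matches the published schedule
}
```

A 5% inflation like the one that shipped would show up here as every live-event factor drifting by the same ratio, which is exactly the signature that took thirty days to find by accident.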

What the three-layer loop produced in that moment is worth quoting directly. This is what Claude sent me in the chat after Claude Code surfaced the contradiction:

I presented a guess as ground truth and recommended you revert production code to numbers I'd derived circularly. If you'd just run my prompt without Claude Code's epistemic check, you'd have shipped my guesses. Two lessons: I should have stopped when the PNG didn't load and said "I need you to read the rate card values to me" rather than invent them. And Claude Code's pushback pattern is exactly the behavior you want more of. It spotted a premise contradiction, flagged it explicitly, and proposed grounding against the in-repo PDF before taking destructive action. That's senior-engineer behavior.

This is the loop working as designed. The strategy layer fabricated. The implementation layer caught the fabrication. I supplied the authoritative source. A bad commit was prevented. A permanent audit script was added to prevent the class of bug from recurring.

The implication for anyone considering this pattern is not that Claude is unreliable. It is that any single layer of AI output is unreliable in specific, predictable ways. The strategy layer hallucinates confidently when under pressure to produce a numerical answer. The implementation layer is literal-minded in ways that make it stop when premises contradict each other. The human supplies the thing neither layer has: a direct connection to reality. Remove any one of the three, and the system produces worse outcomes.

The case for the loop is not that it is faster than hiring engineers. It is that the combination of the three layers catches a class of AI failure that a single prompt-and-accept loop does not. Every model output gets a check. Every proposed change gets grounded. Every destructive action gets approved before it ships.

08. Era 10: the Performances tab

March 23 through March 26. An aside about what vertical slices look like in this kind of build.

The CEO sent me a screenshot. It showed a social media post announcing a live performance at a venue, featuring a performer whose work the client represents. The question was whether we could track these, whether we could know when represented performers appeared at specific venues, so the data could be used in licensing conversations.

The request arrived on a Monday. By Monday evening, the commit log shows a single commit introducing a fully-formed Performances tab. Frontend view, backend route, three services, a new database migration. Screenshot ingestion via Claude vision extraction. Google Sheets mirroring. Four database fields per performance. The whole vertical slice landed in one commit.

Within the same week, follow-ups added: full lead enrichment pipeline integration (each extracted venue now runs through the real contact cascade), Instagram scraping via Apify for automatic tour-date extraction, national concert promoter affiliation detection with a full subsidiary tree, setlist.fm integration for setlist data, municipality targeting, promoter tracking, and a second Google Sheet for the extracted data.

The pattern is different from Analytics, which grew incrementally over eight distinct thematic phases across two months. Performances was a vertical slice: everything needed to make one feature work, shipped in a single cohesive unit, and then refined in the days that followed. Both patterns are legitimate. Some features benefit from landing whole; others benefit from accretion. The loop accommodates both.

The CEO's feedback on the initial Performances tab was substantial. Six features were batched into a single commit three days later based on his review: multi-day event support (a music festival is one performance across three dates, not three performances), separating represented performers from other performers on the bill, promoter tracking, municipality flagging, setlist integration, and a significant schema expansion to support all of it. The commit touches everything: route, all three services, both frontend views, a new migration.

This is what happens when a specification meets production feedback from a non-technical stakeholder. The CEO was not proposing code changes. He was describing how the feature needed to work to be useful to him. The strategy layer translated that feedback into a concrete specification. The implementation layer translated the specification into the coordinated six-file change. I reviewed and approved. Three layers, each doing the work only it can do.

f710d12: the Deal Assist overhaul commit

09. Era 11: Deal Assist and the dangerous prompt

March 26 through March 27. Deal Assist is a feature that takes a Pipedrive deal URL and produces a drafted reply email that incorporates the deal history, the prospect's objections, and the client's licensing arguments. It was called Deal Triage in its first commit; the rename to Deal Assist happened the same day.

Three weeks after launch, two reps flagged problems independently. One rep reported that a generated draft had offered the prospect a monthly payment plan that the client does not offer. The client's licenses are annual. There is no installment option. The second rep reported that a specific legal argument was appearing in every draft regardless of whether it was the right argument for the objection in front of him.

The strategy layer's first instinct was to investigate what the model was doing wrong. The implementation layer audited the Deal Assist prompt and found something worse. The prompt was not failing to restrain the model. The prompt was explicitly instructing the model to offer payment plans.

Claude Code identified that the existing system prompt contained positive instructions directly contradicting the intended constraints, including explicit directives to offer payment plans and lead with multi-location discounts. The prompt was a fossil of a prior policy discussion. Someone had written "offer multi-year or payment plans" at line 240 at some point believing it was correct policy. The model wasn't hallucinating; it was following instructions.

This is one of the subtler failure modes of AI-assisted development. The prompt had been written during an earlier conversation, under assumptions that had since changed. No one had revisited it. The assumptions had become invisible, living only as text inside a file that nobody read unless they were debugging a specific problem. The model followed the instructions because the model is supposed to follow instructions. The instructions were wrong.

The fix was a three-phase prompt overhaul. A "NEVER INCLUDE" constraint block that enumerates things the draft must not propose. An objection classification step that routes the draft-generation logic based on what the prospect actually said, rather than defaulting to a one-size argument. A three-check self-review at the end that verifies the draft against the constraints before returning it. A Commercial Terms knowledge base entry establishing that the client's licenses are annual only, with no installments and no rep-level discounts, and any non-standard requests escalate to the licensing operations team.

A four-test regression suite was written to lock the new behavior in. All four tests passed after a self-check leak was fixed. The commit shipped as f710d12 to main and auto-deployed to Railway.
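The self-review idea also has a code-level analogue worth sketching. In the shipped overhaul the check is a model self-review inside the prompt; a string-level guard like the one below is a complementary backstop, not CODA's implementation, and the forbidden phrases are examples drawn from the incident, not the real constraint list.

```javascript
// Illustrative "NEVER INCLUDE" backstop: scan a generated draft for
// phrases the constraint block forbids before returning it to the rep.
const FORBIDDEN = [
  /payment plan/i,          // licenses are annual; no installments exist
  /monthly installment/i,
  /multi-location discount/i, // reps cannot offer discounts
];

function constraintViolations(draft) {
  // Return the source of every forbidden pattern the draft matches.
  return FORBIDDEN.filter((re) => re.test(draft)).map((re) => re.source);
}
```

A guard like this would have flagged the original bug's output on the first generation, even while the fossilized prompt was still instructing the model to offer payment plans.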

The loop-level lesson: the implementation layer's audit was the critical step. The strategy layer would have proposed a fix without ever checking whether the existing prompt was the source of the problem. I would not have known to look inside a prompt file for positive instructions that contradicted the desired behavior. Only the implementation layer, grounded in the actual code, surfaced the real bug.

A smaller pattern worth noticing: both reps who flagged the problem got a direct follow-up when the fix shipped. Each was told that their report had triggered the change. This is operational, not technical, but it is the reason the reps keep flagging problems. If users see that their reports produce changes, they continue to report. If they see their reports disappear into a void, they stop.

10. Era 12 through Era 14: the final weeks

April opened with an email-lifecycle automation layer. Auto-detect bounced emails on Pipedrive deals. Auto-detect prospect replies to sequence emails. A "Replies Waiting on You" dashboard card surfacing deals where a prospect had responded and the rep had not followed up.

The loop also handles diagnostic work, not just feature builds. One engagement inside this era produced a pipeline health audit where the system-of-record's reported pipeline size turned out to be materially different from the workable pipeline, a finding that changed downstream forecasting math. The finding itself is the client's to share. What the case study can share is the methodology.

I built a series of analysis scripts that queried Pipedrive directly, classified deals by activity recency and health signals (bounced email status, do-not-contact flags, last-touch dates), and classified prospect replies into a six-category taxonomy: objection, logistical, question, refusal, interest, unclear. The "interest" category is the one that matters most in this kind of audit. It surfaces deals where a prospect explicitly asked to talk or asked a substantive question about the product, and the rep did not reply. Those are warm-pipeline losses that look like nothing in the CRM. The taxonomy makes them visible.
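The surfacing step after the classification pass is mechanically simple, which is the point: once Claude has labeled each reply into the taxonomy, the warm-pipeline losses fall out of a filter. The reply shape below is my assumption for illustration, not the audit scripts' actual schema.

```javascript
// The six-category reply taxonomy from the pipeline health audit.
const REPLY_CATEGORIES = ['objection', 'logistical', 'question', 'refusal', 'interest', 'unclear'];

// Surface warm-pipeline losses: replies classified as "interest" that the
// rep never answered. These look like nothing in the CRM.
function unansweredInterest(replies) {
  return replies.filter((r) => r.category === 'interest' && !r.repReplied);
}
```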

The output was a formatted Word document summarizing the findings by segment, cross-tabbed by rep, with recommended remediations. The deliverable took a day. The analysis framing was the load-bearing work, not the code, which was trivial: a few SQL-shaped queries against the Pipedrive API, a categorization pass run through Claude, a Word template.

This is strategy work, not engineering work. The value is in the framing: segmenting zombie deals from workable pipeline, surfacing the subset of unanswered replies that represented real interest, and presenting the findings in a way that allowed leadership to correct the reporting without it reading as an accusation. The strategy layer is what produced that framing. The implementation layer handled the plumbing. I made the calls about which cuts of the data would land with which audience.

The finding changed the company's reporting practices for the next planning cycle. What matters for the case study is that diagnostic work like this is part of what the three-layer loop produces. Not every engagement is a system build. Some are a day of audits that change how the business sees itself.

April also saw the two flagship bugs of the final weeks. Both demonstrate the loop functioning at its best.

The cron duplicate bug. On April 17, a rep received ten leads in her sheet when she was expecting five. Investigation found that her pipeline had run twice, six seconds apart, producing duplicate rows for every lead. The root cause was Railway's rolling deploy mechanism: during a deploy, the old replica is kept alive until the new replica's health check passes. If a cron tick fires during the overlap window, both replicas execute it. Neither replica knows about the other.

The fix was layered. A new database migration added a partial unique index on the run log table, specifically, a unique constraint on the combination of rep ID and run date, but only for rows with status 'running'. This created a database-level mutex. Only one replica could claim a run slot. The other would fail its insert and bail. A shared run-lock service was added to enforce the claim atomically. An in-process Set was added to the scheduler as a same-instance safeguard.

Belt and suspenders. The database lock catches cross-instance duplicates. The in-process Set catches hypothetical same-instance double-firing. Neither is redundant; each covers a failure mode the other cannot.
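The dual-guard pattern can be sketched in a few lines. This is an illustrative reconstruction, not the production code: the real cross-instance guard is a Postgres partial unique index (shown in the comment), which a Map stands in for here so the logic is runnable; names like RunLockService and onCronTick are assumptions.

```typescript
// Sketch of the dual-guard pattern. In production, tryClaim is backed by a
// Postgres partial unique index along the lines of:
//   CREATE UNIQUE INDEX uniq_running_run
//     ON run_log (rep_id, run_date) WHERE status = 'running';
// Here an in-memory Map stands in for the database so the sketch runs.

class RunLockService {
  private claimed = new Map<string, "running" | "done">();

  // Atomic claim: succeeds only if no 'running' row exists for this key.
  tryClaim(repId: string, runDate: string): boolean {
    const key = `${repId}:${runDate}`;
    if (this.claimed.get(key) === "running") return false; // other replica won
    this.claimed.set(key, "running");
    return true;
  }

  release(repId: string, runDate: string): void {
    this.claimed.set(`${repId}:${runDate}`, "done");
  }
}

// Same-instance safeguard: an in-process Set catches a hypothetical
// double-fire inside one replica before it ever reaches the database.
const inFlight = new Set<string>();

function onCronTick(locks: RunLockService, repId: string, runDate: string): boolean {
  const key = `${repId}:${runDate}`;
  if (inFlight.has(key)) return false;               // layer 1: same instance
  if (!locks.tryClaim(repId, runDate)) return false; // layer 2: cross instance
  inFlight.add(key);
  try {
    // ... run the rep's pipeline here ...
    return true;
  } finally {
    inFlight.delete(key);
    locks.release(repId, runDate);
  }
}
```

The division of labor matches the failure modes: the database index is the only guard two replicas can both see, while the Set is cheap insurance against a scheduler bug inside one process.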

During the investigation, the implementation layer found a latent bug in the Pipedrive integration. A function was updating deal URLs using a case-insensitive name match rather than a primary-key match. That mismatch explained why one rep's duplicate rows both received the same Pipedrive URL while another rep's duplicates had the URL on only one row. The latent bug got filed to a deferred-issues document rather than folded into the current commit. Separate atomic commits are easier to revert in isolation.

The verification discipline around this fix is worth noting. The commit was not pushed immediately. Deploy timing was checked against the next cron tick. Railway's health check and graceful shutdown window can cross cron boundaries, and pushing during that window would have defeated the fix. The commit was held until a clean window arrived, deployed, and verified at the next 21:00 UTC tick to confirm only one replica fired. Data cleanup for the affected reps was deferred until after that verification. A reminder was set for the following morning to verify the 6 AM PT scheduled cron worked correctly.

Don't clean the symptom until the cause is fixed; otherwise a deploy at xx:55 could re-double the data.
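The xx:55 caveat reduces to simple clock arithmetic. A hedged sketch, assuming a cron that fires at a fixed minute of the hour and a worst-case replica-overlap duration; the real check was done by reading Railway's deploy logs against the cron schedule, not by code like this:

```typescript
// Decide whether a deploy is safe to push, given that the old and new
// replicas can overlap for up to `overlapMinutes` during a rolling deploy.
// Minute-of-hour arithmetic and the parameter names are illustrative.

function isSafeDeployWindow(
  nowMin: number,         // current minute within the hour (0-59)
  cronMin: number,        // minute the cron tick fires (e.g. 0 for 21:00)
  overlapMinutes: number  // worst-case replica overlap during a deploy
): boolean {
  // Minutes until the next tick, wrapping around the hour.
  const untilTick = (cronMin - nowMin + 60) % 60;
  // Unsafe if the overlap window could still be open when the tick fires.
  return untilTick === 0 ? false : untilTick > overlapMinutes;
}
```

With a ten-minute overlap, a push at xx:55 before a top-of-hour tick fails the check, which is exactly the window that would have re-doubled the data.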

The classification bug. Three days later. A rep flagged that country clubs on his sheet were having their names truncated: "Ridgemont Country Club" was appearing as "Ridgemont Country," losing the "Club" suffix. That fix was straightforward: an audit against 1,182 recycled-pool samples showed which name-stripping rules fired destructively, a narrowed rule list was deployed, and 34 affected rows were backfilled with a one-shot script.
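The shape of a destructive stripping rule is easy to show. The rules below are invented for illustration, not the audited production list; the point is that a suffix list tuned for corporate noise becomes destructive the moment a real name token like "Club" lands in it:

```typescript
// Illustrative reconstruction of the truncation bug, with invented rules.
// The destructive rule strips trailing tokens too aggressively and eats
// "Club" off "Ridgemont Country Club"; the narrowed rule strips only
// tokens that are genuinely noise for matching.

const DESTRUCTIVE_SUFFIXES = /\s+(LLC|Inc\.?|Club|Bar)$/i; // too broad
const NARROWED_SUFFIXES = /\s+(LLC|Inc\.?)$/i;             // corporate noise only

function stripName(name: string, rule: RegExp): string {
  return name.replace(rule, "").trim();
}
```

An audit against real rows is what separates the two rule sets; nothing about the regexes themselves announces which one is destructive.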

During the same investigation, the rep also mentioned that some country clubs were being classified under the venue classification when they should have been under the membership classification. The venue classification carries lower fees, so misclassification meant the venues were being quoted incorrect prices.

The strategy layer's initial proposal was a name-based hard rule: if a venue's name contains "Country Club," force-classify it as a membership venue. The implementation layer ran the audit before implementing and found the hard rule would misclassify real cases in the other direction: "The Country Club" at one Chicago address is actually a bar, Harbor Yacht Club is a bar, Renegade's Country Club is a bar. Fifteen of the twenty-five club-named leads in the database were correctly classified under the venue classification because they were themed bars that happened to have "Country Club" in the name.

The implementation layer pulled the classification reasoning for the ten real misclassified cases, the ones that were actual membership clubs being labeled under the venue classification, and found the actual bug. The reasoning strings were startling:

"This meets the membership-classification suppression criterion." "Private membership clubs are excluded from independent licensing under this classification rule, as members' dues typically include blanket licensing coverage through their membership organization's group license."

The model was correctly identifying the venues as membership clubs. Then it was suppressing that classification, believing that membership clubs were exempt from licensing. This is an artifact of training data. The model had learned, generically, that some industries treat certain membership clubs differently. For the client's business model, this is wrong. The membership classification is a paid license type, not a suppression category. The prompt was not explicit enough to override the model's general-knowledge assumption, so the model's prior won.

The fix rewrote the membership-classification rule in the classification prompt explicitly, framing it as a license type rather than a suppression and rejecting the specific reasoning patterns the model had been using. A spot-check against four real inputs, run before the prompt change deployed, returned correct classifications, and the reasoning strings quoted the prompt's rule language back rather than the model's training-data priors. That is the signature of prompt anchoring working correctly: the model is not improvising; it is executing the rule it has been given.

Ten affected rows were backfilled with a script that flipped their classifications, nulled their fee estimates (the membership classification requires member-count and operating-expense data the system has no automated access to), and added notes to the open Pipedrive deals flagging the fees for manual re-pricing. There was zero production urgency; an activity audit confirmed none of the ten deals had yet had a rep conversation attached. The fix still shipped within two days of the original flag.
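The backfill's shape can be sketched over in-memory rows. Field names like classification, feeEstimate, and dealNote are assumptions standing in for the real schema; the three-part fix is the point: flip the class, null the fee, flag the deal.

```typescript
// Sketch of the one-shot backfill, run here over in-memory rows rather
// than the real tables. Field names are illustrative assumptions.

interface LeadRow {
  id: number;
  classification: "venue" | "membership";
  feeEstimate: number | null;
  dealNote: string | null;
}

function backfillMisclassified(rows: LeadRow[], affectedIds: Set<number>): LeadRow[] {
  return rows.map((row) =>
    affectedIds.has(row.id)
      ? {
          ...row,
          classification: "membership",
          // Fee needs member count + operating expenses the system can't fetch.
          feeEstimate: null,
          dealNote: "Reclassified to membership; fee requires manual re-pricing.",
        }
      : row
  );
}
```

Returning new rows rather than mutating in place makes the script trivially dry-runnable: diff the output against the input before applying anything.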

11 What the loop actually looks like

Sixty-four days. 442 commits. Forty-three migrations. Ten external API integrations. One operator.

Patterns that show up across every era:

Audit before fix. The classification fix produced an initial proposal that a deeper audit invalidated. The rate-card fix produced a fabricated rate table that grounding against the PDF invalidated. Every flagship moment in the build history has an audit in the middle of it, and in every case the audit changed the fix.

This is the most replicable piece of the pattern for operators considering similar work. The strategy layer is good at proposing solutions quickly. It is bad at knowing when its proposal is wrong. The implementation layer is good at grounding proposals against production data. A default workflow that moves from "propose" directly to "implement" will ship wrong solutions regularly. A workflow that always inserts "audit" between "propose" and "implement" catches the wrong solutions before they become code.

Ground against authoritative sources. When a numerical or classification decision has an authoritative source (a PDF fee schedule, a database row, a reasoning string from a prior model output), the implementation layer grounds against that source rather than accepting the strategy layer's claim. The rate-card fix caught a fabricated table this way. The classification fix found the actual model reasoning by pulling real examples from the database. The cron duplicate bug was diagnosed by reading Railway's deploy logs rather than theorizing about the failure mode.

Separate commits for cause and tooling. The rate-card fix shipped as two commits: the correction and the audit script. The cron fix filed a latent bug to the deferred list rather than folding it into the same commit. The name-stripping fix added a raw name column as a separate commit from the stripping rule change. Atomic commits can be reverted independently. Coupled commits cannot.

Verification gates everywhere. Dry-run before apply. Health check before backfill. Manual cron-run verification before data cleanup. Log-watch at the first real execution after a scheduler change. Spot-check against production inputs before a prompt change ships. The build history contains no exception to this pattern.

Close the loop with the humans who flagged problems. Reps who flag bugs get follow-up messages when the fix ships. Rep feedback that reshuffles a fix mid-flight gets explicitly acknowledged in the revised approach. This is operational, not technical, but it is the reason the feedback keeps coming. The loop includes the humans who use the system, not just the humans who build it.

12 The productization question

This document is published by Outblox. The question it anticipates is whether the pattern above can be replicated for other companies. The honest answer has three parts, and I will give them directly.

The pattern generalizes, but it is not domain-free. The three-layer loop is not specific to licensing. It works in any domain where a non-technical operator has deep business context, access to production data, and the judgment to know when a proposed fix is wrong. What does not generalize is the operator. CODA got built because I had spent three years embedded in the client's business before I wrote the first line of the specification. I knew how the reps talked about their prospects, which classifications they fought about, which objections came up in which order, and which parts of the workflow nobody would defend if challenged. That context is what made the specification complete enough to support a one-pass initial commit. Without it, the initial commit would have been the first of many re-specifications, each of which would have drifted further from what the business actually needed.

The operator matters more than the tools. When I read other write-ups of AI-assisted development, the thing that usually gets overstated is the AI. The thing that usually gets understated is the person in the middle. CODA works because I made thousands of small calls that neither Claude instance could have made: which features shipped first, which bugs got filed to a deferred list versus fixed immediately, which rep feedback got absorbed and which got politely deferred, which prompt changes were safe to deploy and which needed a regression suite first. None of those calls are hard individually. In aggregate they are the entire system. Any productized version of this pattern needs to pair with a senior operator inside the client company who can make them. If the client does not have that person, the engagement produces worse outcomes, regardless of how good the tools are.

The ideal client profile is narrower than it looks. The client whose engagement produced CODA had: a well-defined sales motion, a large prospect universe that benefits from automated discovery, a CRM already in place, and operational leadership that could make decisions quickly when the loop surfaced them. The rough translation: field sales companies, 50 to 500 employees, no internal engineering team, selling a defined product to a large universe of commercial locations, with an operator or senior operations role available to embed in the build. Distributors fit. Commercial service providers fit. Equipment lessors fit. Regional franchise operators fit. Enterprise software sales to a small number of high-ACV accounts does not fit, because the universe is too small for the discovery layer to matter. Consumer e-commerce does not fit, because there is no account to enrich. Agency account management does not fit, because the relationships are the product.

If the fit is right, what Outblox delivers is what CODA delivered for this client. A full operational layer, designed for one business, built in weeks rather than quarters, at a cost that is a fraction of a traditional engineering hire. If the fit is wrong, the pattern does not help, and I will tell you so on a first call rather than take an engagement that will disappoint both of us.

13 What I want you to take from this

The failures in this document are there on purpose. The fabricated rate table, the prompt fossil instructing the model to offer non-existent payment plans, the classification model suppressing its own correct answer. None of those are embarrassments I included reluctantly. They are the reason the case study is worth reading. Any AI-assisted build at this scale produces failures of that shape. The difference between a working system and a broken one is whether those failures get caught before they ship.

The three-layer loop catches these failures because each layer has a different relationship to truth. Strategy proposes. Implementation grounds what gets proposed against the code in the repository and the rows in the database. I arbitrate between them with domain knowledge neither layer has. Remove any one of the three and the system collapses into something with a single author's failure modes, which is what most AI-assisted builds are.

If you run a sales operation that would benefit from software nobody has built for you yet, the question is not whether AI is good enough to build it. It is whether you have someone who can sit in the middle of a loop like this and make the calls that neither AI instance can. If you do, the engagement works. If you don't, no amount of tooling will substitute.

Outblox takes engagements where the fit is right and declines the ones where it isn't. If you think your business might fit the profile above, the first conversation is thirty minutes and it costs nothing. I'll read your sales motion, ask three or four questions about your operator layer, and tell you whether the pattern applies. If the answer is no, I'll say so on that call and point you at what would actually help. The case study you just read is the longer version of that same posture.

demo.outblox.com shows what the front end of a live system can look like when it's built for the shape of one business. If you want to see the pattern before you talk to me, start there.


This case study is based on the commit history of the CODA repository, conversation logs from the build, and my working notes across the February to April 2026 engagement. Client-identifying details have been abstracted. Technical details have not been simplified. The chat excerpts are verbatim structurally, with identifying specifics removed. Commit hashes are real.