Wednesday, April 8, 2026

Feedback Flywheel


Teams have always had mechanisms for collective learning. Retrospectives,
post-incident reviews, lunch-and-learns. The best of these share a property:
they convert individual experience into shared practice. What one person
encountered in a debugging session or a production incident becomes
something the whole team knows. The knowledge escapes the individual and
enters the team’s infrastructure: its wikis, its runbooks, its code review
checklists.

With AI coding assistants, most teams reach a plateau. They adopt the
tools, develop some fluency, and then stay there. The same prompting habits,
the same frustrations, the same results month after month. Not because the
tools stop improving, but because the team’s practices around the tools
stop improving. There is no mechanism for compounding what works. Each
developer accumulates individual intuition (useful phrasings, effective
workflows, hard-won understanding of what the AI handles well and what it
does not) but that intuition remains personal. It does not transfer.

The infrastructure I have described in earlier articles (Knowledge Priming, Design-First Collaboration, Context Anchoring, and Encoding Team Standards) is not a collection of static artifacts. It is a set of surfaces that can absorb learning.
The missing piece is the practice of feeding learnings back in: a feedback
loop that turns each interaction into an opportunity to improve the next
one.

The Compounding Problem

My impression is that teams adopting AI tools at roughly the same time
can arrive at very different places six months later. The difference often
lies less in talent or tooling than in whether they have a practice of
capturing what worked.

Without a learning system, AI effectiveness flatlines. The team uses
the tools. The tools are useful. But the way the team uses them does not
evolve. The same gaps in the priming document cause the same corrections.
The same ambiguous instructions produce the same mediocre outputs. The
same failure patterns recur without anyone connecting the dots. What is
missing is not effort — it is a mechanism for the effort to
accumulate.

These artifacts create surfaces for learning. But surfaces alone are
passive. A priming document does not update itself when the AI defaults to
a deprecated API. A review command does not add a new check when a
category of bug slips through. They need an active practice of feeding
learnings back in.

Consider what a single session can look like when the loop is in place.
A developer uses a generation instruction to implement a new service
endpoint. A review instruction then runs on the output — and flags a
missing authorization check, exactly the kind of oversight the generation
instruction did not explicitly require. The developer fixes the issue and,
before closing the session, adds one line to the team’s learning log:
Authorization checks on new endpoints not enforced by generation
instruction.
That file lives in the repository and is already part of the
priming context for subsequent sessions. The next developer to implement
an endpoint benefits from that observation without knowing the exchange
happened; the authorization check is now part of what the AI verifies from
the first pass. The generation instruction did not change. The priming
context changed. The system learned. That is the flywheel: each rotation
of the loop leaves the infrastructure a little better prepared for the
next.
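A minimal sketch of what such a learning log might look like. The filename, the include mechanism, and the entry format are all illustrative assumptions, not a prescribed convention; what matters is that the file lives in the repository and is pulled into the priming context:

```markdown
<!-- LEARNINGS.md — illustrative; assumed to be referenced from the team's
     priming context so every session sees it -->
## Recent learnings
- Authorization checks on new endpoints not enforced by generation
  instruction. Verify every new endpoint declares its auth requirement.
- Pagination defaults differ between list endpoints; state the expected
  page size explicitly when asking for a new one.
```

Because the log is part of the priming context, an entry takes effect in the very next session, before anyone decides whether it deserves a permanent home in a command or priming document.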

Commands evolve the same way: a review command that misses something is a
command waiting to be updated. The same principle applies to every
artifact in the team’s AI infrastructure: each should evolve based on what
the team observes in practice. The question is how to make that evolution
systematic rather than accidental.

The update itself can happen in different ways. Sometimes a developer
edits the shared artifact directly, especially when the change requires
judgment or careful wording. In other cases, an agent can draft or apply
the update as part of the workflow, with a developer reviewing it before
it becomes part of the team’s shared context. I would not make one
mechanism mandatory. What matters is that the learning is captured,
validated, and fed back into the artifacts the team actually uses.

Four Types of Signal

AI interactions generate signal: information about what the team’s
artifacts capture well and what they miss. I find it useful to categorize
this signal into four types, each mapping to a specific destination in the
infrastructure.

Context signal. What the AI needed to know but did not: gaps in the
priming document, missing conventions, outdated version numbers. Each
correction a developer makes is a signal that the priming document is
incomplete. When the AI keeps using the deprecated Prisma 4.x API, that is
not a model failure; it is a priming gap. The version note is missing, so
the AI defaults to its training data. Every “no, we do it this way” is a
line that belongs in the priming document but is not there yet.

Instruction signal. Prompts and phrasings that produced notably
good or bad results. When a particular way of framing a request
consistently yields better output (a specific constraint that prevents the
AI from jumping ahead, a decomposition that produces cleaner architecture)
that phrasing belongs in a shared command, not in one developer’s head.
Instruction signal is the difference between personal fluency and team
capability. As long as it stays personal, the team’s effectiveness depends
on who happens to be prompting.

Workflow signal. Sequences of interaction that succeeded:
conversation structures, task decomposition approaches, workflows that
reliably produced good outcomes. These are the team’s emerging playbooks.
A developer who discovers that designing API contracts before
implementation consistently produces better results has found a workflow
pattern. A developer who finds that asking the AI to critique its own
output before proceeding catches issues earlier has found another. These
workflow patterns, once identified, are transferable, but only if someone
captures them.

Failure signal. Where the AI produced something wrong, and why.
The root cause matters more than the symptom. A failure caused by missing
context is a priming gap. A failure caused by poor instruction is a
command gap. A failure caused by a model limitation is a boundary to
document. With root-cause thinking, each failure points to a specific
artifact that can be improved. Consider a developer asking the AI to
generate a domain model. The output compiles — but on review, the domain
objects are nearly anemic: data containers with all behavior pushed into
service classes. It is neither a context failure nor a model limitation:
the AI knew the project’s bounded contexts and is capable of generating
rich domain models. It is a command gap: the generation instruction never
specified that behavior belongs in the domain objects, not in the classes
around them. A single constraint added to the generation instruction is
the fix.
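The fix in that scenario could be as small as this. A hedged sketch of the added constraint inside a generation instruction; the wording and section heading are invented for illustration:

```markdown
## Domain modeling constraints
- Put behavior on the domain objects themselves. Service classes
  orchestrate; they do not hold domain logic. Do not generate anemic
  models (data containers with behavior pushed into services).
```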

The mapping is concrete. Context signal feeds back into priming
documents. Instruction signal feeds back into shared commands. Workflow
signal feeds back into team playbooks. Failure signal feeds back into
guardrails and documented anti-patterns. The feedback loop has specific
inputs and specific destinations; it is not an abstract aspiration to
“get better at AI” but a practice of updating particular artifacts
based on particular observations. Not every observation clears the bar:
one-off edge cases and personal style preferences stay personal. The
signal worth capturing is one that recurred, or that any developer on
the team would hit working on the same problem.

The Practice

The feedback loop works at four cadences, each matched to the weight of
the update.

After each session: a brief reflection, not a formal process. One
question: did anything happen in this session that should change a shared
artifact? Often the answer is no. The session went fine, the priming
document had what the AI needed, the commands caught what they should
catch. When the answer is yes, the update is immediate: a line added to
the priming document, a check added to a command, a note in a feature
document. The discipline is in the question, not in the overhead. The act
of asking takes seconds. The act of updating, when warranted, takes
minutes. The easiest way to establish the habit is to anchor it to an
existing checkpoint: a field in the PR template, a single line in the
standup, or the act of closing the editor at end of day. The trigger
matters less than the consistency.
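Anchoring the question to a pull request template might look like this; the field wording is illustrative, not a recommended standard:

```markdown
<!-- Pull request template (illustrative field) -->
### AI session learnings
Did anything in this change's AI sessions warrant updating a shared
artifact (priming document, command, playbook)?
- [ ] Nothing to capture, or update linked above
```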

At the stand-up: for teams that already have a daily stand-up, this
is a natural place to spread useful learning quickly. A simple question
such as “did anyone learn something with the AI yesterday that the rest of
us should know?” can turn one person’s discovery into shared practice
without adding another meeting.

At the retrospective: an agenda item in the existing sprint
retrospective: what worked with AI this sprint? What friction did we hit?
What will we update? The outputs are concrete: a priming document
revision, a command refinement, a new anti-pattern documented. This is
where individual observations become team decisions. One developer’s
discovery that a particular constraint improves code review output becomes
the team’s updated review command. The tech lead or a designated owner
makes the final call on what gets committed to shared artifacts; the
retrospective is the forum for surfacing options, not for reaching
consensus on every detail.

Periodically: a review of whether the artifacts are actually being
used and whether they remain current. Which commands are being run? Which
are ignored? Where are the remaining gaps? This is the lightest cadence:
quarterly, or whenever the team senses that the artifacts have drifted
from practice.

The practice is lightweight by design. The heaviest cadence is a
five-minute agenda item in a meeting that already exists. If the practice
requires its own meeting, it will be the first thing cut when the team is
busy — which is precisely when learning matters most.

Knowing the practice is running is different from knowing it is
working.

Measuring What Changes

Most teams that try to measure AI effectiveness measure the wrong
things. Speed (lines generated, time to first output) measures volume, not
value. A fast output that requires extensive rework is not a productivity
gain. It is rework with extra steps.

What actually matters is harder to measure but more informative:
first-pass acceptance rate (how often the AI’s initial output is usable
without major revision), iteration cycles (how many back-and-forth rounds
a task requires), post-merge rework (how much fixing happens after code
ships), and principle alignment (whether the output follows the team’s
architectural standards). These are the indicators that the feedback loop
is working: the team’s artifacts are capturing more of what the AI needs,
and the AI’s output is converging toward what the team expects.
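These indicators can be tracked from lightweight session records without a dashboard. A hedged sketch, assuming a team notes one record per session; the record fields (`accepted_first_pass`, `iterations`) are invented for illustration:

```python
# Illustrative metrics over hypothetical session records.
def first_pass_acceptance(sessions: list[dict]) -> float:
    """Fraction of sessions whose initial output was usable without major revision."""
    if not sessions:
        return 0.0
    return sum(s["accepted_first_pass"] for s in sessions) / len(sessions)

def mean_iterations(sessions: list[dict]) -> float:
    """Average back-and-forth rounds per task; a falling trend suggests the loop works."""
    if not sessions:
        return 0.0
    return sum(s["iterations"] for s in sessions) / len(sessions)

sessions = [
    {"accepted_first_pass": True, "iterations": 1},
    {"accepted_first_pass": False, "iterations": 4},
    {"accepted_first_pass": True, "iterations": 2},
]
# first_pass_acceptance(sessions) -> 2/3; mean_iterations(sessions) -> 7/3
```

The absolute values matter less than the trend across sprints: the flywheel is working when acceptance rises and iterations fall for comparable tasks.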

For teams already tracking DORA metrics, these
indicators can serve as useful leading signals. Fewer iteration cycles
usually mean less rework per change, which in turn helps shorten lead
time. Higher principle alignment means architectural drift is caught
earlier, before it reaches production, which should reduce the change
failure rate. The feedback loop is not a separate initiative so much as a
way of improving the outcomes the team already cares about. If DORA
metrics are not yet part of the practice, a simpler proxy will do: how
often does the team say “the AI knew exactly what to do”? Tracked
informally, that frequency gives an early indication that the artifacts
are helping before the broader delivery metrics move.

The honest framing: these metrics are difficult to track rigorously.
Counting iteration cycles requires a consistent definition of what
constitutes a “cycle”; that definition varies by task complexity.
First-pass acceptance is a judgment call, not a binary. In practice, the
signal is often qualitative. The team notices that AI sessions are
smoother, that commands catch more issues, that new team members ramp up
faster with the priming documents and playbooks than they did without
them. The absence of frustration — the declining frequency of “why did the
AI do that?” — is often the most reliable indicator. I would not
recommend building a dashboard. I would recommend paying attention.

Calibration

This practice matters most for teams that have already established the
foundational infrastructure from the earlier articles and want to move
from “we use AI” to “we get better at using AI.” For teams still in
initial adoption, the priority is building that infrastructure first. The
feedback loop that improves it comes after.

The trade-off is discipline without bureaucracy, a narrow path. Too
formal, and the practice becomes overhead that gets abandoned within a
quarter. Too informal, and it is indistinguishable from not doing it at
all. The after-session question, the retrospective agenda item, the
periodic review — these are deliberately minimal. The rhythm matters more
than the rigor. A team that asks “what should we update?” every two weeks
and acts on the answer will improve faster than a team that designs an
elaborate harvesting process and abandons it when deadlines tighten.

There is an urgency to this that is structural. The AI ecosystem
(models, tools, capabilities) evolves on a cadence that makes traditional
documentation decay look glacial. A priming document written when the team
adopted one model version may actively mislead once a newer version
handles context windows differently. A command designed around one tool’s
strengths may miss capabilities introduced in the next release. This is
the same dynamic teams already understand with dependency management: a
lockfile that is never updated does not stay stable, it becomes a
liability. These artifacts deserve the same treatment: reviewed
periodically, maintained with the same discipline as test suites, not
written once and filed alongside onboarding checklists. The teams that
treat them as living infrastructure will compound. The teams that treat
them as setup documentation will plateau, not because they started wrong,
but because they stopped maintaining.

The feedback loop has nowhere to go without the artifacts it improves —
start with those.

Conclusion

What distinguishes a team that merely uses AI from one that gets better
with it is not the model. It is whether the team has a way to turn each
interaction into a small improvement in its shared artifacts. That is the
role of the feedback loop. It takes what would otherwise remain personal
intuition – a prompt that worked, a failure that recurred, a missing
convention, a review gap – and makes it part of the team’s
infrastructure.

This is why I see the feedback flywheel not as an extra practice
layered on top of the others, but as the maintenance mechanism for all of
them. Knowledge Priming drifts
unless someone updates it. Design-First Collaboration improves
only when teams notice what structure helped. Context Anchoring gets better when teams see
what they failed to capture. Encoding Team Standards sharpens when
failures expose missing checks. The infrastructure compounds only if
practice feeds back into it.

Taken together, these techniques describe a way of working with AI that
mirrors how good teams work with each other: share context early, think
before coding, make standards explicit, externalize decisions, and learn
from each session. The tools will keep changing. The teams that continue
learning through shared artifacts and lightweight rituals will be the ones
that get more value from them over time.

I would not begin with all of this at once. I would begin with one
shared artifact and one habit: at the end of a session, ask what should
change for the next one. Then make that change while the lesson is still
fresh. That is small enough to sustain, and small steps are what make the
flywheel turn.



