A General Purpose Research Experiment · autoresearch × design

Auto-improving websites and conversations with agents.

design-gan is a dual-agent loop: a generator writes a single-page website from a brief, a critic scores it on the System Usability Scale, and the feedback rolls forward into the next draft — until the composite score plateaus.


Peak composite

76

Starting composite

52

Delta to peak

+24

Iterations to peak

3

The idea

This experiment is inspired by Andrej Karpathy's autoresearch, in which a single agent autonomously evolves an LLM training script against a deterministic loss. It takes the same autonomous-loop paradigm and applies it to artifacts that don't have a clean numeric metric: single-page websites and short conversations.

Two main differences. First, we split the work into a generator and a critic — hence the “gan” in the name — rather than running one agent that both modifies and evaluates. Second, the score is a Claude critic's verdict on a standard usability questionnaire, anchored by objective penalties (axe-core violations for websites; repetition and boilerplate in conversations), because there's no bits-per-byte to minimize.

A generator agent writes a standalone HTML document (or a conversation assistant's system prompt) from the brief. A headless Chromium renders the page (or a user-simulator agent runs the dialogue), and the rendered artifact goes to the critic. The critic fills out the 10-item System Usability Scale (or the parallel Conversation Usability Scale), writes feedback, and returns a short list of concrete next-step suggestions that feed the next generator call. The loop stops when a configurable window of iterations produces no meaningful composite-score gain, or at a hard iteration cap.
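The stopping rule described above can be sketched as follows. This is a minimal illustration, not the project's code: `Verdict`, `run_loop`, and the `patience`, `min_gain`, and `max_iters` parameters are hypothetical names and defaults, and the generator and critic are passed in as plain callables so the sketch stays self-contained.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Verdict:
    composite: float        # score after penalties, 0-100
    suggestions: List[str]  # rolled forward into the next generator call


def run_loop(
    brief: str,
    generate: Callable[[str, List[str]], str],  # (brief, suggestions) -> artifact
    critique: Callable[[str], Verdict],         # artifact -> verdict
    patience: int = 2,      # assumed: iterations tolerated without a gain
    min_gain: float = 1.0,  # assumed: smallest gain that counts as meaningful
    max_iters: int = 10,    # assumed hard cap
) -> Tuple[Optional[str], float, List[float]]:
    """Return (best artifact, best composite, composite per iteration)."""
    best_artifact: Optional[str] = None
    best_score = float("-inf")
    history: List[float] = []
    suggestions: List[str] = []
    stale = 0
    for _ in range(max_iters):
        artifact = generate(brief, suggestions)
        verdict = critique(artifact)
        history.append(verdict.composite)
        if verdict.composite >= best_score + min_gain:
            best_artifact, best_score = artifact, verdict.composite
            stale = 0
        else:
            stale += 1
            if stale >= patience:
                break  # plateau: no meaningful gain inside the window
        suggestions = verdict.suggestions
    return best_artifact, best_score, history
```

Keeping the best artifact separately from the latest one matters because the score is not monotonic: a later draft can score lower than an earlier one, so the loop should return the peak rather than the final iteration.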

How a single iteration works

brief ──▶ generator ──▶ HTML ──▶ headless Chromium
                                          │
                                          ├─▶ screenshot
                                          ├─▶ DOM
                                          └─▶ axe-core violations
                                                    │
                                                    ▼
                         critic ◀── System Usability Scale
                           │
                           ├─▶ SUS answers (10 × 1–5)
                           ├─▶ prose feedback
                           └─▶ prioritized suggestions ──▶ next generator call

The composite score is the standard SUS score (0–100) minus a weighted axe-core penalty. The penalty is capped, but it keeps the critic honest about accessibility regardless of visual appeal: a solid design can't hide a colour-contrast failure or missing alt text.
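As a rough sketch of that arithmetic: the SUS part below follows the standard scoring rule (odd items contribute answer minus 1, even items contribute 5 minus answer, and the sum is multiplied by 2.5). The penalty part is an assumption. The text says only that the penalty is weighted and capped, so the `weight` and `cap` defaults and the name `website_composite` are hypothetical, chosen to be roughly in line with the penalties reported for this run.

```python
from typing import Sequence


def sus_score(answers: Sequence[int]) -> float:
    """Standard SUS scoring: ten 1-5 Likert answers -> a 0-100 score."""
    if len(answers) != 10 or any(a < 1 or a > 5 for a in answers):
        raise ValueError("SUS needs ten answers in the range 1-5")
    total = 0
    for i, a in enumerate(answers):
        # Items 1, 3, 5, 7, 9 are positively worded; 2, 4, 6, 8, 10 negatively.
        total += (a - 1) if i % 2 == 0 else (5 - a)
    return total * 2.5


def website_composite(
    sus: float,
    violation_nodes: int,
    weight: float = 3.0,  # assumed points per flagged node
    cap: float = 30.0,    # assumed ceiling on the penalty
) -> float:
    """SUS minus a capped, weighted accessibility penalty."""
    penalty = min(cap, weight * violation_nodes)
    return sus - penalty
```

With these assumed defaults, `website_composite(88, 4)` gives 76.0, and any page with ten or more flagged nodes loses the full 30 points but no more.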

A run, iteration by iteration

Brief: “A landing page for a weekend cycling tour in rural Vermont.”

Four iterations of the real loop against one brief — pages you could imagine coming upon in the real world. The composite score climbed from a competent-but-flawed baseline (52) to a polished peak at iteration 3 (76), then slipped back as the critic tightened its stance on colour-contrast failures — a textbook local-optimum trajectory rather than a monotonic ascent. Everything below is verbatim: the actual screenshots, the critic's SUS answers, and the suggestions that rolled forward into the next generator call. Click any screenshot to open that iteration's generated HTML in a new tab.
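The trajectory can be recomputed from the SUS and a11y-penalty figures reported for each iteration of this run. The snippet is only a worked check of the arithmetic; the variable names are illustrative.

```python
# (iteration, SUS, a11y penalty) as reported for each iteration of this run
runs = [(1, 82, 30), (2, 80, 15), (3, 88, 12), (4, 80, 24)]

# Composite = SUS minus the accessibility penalty.
composites = {iteration: sus - penalty for iteration, sus, penalty in runs}

peak_iter = max(composites, key=composites.get)
delta = composites[peak_iter] - composites[1]

print(composites)        # {1: 52, 2: 65, 3: 76, 4: 56}
print(peak_iter, delta)  # 3 24
```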

Iteration #1

SUS 82

a11y penalty 30

Critic feedback

Green Mountain Gravel is a well-crafted, visually cohesive landing page with a clear information hierarchy, prominent CTAs, and an attractive thematic design. The fixed nav, stat bar, and section structure make orientation effortless for any user. However, axe-core flags 8 nodes with insufficient color contrast (particularly the ~78% white subtitle text over dark green and small uppercase labels) and 3 in-body links indistinguishable without color, both of which meaningfully reduce usability for low-vision or color-blind users. The site's scope is inherently single-purpose, limiting 'frequent use' appeal, but within its task domain it performs confidently.

SUS answers (1–5 Likert, 10 items)

Suggestions sent to the next generator

Fix the 8 color-contrast failures — raise subtitle and body text on the dark green hero to at least 4.5:1; the rgba(255,255,255,.78) value (~3.8:1) is the likely culprit, bumping opacity to 0.92+ would resolve it.
Make in-body links distinguishable without color (e.g., add an underline or bold weight) to address the 3 link-in-text-block violations, ensuring they're usable in grayscale or by color-blind visitors.
Add a visible focus indicator everywhere — nav links use `outline: none` on hover/focus which removes the ring for keyboard users; replace with a consistent amber or white outline.
The hero stat bar ('2 Days / 62 mi / 4,800' / 40 Riders Max / Sept') is visually strong but cut off at the very bottom fold on typical viewports — nudge it fully into view or add a subtle scroll-down hint so users discover it.
ARIA role mismatches on 3 nodes and an `aside` inside a landmark should be cleaned up to avoid screen-reader confusion — audit the decorative SVG and sidebar elements to ensure roles match their semantic purpose.
Consider adding a sticky 'Spots remaining' counter or a short social-proof line (e.g., 'Sold out in 2024') near the CTA — the page's visual quality earns trust but gives no urgency signal for a capacity-limited tour.

Iteration #2

SUS 80

a11y penalty 15

Critic feedback

This is a polished, single-purpose landing page with strong visual hierarchy, a coherent green-and-amber palette, and well-placed CTAs that communicate the event's value quickly. The stats bar at the hero bottom is a smart design choice that surfaces key decision-making facts without requiring a scroll. The main drags on the score are the 5 axe-core color-contrast failures (real barriers for low-vision users), the inherently low 'use frequently' ceiling of an event registration page, and the limited confidence a first-time visitor might have in the registration flow which isn't visible above the fold.

SUS answers (1–5 Likert, 10 items)

Suggestions sent to the next generator

Resolve all 5 color-contrast violations flagged by axe-core — audit amber text on dark-green backgrounds and light-gray body text on the cream background, as these are the most likely offenders.
The 'SCROLL' cue is tiny and low-contrast against the hero background; increase its size or replace with a more prominent animated chevron so users know substantial content awaits below the fold.
Add a brief difficulty or fitness-level indicator near the '62 mi per day / 4,800 ft climb' stats — prospective riders need to self-qualify before registering, and omitting this creates uncertainty.
The 'Register Now' nav CTA and 'Reserve Your Spot' hero CTA both lead to the same action but use different labels — unify the wording to reduce cognitive friction.
Include a visible price anchor (even a 'From $X' teaser) near the primary CTA; without it, users may hesitate to click because they don't know what commitment they're walking into.
The desktop nav has five links plus a CTA — ensure a hamburger or equivalent exists for mobile viewports, as the current layout will overflow on small screens.

Iteration #3

SUS 88

a11y penalty 12

Critic feedback

Green Mountain Gravel is a well-crafted, single-purpose landing page with strong visual hierarchy, a cohesive green-and-amber design system, and clear CTAs that guide visitors straight to registration. The fixed nav with descriptive section labels, an upfront price anchor, and an urgency signal all reduce friction effectively. Four serious color-contrast violations (flagged by axe-core) affect legibility for some users, and the hero headline 'Two Hundred Miles' appears to conflict with the stats bar showing '62 mi,' which could momentarily undermine trust. Overall this is an above-average marketing page that would perform well in a usability study, but small inconsistencies and accessibility gaps keep it from a top score.

SUS answers (1–5 Likert, 10 items)

Suggestions sent to the next generator

Fix all 4 axe-core color-contrast violations — likely the small eyebrow/label text and the price-anchor line against the dark hero background; bump type size or darken/lighten the foreground color to achieve 4.5:1 minimum.
Reconcile the headline claim 'Two Hundred Miles' with the stats bar value of '62 mi' — if 62 mi is per-day average, add a '/day' label; otherwise correct the headline to avoid eroding trust for detail-oriented cyclists.
The stats bar is cut off at the bottom of the viewport in the initial view — either reduce hero padding so the stats row is fully visible above the fold, or surface the most critical stats (distance, elevation, dates) in the hero body copy itself.
Add social proof (past-rider photos, 1–2 short testimonials, or a press mention badge) in the hero or immediately below the fold — for a ~$449 purchase decision, trust signals near the primary CTA materially increase conversion.
The mobile hamburger menu is present in CSS but hidden on desktop; verify it is fully keyboard- and screen-reader-accessible on small viewports, as the nav links carry the site's entire wayfinding.
Consider adding a sticky 'spots remaining' micro-banner or progress bar once users scroll past the hero, so the scarcity signal stays visible throughout the longer scroll journey to the registration form.

Iteration #4

SUS 80

a11y penalty 24

Critic feedback

The landing page is visually polished and well-structured for its narrow purpose — a single-weekend cycling tour — with a clear hero, stat bar, and prominent CTA. Navigation is minimal and purposeful, making the site very easy to learn. The primary usability concern is the 8 color-contrast violations (axe-core), most visibly the small amber-on-dark and medium-gray-on-cream text combinations, which reduce readability for users with low vision. Overall this is a competent, above-average marketing page that would benefit from contrast fixes and slightly larger body text before it clears a formal accessibility audit.

SUS answers (1–5 Likert, 10 items)

Suggestions sent to the next generator

Fix the 8 color-contrast violations: the slate-600 (#4b5563) text on cream (#faf8f3) background falls below 4.5:1 — darken to slate-700 (#374151) or increase font size to ≥18px to meet WCAG AA.
Add a mobile hamburger menu — the nav-links are only visible on wide viewports; at the screenshot's apparent ~400px width they are hidden with no visible toggle shown.
Increase the stats bar stat-label font size from 0.75rem to at least 0.8125rem (13px) to improve legibility of the uppercase tracking labels ('DAYS', 'ELEV CLIMB', etc.).
The urgency dot animation ('12 spots remaining') draws the eye but the sentence is easy to miss at 0.88rem in the hero. Consider bumping to 0.95rem and adding a subtle highlight background to make the scarcity signal more scannable.
Add visible section landmarks (e.g., <main>, landmark roles) so screen-reader users and keyboard navigators can jump between sections without relying solely on the skip link and nav anchors.
The hero price anchor ('From $449') uses a very small 0.88rem font below the CTA buttons; consider surfacing it more prominently or moving it closer to the primary 'Reserve Your Spot' button to reduce the distance between the value proposition and the price.

Now the same loop, for conversations

design-gan isn't only about websites; it's about treating any generated artifact as something you can evolve under critique. The second half of the system runs the identical loop over conversations: a user simulator plays a person with a goal, the assistant replies over up to five turns, and the critic scores the whole transcript on a parallel Conversation Usability Scale (CUS).

The composite score formula is unchanged: CUS (0–100) minus weighted objective penalties (repetition, boilerplate phrases, length bloat, unresolved goals), with the penalty capped at 30 so reasoning can't be buried by nitpicks.
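One way the capped penalty could be computed is sketched below. Only the four penalty categories and the cap of 30 come from the text. The weights, the boilerplate phrase list, the word limit, and the way each metric is counted are all assumptions made for illustration, and they are not meant to reproduce the penalties reported for the run.

```python
import re
from typing import Dict, List

PENALTY_CAP = 30.0  # stated in the text
# Hypothetical weights and phrase list, for illustration only.
WEIGHTS = {"repetition": 5.0, "boilerplate": 5.0, "length_bloat": 5.0, "unresolved": 15.0}
BOILERPLATE = ("great question", "i hope this helps", "as an ai")


def _sentences(text: str) -> List[str]:
    """Split on sentence-ending punctuation and lowercase for comparison."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p.lower() for p in parts if p]


def objective_metrics(assistant_turns: List[str], goal_resolved: bool,
                      max_words: int = 150) -> Dict[str, int]:
    seen: Dict[str, int] = {}
    for turn in assistant_turns:
        for sentence in set(_sentences(turn)):  # count once per turn
            seen[sentence] = seen.get(sentence, 0) + 1
    return {
        # Sentences that appear verbatim in more than one assistant turn.
        "repetition": sum(1 for n in seen.values() if n > 1),
        "boilerplate": sum(t.lower().count(p) for t in assistant_turns for p in BOILERPLATE),
        "length_bloat": sum(1 for t in assistant_turns if len(t.split()) > max_words),
        "unresolved": 0 if goal_resolved else 1,
    }


def penalty(metrics: Dict[str, int]) -> float:
    raw = sum(WEIGHTS[name] * count for name, count in metrics.items())
    return min(PENALTY_CAP, raw)


def conversation_composite(cus: float, metrics: Dict[str, int]) -> float:
    return cus - penalty(metrics)
```

The cap means that even a transcript tripping every metric many times still keeps 70% of its CUS score, so the critic's judgement stays the dominant term.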

Peak composite

100

Iterations

4

Critics (ensemble)

Turn count per iter

1 → 2 → 4 → 3

How a conversation iteration works

goal ──▶ generator ──▶ assistant system prompt
                                    │
                                    ▼
       user sim ◀───────────▶ assistant (up to N turns)
                    │
                    ▼
            transcript ──▶ objective metrics
                    │
                    ▼
       critic(s) ◀── Conversation Usability Scale (10-item)
            │
            ├─▶ CUS answers (10 × 1–5)
            ├─▶ prose feedback
            └─▶ prioritized suggestions ──▶ next generator call
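The diagram shows one or more critics feeding a single set of outputs, but the text does not say how an ensemble is combined. The sketch below assumes the simplest option: average the critics' CUS scores and concatenate their suggestions, dropping duplicates. `CriticVerdict` and `aggregate` are hypothetical names, and the lens labels simply mirror the three headings that appear in the critic feedback.

```python
from dataclasses import dataclass
from statistics import mean
from typing import List, Tuple


@dataclass
class CriticVerdict:
    lens: str               # e.g. "Tone & register"
    cus: float              # this critic's CUS score, 0-100
    suggestions: List[str]


def aggregate(verdicts: List[CriticVerdict]) -> Tuple[float, List[str]]:
    """Assumed aggregation: mean score, suggestions merged in critic order."""
    score = mean(v.cus for v in verdicts)
    merged: List[str] = []
    seen = set()
    for verdict in verdicts:
        for suggestion in verdict.suggestions:
            key = " ".join(suggestion.lower().split())  # ignore case and spacing
            if key not in seen:
                seen.add(key)
                merged.append(suggestion)
    return score, merged
```

Exact-match deduplication is deliberately naive. Two critics often phrase the same advice differently, so near-duplicates would still reach the generator under this scheme.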

The user simulator is another Claude — same model as the assistant — given a persona sketch and the goal. It decides when it's satisfied and ends the conversation naturally, so turn count is an output, not just a cap.
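A minimal sketch of that dialogue loop, with the assistant and the user simulator passed in as plain callables. The function name, the `(reply, satisfied)` return shape for the simulator, and counting a turn as one assistant reply are assumptions. The cap of five comes from the "up to five turns" described earlier.

```python
from typing import Callable, List, Tuple

Message = Tuple[str, str]  # (role, text)


def run_dialogue(
    opening: str,
    assistant: Callable[[List[Message]], str],
    user_sim: Callable[[List[Message]], Tuple[str, bool]],  # (reply, satisfied)
    max_turns: int = 5,
) -> Tuple[List[Message], int, bool]:
    """Return (transcript, assistant turns used, whether the goal was resolved)."""
    transcript: List[Message] = [("user", opening)]
    turns = 0
    resolved = False
    while turns < max_turns:
        transcript.append(("assistant", assistant(transcript)))
        turns += 1
        reply, resolved = user_sim(transcript)
        transcript.append(("user", reply))
        if resolved:
            break  # the simulator ended the conversation on its own
    return transcript, turns, resolved
```

Because the simulator can stop early, `turns` is a measured result of each run rather than a fixed setting, which is why the turn count varies from one iteration to the next.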

A real run — goal: “I keep procrastinating on writing my thesis. How do I actually start?”

Iteration 1 scored 100 out of the gate. Iterations 2–4 landed at 97.5 (displayed as 98 in the cards below) as the critic pushed for structure and the assistant introduced trade-offs. This is a textbook local optimum: starting from a perfect score, any “improvement” can only cost points. The transcripts tell the actual story:

Iteration #1

100

CUS 100

penalty 0

turns 1

user

I keep procrastinating on writing my thesis and I don't know how to actually get started. Every time I sit down I just freeze or find something else to do. Any practical advice for breaking through that?

assistant

Open your document right now and write one bad sentence about your thesis — any sentence, even "This thesis argues that..." and then nothing. That's your only job for the next 2 minutes. The freezing isn't a focus problem, it's a fear problem. Your brain is treating the blank page like a commitment to be right, so it refuses to start. The fix is making the first move so low-stakes it can't fail. Here's the exact sequence:

1. Set a 10-minute timer.
2. Write ugly. No editing, no rereading — just forward motion.
3. When the timer goes off, stop. You've broken the inertia.

Go. Come back if you get stuck.

user

That actually makes sense — treating it like a fear problem instead of a focus problem clicks for me. The 'write one bad sentence' thing is low enough stakes that I can actually do it. I think my goal here is met.

CUS answers (10 items, user-sim's verdict)

Critic feedback (goal resolved)

Conversation usability: This is a near-perfect single-turn interaction. The assistant reframed the problem (fear vs. focus) in a way the user explicitly found clarifying, gave a concrete, low-stakes action, provided a tight 3-step sequence without padding, and ended cleanly — no follow-up hedging or piling on. The user reached stated resolution in one exchange, which is the best possible outcome. The only marginal critique is that the numbered list, while helpful, slightly inflates length when prose could have been equally tight.

Tone & register: The assistant nailed this in a single turn — no opener filler, an immediately actionable move (write one bad sentence), and a genuine insight (fear vs. focus framing) that the user called out as the thing that clicked. The ending was clean and propulsive. The user explicitly declared resolution after one exchange, which is the ideal outcome for this type of question.

Trust & specificity: This is a near-perfect single-turn exchange. The assistant immediately reframed the problem ('fear, not focus') with a memorable, specific insight rather than recycling generic productivity tips, then gave a numbered sequence the user could execute in the next 10 minutes. The closing 'Go. Come back if you get stuck.' respected the user's time and avoided piling on. The user explicitly declared resolution, which the objective metrics confirm.

Suggestions sent to the next generator

Collapse the numbered list into a single punchy sentence to shave even more length — the list formatting adds visual weight that can slow momentum for a user who wants to act immediately.
The 'fear not focus' reframe was the highest-value insight; consider making it slightly more thesis-specific (e.g., 'committing an imperfect argument to paper') to hit even harder.
Add a one-line safety net earlier — e.g., 'If 2 minutes feels like too much, aim for one sentence' — to preempt the most likely failure mode without adding bulk.
The ending 'Go. Come back if you get stuck.' is excellent; standardize it as a template for wrap-ups so it's consistently this crisp.
For users who might loop back after trying and still freezing, priming a second-turn response about body-doubling or changing environment would keep momentum without cluttering the first response.
Nothing to fix here — the response matched the system prompt's intent precisely and the user confirmed resolution. Any additional advice would have diluted the impact.
If the user had come back stuck, the assistant could have offered body-doubling or the Pomodoro approach as a natural next layer — good to have in reserve.
The 3-step numbered list was slightly formal compared to the surrounding prose voice; could optionally fold those into two punchy sentences to keep the register consistent.
The response could have named the fear more specifically (e.g., 'fear of committing a bad idea to paper') for users who need that articulated before they can move — but the current level was appropriate for the opening turn.
The response is already excellent — suggestions are marginal improvements only.

Iteration #2

CUS 98

penalty 0

turns 2

user

I keep putting off writing my thesis and I don't know how to actually get started. Every time I sit down to write I either freeze up or find something else to do. Any practical advice for breaking through that?

assistant

The freeze isn't a focus problem — it's your brain refusing to commit an imperfect argument to the page. The fix is to make the first thing you write *deliberately* bad. Open a doc right now and type the worst possible version of your thesis: clunky, wrong, embarrassing. Something like "This thesis is about how X affects Y and it's kind of complicated." That sentence is now on the page, and the blank page problem is solved. The only rule for the next 25 minutes: writing and editing are separate jobs, and editing isn't on the schedule yet. Write garbage, keep moving. Go. You've got 25 minutes. Come back if you get stuck.

user

Okay that's actually useful — I like the 'write the worst version first' idea. I'll try it. But what do I do when I inevitably freeze up again mid-session?

assistant

When you freeze mid-session, it almost always means you've slipped back into editing mode without noticing — your internal critic showed up and you're now trying to write and judge simultaneously. The move: type "[I'm stuck here because...]" and finish that sentence honestly. It keeps the pen moving and usually names the actual snag (wrong argument? missing evidence? just scared?). If that doesn't unlock it, skip the stuck sentence entirely, drop a "[COME BACK]" placeholder, and write the next thing you *do* know — a blank space is unfixable later, a placeholder isn't.

user

This is solid, thanks. The placeholder trick is something I can actually use. I think I have enough to go try this now.

CUS answers (10 items, user-sim's verdict)

Critic feedback (goal resolved)

Conversation usability: The assistant executed almost flawlessly against a clearly defined goal. It named the root cause immediately (fear of committing an imperfect argument), gave a single concrete action without preamble, kept both responses to 3–4 punchy sentences, and the user explicitly declared resolution. No hedging, no filler, no over-stacking of tactics — the system prompt constraints were followed precisely and served the user well.

Tone & register: This is nearly textbook execution: the assistant named the real problem (fear of committing, not focus), gave a specific, instantly usable tactic, and never padded. The voice read like a sharp human, not a chatbot — no affirmation filler, no hedging, no list inflation. The second turn's '[COME BACK]' placeholder was a memorable, concrete detail that the user explicitly called out as useful. User reached explicit resolution in two assistant turns.

Trust & specificity: This was a near-perfect short coaching conversation. The assistant correctly named the root cause (fear of committing an imperfect argument, not a focus problem), gave an immediately executable technique (write deliberately bad on purpose), and followed up with an equally specific mid-session fix (the bracket tricks). Zero filler, zero hedging, no lists piled on top of lists — and the user explicitly declared resolution and readiness to act. The only minor ding is that trusting it broadly on 'similar topics' is a 4 rather than a 5, since the advice is practical wisdom rather than deeply referenced expertise.

Suggestions sent to the next generator

In turn 2, the '[I'm stuck here because...]' tip was slightly abstract — a single inline example of a completed sentence (e.g., 'I'm stuck here because I don't have the evidence yet') would have made it immediately actionable without adding length.
The 25-minute timer was introduced without framing; briefly tying it to the writing/editing separation principle ('25 minutes of write-only mode') would have reinforced the core insight rather than making it feel like an arbitrary number.
Attributing the 'write the worst version' technique to a recognizable source (e.g., Anne Lamott's 'shitty first drafts') would have slightly increased credibility and given the user something to look up if they wanted to go deeper.
Turn 3 ended cleanly per the system prompt, but since the user signaled they were ready to go — rather than that they'd already tried it — a one-line nudge like 'Start the timer as soon as this window closes' could have converted intent to immediate action.
The conversation assumed a blank-page scenario that matched the user's actual situation, but a single five-word diagnostic question early on (e.g., 'Blank page or stuck mid-draft?') would future-proof the approach for users who are blocked at a different stage.
In turn 3, the second sentence ('your internal critic showed up and you're now trying to write and judge simultaneously') is slightly redundant with the first — could trim to one punch for tighter delivery.
The closing line of turn 1 ('Come back if you get stuck') turned out to be exactly what the user did — consider whether that invitation slightly lengthens the conversation in a way that doesn't serve someone who could have self-resolved; it's a minor tradeoff.
The example thesis sentence ('This thesis is about how X affects Y and it's kind of complicated') is useful but generic — if the assistant had prompted the user to fill in their actual topic, it would have made the exercise more grounded and harder to skip.
After user resolution in turn 4, an optional ultra-short send-off ('Go. Come back after the session.') would have been a clean cap — the absence of any reply here is fine, but the system prompt's formula could have landed one last time.
The '[I'm stuck because...]' technique could briefly name why it works (keeps pen moving, externalizes the block) — one clause would add persuasive weight without bloat.

Iteration #3

CUS 98

penalty 0

turns 4

user

I keep procrastinating on my thesis and I can't seem to actually start writing. Any advice on how to break through that and just get going?

assistant

Blank page or stuck mid-draft?

user

Blank page. I know roughly what I want to say but every time I sit down to write I just freeze and end up doing anything else.

assistant

That freeze isn't a focus problem — it's fear of locking in an imperfect argument. Your brain keeps stalling because it's trying to write the *good* version before you've written *any* version. So right now, before you close this window, type one deliberately terrible sentence. Something like: "This thesis argues that [your topic] matters because [reason], though I haven't actually figured out why yet." That's it. That's the whole job for the next two minutes — not a draft, just proof that words can exist on the page. Go. Start the timer as soon as this window closes. Come back if you get stuck.

user

Okay that actually makes sense. I did write the bad sentence. Now what? Do I just keep writing garbage until something good comes out?

assistant

Yes — but with one rule: no editing for the next 25 minutes. Write-only mode. The freeze comes back the second you start judging what you just wrote, so that's a completely separate session. If you hit a spot where you don't have the evidence or logic yet, just type `[stuck here — need to figure out X]` and keep moving. That bracket is a placeholder, not a wall. Go. Start the timer as soon as this window closes. Come back if you get stuck.

user

Got it — that actually clicks. I'll try the 25 minutes. What do I do when the timer goes off? Is that when I go back and edit, or take a break, or what?

assistant

Take a five-minute break — stand up, get water, don't look at the page. Then come back and read what you wrote *without touching it* yet; just get the lay of the land. Editing is a different cognitive mode and you need a small reset between the two or you'll start pruning sentences that are actually fine. Go. Start the timer as soon as this window closes. Come back if you get stuck.

user

That's a solid loop — write, break, read. I think I actually have enough to work with now. Thanks, this was more useful than I expected.

CUS answers (10 items, user-sim's verdict)

Critic feedback (goal resolved)

Conversation usability: The assistant diagnosed the problem precisely (fear of imperfect commitment, not focus), gave an immediately actionable task (write one deliberately bad sentence), then built each subsequent turn on what the user had just done — a tight, progressive loop that reached resolution in four exchanges. The repeated closing push formula ('Go. Start the timer…') was slightly mechanical by the third use, but did not harm the user experience. Trust is docked one point because the advice, while solid, is practical-coach-level rather than demonstrably research-grounded. Overall this is a near-optimal short coaching conversation.

Tone & register: One of the cleaner productivity coaching conversations I've seen — the diagnostic opener, tailored pivot, and concrete techniques (bad sentence template, bracket trick, write-only window) all landed precisely. The one mechanical tell is the sign-off formula repeated verbatim three times; it works the first time but by the third iteration it reads like a macro firing, not a human wrapping up. User explicitly declared resolution and called it more useful than expected, which is a strong signal.

Trust & specificity: The assistant executed nearly perfectly: the upfront diagnostic question targeted the exact scenario before dispensing advice, the core insight (fear-of-imperfect-argument, not focus) was concrete and named accurately, and every tactic came with a specific template or time box rather than vague encouragement. The user explicitly confirmed the techniques worked and declared resolution with 'I actually have enough to work with now' — a clear success. The one minor mechanical note is that the closing 'Go. Start the timer…' line repeated verbatim across three turns, which added a slight rote feel, but it didn't undermine the substance.

Suggestions sent to the next generator

Vary or drop the repeated closing formula after the second use — by turn 7 it reads as a canned script rather than a genuine push.
After the user successfully wrote the bad sentence, briefly acknowledge the win before pivoting to the next tactic — even one word ('Good.') builds momentum and models the 'write without judging' behavior the assistant is trying to instill.
Ask the user's actual thesis topic at the diagnostic stage — even a brief anchor ('What's the topic in one line?') lets you model the fill-in sentence with real content, making it more actionable than a bracketed placeholder.
On the post-timer question, offer one concrete heuristic for whether to edit or take another write pass — something like 'if you filled more than half a page, edit; if not, take one more write round' gives the user a decision rule rather than leaving them to feel it out.
Consider briefly naming the Anne Lamott or academic source behind the 'shitty first drafts' framing — users who've heard the concept before get an instant credibility anchor, and those who haven't can look it up for reinforcement after the session.
Vary the closing push across turns — identical wording three times in a row reveals the template; even small paraphrasing ('Now go. Timer on, window closed.') preserves the punch without the mechanical feel.
When the user signals success ('I did write the bad sentence'), briefly acknowledge the win before pivoting to the next move — one short clause is enough and keeps the rapport human rather than purely task-sequential.
The bracket trick and the 25-minute rule were stacked in the same turn despite the system prompt's 'use one at a time' rule — splitting them would have kept each technique more memorable and avoided cognitive load.
In the post-timer turn, 'don't look at the page' is slightly ambiguous — 'don't open the doc during the break' would be more concrete and prevent the user from second-guessing what counts.
Turn 7 could have named *why* editing is a different cognitive mode in one clause (e.g., 'analytical vs. generative') — it was stated but that one-phrase anchor would make the advice more memorable and transferable.

Iteration #4

CUS 98

penalty 0

turns 3

user

I keep procrastinating on my thesis and can't seem to actually start writing. Every time I sit down I either freeze up or end up doing something else. What's a practical way to break through this?

assistant

Two quick questions before I give you anything concrete:

1. Are you stuck at a blank page, or are you mid-draft and stalled?
2. What's your thesis topic in one line?

user

Blank page — I haven't written a single paragraph yet. Topic is something like 'the impact of social media on political polarization in Gen Z.' I've done research, I have notes, I just can't seem to actually write anything.

assistant

Here's what's actually happening: you're not failing to focus — you're afraid of writing something wrong and having it exist. The blank page feels safe because it can't be judged yet. That's the real block. The antidote is to write one sentence that's *deliberately* bad. Right now, before anything else:

> "This thesis argues that social media's role in Gen Z political polarization matters because [reason], though I haven't fully figured out why yet."

That's your sentence to write — use your actual words, make it worse if you want. The point is to break the seal. This is Anne Lamott's shitty-first-drafts principle: the first draft's only job is to exist, not to be good. Write that sentence now. Tell me when you have it.

user

Okay I actually tried it and wrote something embarrassingly bad. It worked a little — I felt less frozen. But now what? I don't want to stall again after one sentence.

assistant

Good — that's the hardest move, and you made it. Now capitalize on it: set a 25-minute write-only window. No editing, no rereading what you just wrote — just keep generating. Editing is analytical mode; writing is generative. They can't run at the same time, so don't try. This isn't a willpower thing, it's a different cognitive session entirely. Open the doc, start the timer, write forward from that bad sentence. Nothing else. Go. Start the timer as soon as this window closes.

user

That's actually really helpful — I have a clear next move now. Thanks.

CUS answers (10 items, user-sim's verdict)

Critic feedback (goal resolved)

Conversation usability: The assistant ran a clean, efficient arc: two targeted clarifying questions, a personalized diagnosis using the user's actual topic, and one tactic deployed per turn with no stacking. The user explicitly reached resolution and had a concrete next move. The only minor drag is the opening clarification turn added a round-trip before anything actionable landed, though the payoff — advice personalized to 'Gen Z political polarization' rather than a generic thesis — clearly justified it.
Tone & register: The assistant executed nearly everything right: it asked exactly two clarifying questions, used the user's real thesis topic to make the bad-sentence template concrete, named the psychological mechanism without padding it, acknowledged the user's win in a single clause, and closed each turn with a crisp push. The user reached explicit resolution in three assistant turns with no hedging, no filler openers, and no repeated advice. Minor deductions: the two-question opener reads slightly script-like rather than conversational, and the closing push ('Go. Start the timer as soon as this window closes.') feels templated rather than earned.
Trust & specificity: The assistant executed nearly perfectly: it gathered exactly the two data points it needed, named the real psychological block with specificity, and made the intervention concrete by using the user's actual thesis topic in the example sentence. Acknowledging the win in one clause before pivoting to the next tactic was crisp and non-patronizing. The only hairline crack is that the user had to ask 'now what?' — a single forward-pointer at the end of turn 3 (e.g., 'once you have it, we'll set a 25-min timer') could have anticipated that and kept momentum tighter.

Suggestions sent to the next generator

Consider offering a short provisional hypothesis *while* asking the two questions, so the user gets something actionable even if they don't reply — reduces the cost of the clarification round-trip.
The 'write one deliberately bad sentence' tactic could include a brief instruction on where to put it (top of the doc, new file, notes doc) to remove the micro-decision that can itself stall action.
When the user reported the sentence worked 'a little,' acknowledging the partial success rather than a full win ('Good — that's the hardest move') more honestly mirrors their experience and could deepen trust.
The closing push ('Go. Start the timer as soon as this window closes.') is strong but could name a specific timer tool or method (phone clock, Pomofocus, etc.) for users who would otherwise spend time choosing one.
Trust calibration: the analytical-vs-generative claim is stated as fact with no hedging; a brief attribution (even 'cognitive science of writing suggests…') would make it more credible without adding filler.
Weave the two clarifying questions into a single natural sentence rather than a numbered list — it reads more like a coach and less like a chatbot intake form.
Make the bad-sentence template even rougher ('I think social media does something to Gen Z politics, but I'm not sure what') — a more obviously terrible example lowers the bar further and is easier for the user to beat.
After the 25-minute timer suggestion, give the user one post-timer decision rule unprompted (e.g., 'If you filled more than half a page, switch to editing next session; if not, do another write round') — it prevents the next stall before it happens.
When crediting Anne Lamott, briefly anchor why it's trustworthy ('her actual term is shitty first drafts — it's from a widely-used craft book') so the user knows it's a real, lookable-up concept rather than a coaching platitude.
Consider proactively naming what 'after the 25 minutes' looks like in one clause — users often stall again at the transition, and pre-empting it keeps momentum without adding length.

Raw CUS dataset for this conversation run (JSON)

[
  {
    "iter": 1,
    "cus": 100.0,
    "penalty": 0.0,
    "composite": 100.0,
    "turns_taken": 1,
    "satisfied": true,
    "sus_answers": [
      5,
      1,
      5,
      1,
      5,
      1,
      5,
      1,
      5,
      1
    ],
    "cost_usd": 0.24800925000000001
  },
  {
    "iter": 2,
    "cus": 97.5,
    "penalty": 0.0,
    "composite": 97.5,
    "turns_taken": 2,
    "satisfied": true,
    "sus_answers": [
      5,
      1,
      5,
      1,
      5,
      1,
      4,
      1,
      5,
      1
    ],
    "cost_usd": 0.2888682
  },
  {
    "iter": 3,
    "cus": 97.5,
    "penalty": 0.0,
    "composite": 97.5,
    "turns_taken": 4,
    "satisfied": true,
    "sus_answers": [
      5,
      1,
      5,
      1,
      5,
      1,
      4,
      1,
      5,
      1
    ],
    "cost_usd": 0.33798225000000004
  },
  {
    "iter": 4,
    "cus": 97.5,
    "penalty": 0.0,
    "composite": 97.5,
    "turns_taken": 3,
    "satisfied": true,
    "sus_answers": [
      5,
      1,
      5,
      1,
      5,
      1,
      4,
      1,
      5,
      1
    ],
    "cost_usd": 0.33016409999999996
  }
]
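
The cus values above can be reproduced from the ten answers with standard System Usability Scale scoring. The sketch below assumes the Conversation Usability Scale reuses that scoring unchanged, which is consistent with the numbers in this run. The function name usability_score is illustrative and is not taken from the project's code.

```python
def usability_score(answers):
    """Standard SUS scoring: ten Likert answers (1-5) mapped to 0-100."""
    if len(answers) != 10 or any(a not in range(1, 6) for a in answers):
        raise ValueError("expected ten answers, each between 1 and 5")
    # Items 1, 3, 5, 7, 9 are positively worded: each contributes (answer - 1).
    positive = sum(a - 1 for a in answers[0::2])
    # Items 2, 4, 6, 8, 10 are negatively worded: each contributes (5 - answer).
    negative = sum(5 - a for a in answers[1::2])
    # The 0-40 raw total is scaled to 0-100.
    return (positive + negative) * 2.5


iter_1_answers = [5, 1, 5, 1, 5, 1, 5, 1, 5, 1]
iter_2_answers = [5, 1, 5, 1, 5, 1, 4, 1, 5, 1]

print(usability_score(iter_1_answers))  # 100.0
print(usability_score(iter_2_answers))  # 97.5
```

A single one-point drop on item 7 accounts for the whole gap between iteration 1 and iterations 2 to 4, since each answer point is worth 2.5 on the final scale.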

Design notes

The critic sees the rendered page, not the code. Code-only critique is cheap but correlates poorly with real usability.
Subjective + objective. SUS alone is gameable; axe-core anchors the score to measurable a11y signals.
Convergence, operationalised. “No further improvements” means that patience consecutive iterations have each failed to raise the composite by at least tolerance points.
Structured output. The critic uses Pydantic-validated JSON so a malformed response triggers a retry with a stricter nudge.
Prompt caching. System prompts are cached as ephemeral, so iteration N+1 pays near-zero for the static instructions.
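
The convergence rule in the notes above can be sketched as a small early-stopping check. This is a minimal illustration, not the project's implementation: the function name stop_iteration is hypothetical, the gain is assumed to be measured against the best composite seen so far, and patience=3 with tolerance=1.0 are illustrative values rather than the project's defaults.

```python
def stop_iteration(composites, patience=3, tolerance=1.0):
    """Return the 1-based iteration at which the loop would stop,
    or None if the plateau rule never fires on these scores."""
    best = float("-inf")
    stale = 0  # consecutive iterations without a sufficient gain
    for i, score in enumerate(composites, start=1):
        if score >= best + tolerance:
            # Meaningful improvement: record it and reset the window.
            best = score
            stale = 0
        else:
            stale += 1
            if stale >= patience:
                return i
    return None


# Composite scores from the conversation run shown earlier.
print(stop_iteration([100.0, 97.5, 97.5, 97.5]))  # 4
# Composite scores from the website run: still improving at iteration 3.
print(stop_iteration([52.5, 65.0, 75.5, 56.0]))   # None
```

Under these assumed settings, a run that peaks on its first iteration stops after three stale iterations, while a run that is still climbing would continue until the hard iteration cap.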

Run it yourself

A live version is at design-gan.fly.dev. You can browse prior runs without a token, and start your own if you have the shared access token (ask the owner). Or run the whole loop locally:

pip install -e .
playwright install chromium
cp .env.example .env  # add your ANTHROPIC_API_KEY

# Web UI — kick off runs, watch them stream in live
design-gan viewer      # http://127.0.0.1:8000

# Or one evolution loop from the terminal
design-gan run "A landing page for a weekend cycling tour in rural Vermont."

# The conversation loop, for evolving an assistant against a goal
design-gan converse "I keep procrastinating on writing my thesis. How do I actually start?"

Source, setup details, and the Fly.io deploy config are in the GitHub repository.

Raw dataset for this run (JSON)

[
  {
    "iter": 1,
    "sus_score": 82.5,
    "axe_penalty": 30.0,
    "composite_score": 52.5,
    "sus_answers": [
      2,
      1,
      4,
      1,
      4,
      2,
      5,
      1,
      4,
      1
    ],
    "feedback": "Green Mountain Gravel is a well-crafted, visually cohesive landing page with a clear information hierarchy, prominent CTAs, and an attractive thematic design. The fixed nav, stat bar, and section structure make orientation effortless for any user. However, axe-core flags 8 nodes with insufficient color contrast (particularly the ~78% white subtitle text over dark green and small uppercase labels) and 3 in-body links indistinguishable without color, both of which meaningfully reduce usability for low-vision or color-blind users. The site's scope is inherently single-purpose, limiting 'frequent use' appeal, but within its task domain it performs confidently.",
    "suggestions": [
      "Fix the 8 color-contrast failures \u2014 raise subtitle and body text on the dark green hero to at least 4.5:1; the rgba(255,255,255,.78) value (~3.8:1) is the likely culprit, bumping opacity to 0.92+ would resolve it.",
      "Make in-body links distinguishable without color (e.g., add an underline or bold weight) to address the 3 link-in-text-block violations, ensuring they're usable in grayscale or by color-blind visitors.",
      "Add a visible focus indicator everywhere \u2014 nav links use `outline: none` on hover/focus which removes the ring for keyboard users; replace with a consistent amber or white outline.",
      "The hero stat bar ('2 Days / 62 mi / 4,800' / 40 Riders Max / Sept') is visually strong but cut off at the very bottom fold on typical viewports \u2014 nudge it fully into view or add a subtle scroll-down hint so users discover it.",
      "ARIA role mismatches on 3 nodes and an `aside` inside a landmark should be cleaned up to avoid screen-reader confusion \u2014 audit the decorative SVG and sidebar elements to ensure roles match their semantic purpose.",
      "Consider adding a sticky 'Spots remaining' counter or a short social-proof line (e.g., 'Sold out in 2024') near the CTA \u2014 the page's visual quality earns trust but gives no urgency signal for a capacity-limited tour."
    ],
    "cost_usd": 0.459
  },
  {
    "iter": 2,
    "sus_score": 80.0,
    "axe_penalty": 15.0,
    "composite_score": 65.0,
    "sus_answers": [
      2,
      2,
      4,
      1,
      4,
      2,
      5,
      1,
      4,
      1
    ],
    "feedback": "This is a polished, single-purpose landing page with strong visual hierarchy, a coherent green-and-amber palette, and well-placed CTAs that communicate the event's value quickly. The stats bar at the hero bottom is a smart design choice that surfaces key decision-making facts without requiring a scroll. The main drags on the score are the 5 axe-core color-contrast failures (real barriers for low-vision users), the inherently low 'use frequently' ceiling of an event registration page, and the limited confidence a first-time visitor might have in the registration flow which isn't visible above the fold.",
    "suggestions": [
      "Resolve all 5 color-contrast violations flagged by axe-core \u2014 audit amber text on dark-green backgrounds and light-gray body text on the cream background, as these are the most likely offenders.",
      "The 'SCROLL' cue is tiny and low-contrast against the hero background; increase its size or replace with a more prominent animated chevron so users know substantial content awaits below the fold.",
      "Add a brief difficulty or fitness-level indicator near the '62 mi per day / 4,800 ft climb' stats \u2014 prospective riders need to self-qualify before registering, and omitting this creates uncertainty.",
      "The 'Register Now' nav CTA and 'Reserve Your Spot' hero CTA both lead to the same action but use different labels \u2014 unify the wording to reduce cognitive friction.",
      "Include a visible price anchor (even a 'From $X' teaser) near the primary CTA; without it, users may hesitate to click because they don't know what commitment they're walking into.",
      "The desktop nav has five links plus a CTA \u2014 ensure a hamburger or equivalent exists for mobile viewports, as the current layout will overflow on small screens."
    ],
    "cost_usd": 1.557
  },
  {
    "iter": 3,
    "sus_score": 87.5,
    "axe_penalty": 12.0,
    "composite_score": 75.5,
    "sus_answers": [
      2,
      1,
      5,
      1,
      4,
      1,
      5,
      1,
      4,
      1
    ],
    "feedback": "Green Mountain Gravel is a well-crafted, single-purpose landing page with strong visual hierarchy, a cohesive green-and-amber design system, and clear CTAs that guide visitors straight to registration. The fixed nav with descriptive section labels, an upfront price anchor, and an urgency signal all reduce friction effectively. Four serious color-contrast violations (flagged by axe-core) affect legibility for some users, and the hero headline 'Two Hundred Miles' appears to conflict with the stats bar showing '62 mi,' which could momentarily undermine trust. Overall this is an above-average marketing page that would perform well in a usability study, but small inconsistencies and accessibility gaps keep it from a top score.",
    "suggestions": [
      "Fix all 4 axe-core color-contrast violations \u2014 likely the small eyebrow/label text and the price-anchor line against the dark hero background; bump type size or darken/lighten the foreground color to achieve 4.5:1 minimum.",
      "Reconcile the headline claim 'Two Hundred Miles' with the stats bar value of '62 mi' \u2014 if 62 mi is per-day average, add a '/day' label; otherwise correct the headline to avoid eroding trust for detail-oriented cyclists.",
      "The stats bar is cut off at the bottom of the viewport in the initial view \u2014 either reduce hero padding so the stats row is fully visible above the fold, or surface the most critical stats (distance, elevation, dates) in the hero body copy itself.",
      "Add social proof (past-rider photos, 1\u20132 short testimonials, or a press mention badge) in the hero or immediately below the fold \u2014 for a ~$449 purchase decision, trust signals near the primary CTA materially increase conversion.",
      "The mobile hamburger menu is present in CSS but hidden on desktop; verify it is fully keyboard- and screen-reader-accessible on small viewports, as the nav links carry the site's entire wayfinding.",
      "Consider adding a sticky 'spots remaining' micro-banner or progress bar once users scroll past the hero, so the scarcity signal stays visible throughout the longer scroll journey to the registration form."
    ],
    "cost_usd": 1.096
  },
  {
    "iter": 4,
    "sus_score": 80.0,
    "axe_penalty": 24.0,
    "composite_score": 56.0,
    "sus_answers": [
      3,
      2,
      4,
      1,
      4,
      2,
      5,
      2,
      4,
      1
    ],
    "feedback": "The landing page is visually polished and well-structured for its narrow purpose \u2014 a single-weekend cycling tour \u2014 with a clear hero, stat bar, and prominent CTA. Navigation is minimal and purposeful, making the site very easy to learn. The primary usability concern is the 8 color-contrast violations (axe-core), most visibly the small amber-on-dark and medium-gray-on-cream text combinations, which reduce readability for users with low vision. Overall this is a competent, above-average marketing page that would benefit from contrast fixes and slightly larger body text before it clears a formal accessibility audit.",
    "suggestions": [
      "Fix the 8 color-contrast violations: the slate-600 (#4b5563) text on cream (#faf8f3) background falls below 4.5:1 \u2014 darken to slate-700 (#374151) or increase font size to \u226518px to meet WCAG AA.",
      "Add a mobile hamburger menu \u2014 the nav-links are only visible on wide viewports; at the screenshot's apparent ~400px width they are hidden with no visible toggle shown.",
      "Increase the stats bar stat-label font size from 0.75rem to at least 0.8125rem (13px) to improve legibility of the uppercase tracking labels ('DAYS', 'ELEV CLIMB', etc.).",
      "The urgency dot animation ('12 spots remaining') draws the eye but the sentence is easy to miss at 0.88rem in the hero. Consider bumping to 0.95rem and adding a subtle highlight background to make the scarcity signal more scannable.",
      "Add visible section landmarks (e.g., <main>, landmark roles) so screen-reader users and keyboard navigators can jump between sections without relying solely on the skip link and nav anchors.",
      "The hero price anchor ('From $449') uses a very small 0.88rem font below the CTA buttons; consider surfacing it more prominently or moving it closer to the primary 'Reserve Your Spot' button to reduce the distance between the value proposition and the price."
    ],
    "cost_usd": 1.583
  }
]
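
In every record above, the composite equals the SUS score minus the axe penalty. The sketch below checks that relationship and finds the peak iteration. It assumes a plain subtraction with no clamping or weighting, which matches this run but is not confirmed for other cases. The helper name composite is illustrative.

```python
# Numeric fields copied from the dataset above (feedback text omitted).
run = [
    {"iter": 1, "sus_score": 82.5, "axe_penalty": 30.0, "composite_score": 52.5},
    {"iter": 2, "sus_score": 80.0, "axe_penalty": 15.0, "composite_score": 65.0},
    {"iter": 3, "sus_score": 87.5, "axe_penalty": 12.0, "composite_score": 75.5},
    {"iter": 4, "sus_score": 80.0, "axe_penalty": 24.0, "composite_score": 56.0},
]


def composite(record):
    """Assumed rule: subjective SUS score minus the objective axe penalty."""
    return record["sus_score"] - record["axe_penalty"]


# Every recorded composite matches the assumed rule.
mismatches = [r["iter"] for r in run if composite(r) != r["composite_score"]]
print(mismatches)  # []

# The peak is the iteration with the highest composite.
peak = max(run, key=lambda r: r["composite_score"])
print(peak["iter"], peak["composite_score"])  # 3 75.5
```

Iteration 4 shows why the penalty matters: its SUS score matches iteration 2, but the regression to 8 contrast violations pulls the composite back to 56.0.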