A General Purpose Research Experiment · autoresearch × design

Auto-improving websites and conversations with agents.

design-gan is a dual-agent loop: a generator writes a single-page website from a brief, a critic scores it on the System Usability Scale, and the feedback rolls forward into the next draft — until the composite score plateaus.

Peak composite
76
Starting composite
52
Delta to peak
+24
Iterations to peak
3

The idea

This experiment is inspired by Andrej Karpathy's autoresearch, in which a single agent autonomously evolves an LLM training script against a deterministic loss. It takes the same autonomous-loop paradigm and applies it to artifacts that don't have a clean numeric metric: single-page websites and short conversations.

Two main differences. First, we split the work into a generator and a critic — hence the “gan” in the name — rather than running one agent that both modifies and evaluates. Second, the score is a Claude critic's verdict on a standard usability questionnaire, anchored by objective penalties (axe-core violations for websites; repetition and boilerplate in conversations), because there's no bits-per-byte to minimize.

A generator agent writes a standalone HTML document (or a conversation assistant's system prompt) from the brief. A headless Chromium renders the page (or a user-simulator agent runs the dialogue), and the rendered artifact goes to the critic. The critic fills out the 10-item System Usability Scale (or the parallel Conversation Usability Scale), writes feedback, and returns a short list of concrete next-step suggestions that feed the next generator call. The loop stops once a configurable window of iterations produces no meaningful composite-score gain, or at a hard iteration cap.
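
The stopping rule described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function name `run_loop`, the callables `generate` and `evaluate`, and the parameters `patience`, `min_gain`, and `max_iters` are all hypothetical stand-ins for whatever the real implementation uses.

```python
from typing import Callable, List, Tuple


def run_loop(
    brief: str,
    generate: Callable[[str, List[str]], str],
    evaluate: Callable[[str], Tuple[float, List[str]]],
    patience: int = 2,      # hypothetical: window of non-improving iterations
    min_gain: float = 1.0,  # hypothetical: what counts as a meaningful gain
    max_iters: int = 10,    # hypothetical: hard iteration cap
) -> List[float]:
    """Generate, score, and roll suggestions forward until the score plateaus."""
    scores: List[float] = []
    suggestions: List[str] = []
    best = float("-inf")
    stale = 0  # consecutive iterations without a meaningful gain

    for _ in range(max_iters):
        artifact = generate(brief, suggestions)
        score, suggestions = evaluate(artifact)  # composite score + next-step hints
        scores.append(score)

        if score >= best + min_gain:
            best = score
            stale = 0
        else:
            stale += 1
            if stale >= patience:
                break  # plateau: stop before the hard cap

    return scores
```

In this sketch `generate` stands for the generator agent and `evaluate` for the render-plus-critic step; both would wrap model calls in practice.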

How a single iteration works

brief ──▶ generator ──▶ HTML ──▶ headless Chromium
                                          │
                                          ├─▶ screenshot
                                          ├─▶ DOM
                                          └─▶ axe-core violations
                                                    │
                                                    ▼
                         critic ◀── System Usability Scale
                           │
                           ├─▶ SUS answers (10 × 1–5)
                           ├─▶ prose feedback
                           └─▶ prioritized suggestions ──▶ next generator call

The composite score is the standard SUS score (0–100) minus a weighted axe-core penalty. The accessibility penalty is capped, but it keeps the critic from being swayed by visual appeal alone: a solid design can't hide a colour-contrast failure or missing alt text.
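
As a worked example, the standard SUS arithmetic reproduces the numbers reported for this run. In the sketch below the function names are illustrative. How axe-core violations are weighted into a penalty is not described here, so the penalty is taken as a given input rather than computed.

```python
from typing import Sequence


def sus_score(answers: Sequence[int]) -> float:
    """Standard SUS scoring for ten 1-5 Likert answers, returning 0-100."""
    if len(answers) != 10 or any(a < 1 or a > 5 for a in answers):
        raise ValueError("SUS needs exactly ten answers, each in 1..5")
    total = 0
    for index, answer in enumerate(answers):
        if index % 2 == 0:
            total += answer - 1   # items 1, 3, 5, 7, 9 are positively worded
        else:
            total += 5 - answer   # items 2, 4, 6, 8, 10 are negatively worded
    return total * 2.5


def composite_score(answers: Sequence[int], penalty: float) -> float:
    """Composite = SUS minus the (already weighted and capped) penalty."""
    return sus_score(answers) - penalty


# Iteration 1 of the cycling-tour run: SUS 82.5, a11y penalty 30.
iteration_1 = [2, 1, 4, 1, 4, 2, 5, 1, 4, 1]
print(sus_score(iteration_1))              # 82.5
print(composite_score(iteration_1, 30.0))  # 52.5
```

The same function applied to the other iterations' answers gives 80.0, 87.5, and 80.0, matching the raw dataset.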

A run, iteration by iteration

Brief: “A landing page for a weekend cycling tour in rural Vermont.”

Four iterations of the real loop against one brief — pages you could imagine coming upon in the real world. The composite score climbed from a competent-but-flawed baseline (52) to a polished peak at iteration 3 (76), then slipped back as the critic tightened its stance on colour-contrast failures — a textbook local-optimum trajectory rather than a monotonic ascent. Everything below is verbatim: the actual screenshots, the critic's SUS answers, and the suggestions that rolled forward into the next generator call. Click any screenshot to open that iteration's generated HTML in a new tab.

[Chart: composite and SUS scores per iteration, 0–100 scale. Composite: #1 52, #2 65, #3 76, #4 56.]
Iteration #1
Composite 52
SUS 82
a11y penalty 30
Screenshot of iteration 1

Critic feedback

SUS answers (1–5 Likert, 10 items)

  • 2
  • 1
  • 4
  • 1
  • 4
  • 2
  • 5
  • 1
  • 4
  • 1

Suggestions sent to the next generator

  • Fix the 8 color-contrast failures — raise subtitle and body text on the dark green hero to at least 4.5:1; the rgba(255,255,255,.78) value (~3.8:1) is the likely culprit, bumping opacity to 0.92+ would resolve it.
  • Make in-body links distinguishable without color (e.g., add an underline or bold weight) to address the 3 link-in-text-block violations, ensuring they're usable in grayscale or by color-blind visitors.
  • Add a visible focus indicator everywhere — nav links use `outline: none` on hover/focus which removes the ring for keyboard users; replace with a consistent amber or white outline.
  • The hero stat bar ('2 Days / 62 mi / 4,800' / 40 Riders Max / Sept') is visually strong but cut off at the very bottom fold on typical viewports — nudge it fully into view or add a subtle scroll-down hint so users discover it.
  • ARIA role mismatches on 3 nodes and an `aside` inside a landmark should be cleaned up to avoid screen-reader confusion — audit the decorative SVG and sidebar elements to ensure roles match their semantic purpose.
  • Consider adding a sticky 'Spots remaining' counter or a short social-proof line (e.g., 'Sold out in 2024') near the CTA — the page's visual quality earns trust but gives no urgency signal for a capacity-limited tour.
Iteration #2
Composite 65
SUS 80
a11y penalty 15
Screenshot of iteration 2

Critic feedback

SUS answers (1–5 Likert, 10 items)

  • 2
  • 2
  • 4
  • 1
  • 4
  • 2
  • 5
  • 1
  • 4
  • 1

Suggestions sent to the next generator

  • Resolve all 5 color-contrast violations flagged by axe-core — audit amber text on dark-green backgrounds and light-gray body text on the cream background, as these are the most likely offenders.
  • The 'SCROLL' cue is tiny and low-contrast against the hero background; increase its size or replace with a more prominent animated chevron so users know substantial content awaits below the fold.
  • Add a brief difficulty or fitness-level indicator near the '62 mi per day / 4,800 ft climb' stats — prospective riders need to self-qualify before registering, and omitting this creates uncertainty.
  • The 'Register Now' nav CTA and 'Reserve Your Spot' hero CTA both lead to the same action but use different labels — unify the wording to reduce cognitive friction.
  • Include a visible price anchor (even a 'From $X' teaser) near the primary CTA; without it, users may hesitate to click because they don't know what commitment they're walking into.
  • The desktop nav has five links plus a CTA — ensure a hamburger or equivalent exists for mobile viewports, as the current layout will overflow on small screens.
Iteration #3
Composite 76
SUS 88
a11y penalty 12
Screenshot of iteration 3

Critic feedback

SUS answers (1–5 Likert, 10 items)

  • 2
  • 1
  • 5
  • 1
  • 4
  • 1
  • 5
  • 1
  • 4
  • 1

Suggestions sent to the next generator

  • Fix all 4 axe-core color-contrast violations — likely the small eyebrow/label text and the price-anchor line against the dark hero background; bump type size or darken/lighten the foreground color to achieve 4.5:1 minimum.
  • Reconcile the headline claim 'Two Hundred Miles' with the stats bar value of '62 mi' — if 62 mi is per-day average, add a '/day' label; otherwise correct the headline to avoid eroding trust for detail-oriented cyclists.
  • The stats bar is cut off at the bottom of the viewport in the initial view — either reduce hero padding so the stats row is fully visible above the fold, or surface the most critical stats (distance, elevation, dates) in the hero body copy itself.
  • Add social proof (past-rider photos, 1–2 short testimonials, or a press mention badge) in the hero or immediately below the fold — for a ~$449 purchase decision, trust signals near the primary CTA materially increase conversion.
  • The mobile hamburger menu is present in CSS but hidden on desktop; verify it is fully keyboard- and screen-reader-accessible on small viewports, as the nav links carry the site's entire wayfinding.
  • Consider adding a sticky 'spots remaining' micro-banner or progress bar once users scroll past the hero, so the scarcity signal stays visible throughout the longer scroll journey to the registration form.
Iteration #4
Composite 56
SUS 80
a11y penalty 24
Screenshot of iteration 4

Critic feedback

SUS answers (1–5 Likert, 10 items)

  • 3
  • 2
  • 4
  • 1
  • 4
  • 2
  • 5
  • 2
  • 4
  • 1

Suggestions sent to the next generator

  • Fix the 8 color-contrast violations: the slate-600 (#4b5563) text on cream (#faf8f3) background falls below 4.5:1 — darken to slate-700 (#374151) or increase font size to ≥18px to meet WCAG AA.
  • Add a mobile hamburger menu — the nav-links are only visible on wide viewports; at the screenshot's apparent ~400px width they are hidden with no visible toggle shown.
  • Increase the stats bar stat-label font size from 0.75rem to at least 0.8125rem (13px) to improve legibility of the uppercase tracking labels ('DAYS', 'ELEV CLIMB', etc.).
  • The urgency dot animation ('12 spots remaining') draws the eye but the sentence is easy to miss at 0.88rem in the hero. Consider bumping to 0.95rem and adding a subtle highlight background to make the scarcity signal more scannable.
  • Add visible section landmarks (e.g., <main>, landmark roles) so screen-reader users and keyboard navigators can jump between sections without relying solely on the skip link and nav anchors.
  • The hero price anchor ('From $449') uses a very small 0.88rem font below the CTA buttons; consider surfacing it more prominently or moving it closer to the primary 'Reserve Your Spot' button to reduce the distance between the value proposition and the price.

Now the same loop, for conversations

design-gan isn't only about websites; it's about treating any generated artifact as something you can evolve under critique. The second half of the system runs the identical loop over conversations: a user simulator plays a person with a goal, the assistant replies over up to five turns, and the critic scores the whole transcript on a parallel Conversation Usability Scale (CUS).

The composite score formula is unchanged: CUS (0–100) minus weighted objective penalties (repetition, boilerplate phrases, length bloat, unresolved goals), with the penalty capped at 30 so reasoning can't be buried by nitpicks.
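
A minimal sketch of the capped penalty, assuming the objective metrics arrive as simple counts. Only the cap of 30 comes from the text; the metric keys and the weights in `WEIGHTS` are hypothetical values chosen for illustration.

```python
from typing import Dict

PENALTY_CAP = 30.0  # stated above: the penalty never exceeds 30

# Hypothetical weights per metric; the real weighting is not given here.
WEIGHTS: Dict[str, float] = {
    "repetition": 10.0,
    "boilerplate": 5.0,
    "length_bloat": 5.0,
    "unresolved_goal": 20.0,
}


def conversation_penalty(metrics: Dict[str, float]) -> float:
    """Weighted sum of objective metrics, clipped at PENALTY_CAP."""
    raw = sum(weight * metrics.get(name, 0.0) for name, weight in WEIGHTS.items())
    return min(PENALTY_CAP, raw)


def conversation_composite(cus: float, metrics: Dict[str, float]) -> float:
    """Composite = CUS (0-100) minus the capped penalty."""
    return cus - conversation_penalty(metrics)
```

With the cap in place, a transcript that trips every metric still loses at most 30 points, so the critic's CUS verdict continues to dominate the composite.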

Peak composite
100
Iterations
4
Critics (ensemble)
3
Turn count per iter
1 → 2 → 4 → 3

How a conversation iteration works

goal ──▶ generator ──▶ assistant system prompt
                                    │
                                    ▼
       user sim ◀───────────▶ assistant (up to N turns)
                    │
                    ▼
            transcript ──▶ objective metrics
                    │
                    ▼
       critic(s) ◀── Conversation Usability Scale (10-item)
            │
            ├─▶ CUS answers (10 × 1–5)
            ├─▶ prose feedback
            └─▶ prioritized suggestions ──▶ next generator call

The user simulator is another Claude — same model as the assistant — given a persona sketch and the goal. It decides when it's satisfied and ends the conversation naturally, so turn count is an output, not just a cap.
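
The turn-taking described above might look roughly like this. It is a sketch under assumptions: `assistant_reply` and `user_reply` are hypothetical callables standing in for the two model calls, and a "turn" is counted as one assistant reply, which matches the transcripts shown in this write-up.

```python
from typing import Callable, List, Tuple

Message = Tuple[str, str]  # (role, text)


def run_conversation(
    opening: str,
    assistant_reply: Callable[[List[Message]], str],
    user_reply: Callable[[List[Message]], Tuple[str, bool]],
    max_turns: int = 5,  # "up to five turns"
) -> Tuple[List[Message], int, bool]:
    """Alternate assistant and user-simulator messages until the user is satisfied."""
    transcript: List[Message] = [("user", opening)]
    turns = 0
    satisfied = False

    while turns < max_turns and not satisfied:
        transcript.append(("assistant", assistant_reply(transcript)))
        turns += 1
        # The simulator returns its next message and whether its goal is met.
        text, satisfied = user_reply(transcript)
        transcript.append(("user", text))

    return transcript, turns, satisfied
```

Because the simulator's own verdict ends the loop, the returned turn count reflects how quickly the goal was met rather than just the cap.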

A real run — goal: “I keep procrastinating on writing my thesis. How do I actually start?”

Iter 1 scored 100 out of the gate. Iters 2–4 landed at 97.5 as the critic pushed for structure and the assistant introduced trade-offs — a textbook local optimum where “improving” can only cost a little. Look at the transcripts to see the actual story:

[Chart: composite and CUS scores per iteration, 0–100 scale. Composite: #1 100, #2 98, #3 98, #4 98.]
Iteration #1
Composite 100
CUS 100
penalty 0
turns 1
user
I keep procrastinating on writing my thesis and I don't know how to actually get started. Every time I sit down I just freeze or find something else to do. Any practical advice for breaking through that?
assistant
Open your document right now and write one bad sentence about your thesis — any sentence, even "This thesis argues that..." and then nothing. That's your only job for the next 2 minutes. The freezing isn't a focus problem, it's a fear problem. Your brain is treating the blank page like a commitment to be right, so it refuses to start. The fix is making the first move so low-stakes it can't fail. Here's the exact sequence: 1. Set a 10-minute timer. 2. Write ugly. No editing, no rereading — just forward motion. 3. When the timer goes off, stop. You've broken the inertia. Go. Come back if you get stuck.
user
That actually makes sense — treating it like a fear problem instead of a focus problem clicks for me. The 'write one bad sentence' thing is low enough stakes that I can actually do it. I think my goal here is met.

CUS answers (10 items, user-sim's verdict)

  • 5
  • 1
  • 5
  • 1
  • 5
  • 1
  • 5
  • 1
  • 5
  • 1

Critic feedback (goal resolved)

Suggestions sent to the next generator

  • Collapse the numbered list into a single punchy sentence to shave even more length — the list formatting adds visual weight that can slow momentum for a user who wants to act immediately.
  • The 'fear not focus' reframe was the highest-value insight; consider making it slightly more thesis-specific (e.g., 'committing an imperfect argument to paper') to hit even harder.
  • Add a one-line safety net earlier — e.g., 'If 2 minutes feels like too much, aim for one sentence' — to preempt the most likely failure mode without adding bulk.
  • The ending 'Go. Come back if you get stuck.' is excellent; standardize it as a template for wrap-ups so it's consistently this crisp.
  • For users who might loop back after trying and still freezing, priming a second-turn response about body-doubling or changing environment would keep momentum without cluttering the first response.
  • Nothing to fix here — the response matched the system prompt's intent precisely and the user confirmed resolution. Any additional advice would have diluted the impact.
  • If the user had come back stuck, the assistant could have offered body-doubling or the Pomodoro approach as a natural next layer — good to have in reserve.
  • The 3-step numbered list was slightly formal compared to the surrounding prose voice; could optionally fold those into two punchy sentences to keep the register consistent.
  • The response could have named the fear more specifically (e.g., 'fear of committing a bad idea to paper') for users who need that articulated before they can move — but the current level was appropriate for the opening turn.
  • The response is already excellent — suggestions are marginal improvements only.
Iteration #2
Composite 98
CUS 98
penalty 0
turns 2
user
I keep putting off writing my thesis and I don't know how to actually get started. Every time I sit down to write I either freeze up or find something else to do. Any practical advice for breaking through that?
assistant
The freeze isn't a focus problem — it's your brain refusing to commit an imperfect argument to the page. The fix is to make the first thing you write *deliberately* bad. Open a doc right now and type the worst possible version of your thesis: clunky, wrong, embarrassing. Something like "This thesis is about how X affects Y and it's kind of complicated." That sentence is now on the page, and the blank page problem is solved. The only rule for the next 25 minutes: writing and editing are separate jobs, and editing isn't on the schedule yet. Write garbage, keep moving. Go. You've got 25 minutes. Come back if you get stuck.
user
Okay that's actually useful — I like the 'write the worst version first' idea. I'll try it. But what do I do when I inevitably freeze up again mid-session?
assistant
When you freeze mid-session, it almost always means you've slipped back into editing mode without noticing — your internal critic showed up and you're now trying to write and judge simultaneously. The move: type "[I'm stuck here because...]" and finish that sentence honestly. It keeps the pen moving and usually names the actual snag (wrong argument? missing evidence? just scared?). If that doesn't unlock it, skip the stuck sentence entirely, drop a "[COME BACK]" placeholder, and write the next thing you *do* know — a blank space is unfixable later, a placeholder isn't.
user
This is solid, thanks. The placeholder trick is something I can actually use. I think I have enough to go try this now.

CUS answers (10 items, user-sim's verdict)

  • 5
  • 1
  • 5
  • 1
  • 5
  • 1
  • 4
  • 1
  • 5
  • 1

Critic feedback (goal resolved)

Suggestions sent to the next generator

  • In turn 2, the '[I'm stuck here because...]' tip was slightly abstract — a single inline example of a completed sentence (e.g., 'I'm stuck here because I don't have the evidence yet') would have made it immediately actionable without adding length.
  • The 25-minute timer was introduced without framing; briefly tying it to the writing/editing separation principle ('25 minutes of write-only mode') would have reinforced the core insight rather than making it feel like an arbitrary number.
  • Attributing the 'write the worst version' technique to a recognizable source (e.g., Anne Lamott's 'shitty first drafts') would have slightly increased credibility and given the user something to look up if they wanted to go deeper.
  • Turn 3 ended cleanly per the system prompt, but since the user signaled they were ready to go — rather than that they'd already tried it — a one-line nudge like 'Start the timer as soon as this window closes' could have converted intent to immediate action.
  • The conversation assumed a blank-page scenario that matched the user's actual situation, but a single five-word diagnostic question early on (e.g., 'Blank page or stuck mid-draft?') would future-proof the approach for users who are blocked at a different stage.
  • In turn 3, the second sentence ('your internal critic showed up and you're now trying to write and judge simultaneously') is slightly redundant with the first — could trim to one punch for tighter delivery.
  • The closing line of turn 1 ('Come back if you get stuck') turned out to be exactly what the user did — consider whether that invitation slightly lengthens the conversation in a way that doesn't serve someone who could have self-resolved; it's a minor tradeoff.
  • The example thesis sentence ('This thesis is about how X affects Y and it's kind of complicated') is useful but generic — if the assistant had prompted the user to fill in their actual topic, it would have made the exercise more grounded and harder to skip.
  • After user resolution in turn 4, an optional ultra-short send-off ('Go. Come back after the session.') would have been a clean cap — the absence of any reply here is fine, but the system prompt's formula could have landed one last time.
  • The '[I'm stuck because...]' technique could briefly name why it works (keeps pen moving, externalizes the block) — one clause would add persuasive weight without bloat.
Iteration #3
Composite 98
CUS 98
penalty 0
turns 4
user
I keep procrastinating on my thesis and I can't seem to actually start writing. Any advice on how to break through that and just get going?
assistant
Blank page or stuck mid-draft?
user
Blank page. I know roughly what I want to say but every time I sit down to write I just freeze and end up doing anything else.
assistant
That freeze isn't a focus problem — it's fear of locking in an imperfect argument. Your brain keeps stalling because it's trying to write the *good* version before you've written *any* version. So right now, before you close this window, type one deliberately terrible sentence. Something like: "This thesis argues that [your topic] matters because [reason], though I haven't actually figured out why yet." That's it. That's the whole job for the next two minutes — not a draft, just proof that words can exist on the page. Go. Start the timer as soon as this window closes. Come back if you get stuck.
user
Okay that actually makes sense. I did write the bad sentence. Now what? Do I just keep writing garbage until something good comes out?
assistant
Yes — but with one rule: no editing for the next 25 minutes. Write-only mode. The freeze comes back the second you start judging what you just wrote, so that's a completely separate session. If you hit a spot where you don't have the evidence or logic yet, just type `[stuck here — need to figure out X]` and keep moving. That bracket is a placeholder, not a wall. Go. Start the timer as soon as this window closes. Come back if you get stuck.
user
Got it — that actually clicks. I'll try the 25 minutes. What do I do when the timer goes off? Is that when I go back and edit, or take a break, or what?
assistant
Take a five-minute break — stand up, get water, don't look at the page. Then come back and read what you wrote *without touching it* yet; just get the lay of the land. Editing is a different cognitive mode and you need a small reset between the two or you'll start pruning sentences that are actually fine. Go. Start the timer as soon as this window closes. Come back if you get stuck.
user
That's a solid loop — write, break, read. I think I actually have enough to work with now. Thanks, this was more useful than I expected.

CUS answers (10 items, user-sim's verdict)

  • 5
  • 1
  • 5
  • 1
  • 5
  • 1
  • 4
  • 1
  • 5
  • 1

Critic feedback (goal resolved)

Suggestions sent to the next generator

  • Vary or drop the repeated closing formula after the second use — by turn 7 it reads as a canned script rather than a genuine push.
  • After the user successfully wrote the bad sentence, briefly acknowledge the win before pivoting to the next tactic — even one word ('Good.') builds momentum and models the 'write without judging' behavior the assistant is trying to instill.
  • Ask the user's actual thesis topic at the diagnostic stage — even a brief anchor ('What's the topic in one line?') lets you model the fill-in sentence with real content, making it more actionable than a bracketed placeholder.
  • On the post-timer question, offer one concrete heuristic for whether to edit or take another write pass — something like 'if you filled more than half a page, edit; if not, take one more write round' gives the user a decision rule rather than leaving them to feel it out.
  • Consider briefly naming the Anne Lamott or academic source behind the 'shitty first drafts' framing — users who've heard the concept before get an instant credibility anchor, and those who haven't can look it up for reinforcement after the session.
  • Vary the closing push across turns — identical wording three times in a row reveals the template; even small paraphrasing ('Now go. Timer on, window closed.') preserves the punch without the mechanical feel.
  • When the user signals success ('I did write the bad sentence'), briefly acknowledge the win before pivoting to the next move — one short clause is enough and keeps the rapport human rather than purely task-sequential.
  • The bracket trick and the 25-minute rule were stacked in the same turn despite the system prompt's 'use one at a time' rule — splitting them would have kept each technique more memorable and avoided cognitive load.
  • In the post-timer turn, 'don't look at the page' is slightly ambiguous — 'don't open the doc during the break' would be more concrete and prevent the user from second-guessing what counts.
  • Turn 7 could have named *why* editing is a different cognitive mode in one clause (e.g., 'analytical vs. generative') — it was stated but that one-phrase anchor would make the advice more memorable and transferable.
Iteration #4
Composite 98
CUS 98
penalty 0
turns 3
user
I keep procrastinating on my thesis and can't seem to actually start writing. Every time I sit down I either freeze up or end up doing something else. What's a practical way to break through this?
assistant
Two quick questions before I give you anything concrete: 1. Are you stuck at a blank page, or are you mid-draft and stalled? 2. What's your thesis topic in one line?
user
Blank page — I haven't written a single paragraph yet. Topic is something like 'the impact of social media on political polarization in Gen Z.' I've done research, I have notes, I just can't seem to actually write anything.
assistant
Here's what's actually happening: you're not failing to focus — you're afraid of writing something wrong and having it exist. The blank page feels safe because it can't be judged yet. That's the real block. The antidote is to write one sentence that's *deliberately* bad. Right now, before anything else: > "This thesis argues that social media's role in Gen Z political polarization matters because [reason], though I haven't fully figured out why yet." That's your sentence to write — use your actual words, make it worse if you want. The point is to break the seal. This is Anne Lamott's shitty-first-drafts principle: the first draft's only job is to exist, not to be good. Write that sentence now. Tell me when you have it.
user
Okay I actually tried it and wrote something embarrassingly bad. It worked a little — I felt less frozen. But now what? I don't want to stall again after one sentence.
assistant
Good — that's the hardest move, and you made it. Now capitalize on it: set a 25-minute write-only window. No editing, no rereading what you just wrote — just keep generating. Editing is analytical mode; writing is generative. They can't run at the same time, so don't try. This isn't a willpower thing, it's a different cognitive session entirely. Open the doc, start the timer, write forward from that bad sentence. Nothing else. Go. Start the timer as soon as this window closes.
user
That's actually really helpful — I have a clear next move now. Thanks.

CUS answers (10 items, user-sim's verdict)

  • 5
  • 1
  • 5
  • 1
  • 5
  • 1
  • 4
  • 1
  • 5
  • 1

Critic feedback (goal resolved)

Suggestions sent to the next generator

  • Consider offering a short provisional hypothesis *while* asking the two questions, so the user gets something actionable even if they don't reply — reduces the cost of the clarification round-trip.
  • The 'write one deliberately bad sentence' tactic could include a brief instruction on where to put it (top of the doc, new file, notes doc) to remove the micro-decision that can itself stall action.
  • When the user reported the sentence worked 'a little,' acknowledging the partial success rather than a full win ('Good — that's the hardest move') more honestly mirrors their experience and could deepen trust.
  • The closing push ('Go. Start the timer as soon as this window closes.') is strong but could name a specific timer tool or method (phone clock, Pomofocus, etc.) for users who would otherwise spend time choosing one.
  • Trust calibration: the analytical-vs-generative claim is stated as fact with no hedging; a brief attribution (even 'cognitive science of writing suggests…') would make it more credible without adding filler.
  • Weave the two clarifying questions into a single natural sentence rather than a numbered list — it reads more like a coach and less like a chatbot intake form.
  • Make the bad-sentence template even rougher ('I think social media does something to Gen Z politics, but I'm not sure what') — a more obviously terrible example lowers the bar further and is easier for the user to beat.
  • After the 25-minute timer suggestion, give the user one post-timer decision rule unprompted (e.g., 'If you filled more than half a page, switch to editing next session; if not, do another write round') — it prevents the next stall before it happens.
  • When crediting Anne Lamott, briefly anchor why it's trustworthy ('her actual term is shitty first drafts — it's from a widely-used craft book') so the user knows it's a real, lookable-up concept rather than a coaching platitude.
  • Consider proactively naming what 'after the 25 minutes' looks like in one clause — users often stall again at the transition, and pre-empting it keeps momentum without adding length.
Raw CUS dataset for this conversation run (JSON)
[
  {
    "iter": 1,
    "cus": 100.0,
    "penalty": 0.0,
    "composite": 100.0,
    "turns_taken": 1,
    "satisfied": true,
    "sus_answers": [
      5,
      1,
      5,
      1,
      5,
      1,
      5,
      1,
      5,
      1
    ],
    "cost_usd": 0.24800925000000001
  },
  {
    "iter": 2,
    "cus": 97.5,
    "penalty": 0.0,
    "composite": 97.5,
    "turns_taken": 2,
    "satisfied": true,
    "sus_answers": [
      5,
      1,
      5,
      1,
      5,
      1,
      4,
      1,
      5,
      1
    ],
    "cost_usd": 0.2888682
  },
  {
    "iter": 3,
    "cus": 97.5,
    "penalty": 0.0,
    "composite": 97.5,
    "turns_taken": 4,
    "satisfied": true,
    "sus_answers": [
      5,
      1,
      5,
      1,
      5,
      1,
      4,
      1,
      5,
      1
    ],
    "cost_usd": 0.33798225000000004
  },
  {
    "iter": 4,
    "cus": 97.5,
    "penalty": 0.0,
    "composite": 97.5,
    "turns_taken": 3,
    "satisfied": true,
    "sus_answers": [
      5,
      1,
      5,
      1,
      5,
      1,
      4,
      1,
      5,
      1
    ],
    "cost_usd": 0.33016409999999996
  }
]

Design notes

Run it yourself

A live version is at design-gan.fly.dev. You can browse prior runs without any credentials, and start your own if you have the shared access token (ask the owner). Or run the whole loop locally:

pip install -e .
playwright install chromium
cp .env.example .env  # add your ANTHROPIC_API_KEY

# Web UI — kick off runs, watch them stream in live
design-gan viewer      # http://127.0.0.1:8000

# Or one evolution loop from the terminal
design-gan run "A landing page for a weekend cycling tour in rural Vermont."

# The conversation loop, for evolving an assistant against a goal
design-gan converse "I keep procrastinating on writing my thesis. How do I actually start?"

Source, setup details, and the Fly.io deploy config are in the GitHub repository.

Raw dataset for this run (JSON)
[
  {
    "iter": 1,
    "sus_score": 82.5,
    "axe_penalty": 30.0,
    "composite_score": 52.5,
    "sus_answers": [
      2,
      1,
      4,
      1,
      4,
      2,
      5,
      1,
      4,
      1
    ],
    "feedback": "Green Mountain Gravel is a well-crafted, visually cohesive landing page with a clear information hierarchy, prominent CTAs, and an attractive thematic design. The fixed nav, stat bar, and section structure make orientation effortless for any user. However, axe-core flags 8 nodes with insufficient color contrast (particularly the ~78% white subtitle text over dark green and small uppercase labels) and 3 in-body links indistinguishable without color, both of which meaningfully reduce usability for low-vision or color-blind users. The site's scope is inherently single-purpose, limiting 'frequent use' appeal, but within its task domain it performs confidently.",
    "suggestions": [
      "Fix the 8 color-contrast failures \u2014 raise subtitle and body text on the dark green hero to at least 4.5:1; the rgba(255,255,255,.78) value (~3.8:1) is the likely culprit, bumping opacity to 0.92+ would resolve it.",
      "Make in-body links distinguishable without color (e.g., add an underline or bold weight) to address the 3 link-in-text-block violations, ensuring they're usable in grayscale or by color-blind visitors.",
      "Add a visible focus indicator everywhere \u2014 nav links use `outline: none` on hover/focus which removes the ring for keyboard users; replace with a consistent amber or white outline.",
      "The hero stat bar ('2 Days / 62 mi / 4,800' / 40 Riders Max / Sept') is visually strong but cut off at the very bottom fold on typical viewports \u2014 nudge it fully into view or add a subtle scroll-down hint so users discover it.",
      "ARIA role mismatches on 3 nodes and an `aside` inside a landmark should be cleaned up to avoid screen-reader confusion \u2014 audit the decorative SVG and sidebar elements to ensure roles match their semantic purpose.",
      "Consider adding a sticky 'Spots remaining' counter or a short social-proof line (e.g., 'Sold out in 2024') near the CTA \u2014 the page's visual quality earns trust but gives no urgency signal for a capacity-limited tour."
    ],
    "cost_usd": 0.459
  },
  {
    "iter": 2,
    "sus_score": 80.0,
    "axe_penalty": 15.0,
    "composite_score": 65.0,
    "sus_answers": [
      2,
      2,
      4,
      1,
      4,
      2,
      5,
      1,
      4,
      1
    ],
    "feedback": "This is a polished, single-purpose landing page with strong visual hierarchy, a coherent green-and-amber palette, and well-placed CTAs that communicate the event's value quickly. The stats bar at the hero bottom is a smart design choice that surfaces key decision-making facts without requiring a scroll. The main drags on the score are the 5 axe-core color-contrast failures (real barriers for low-vision users), the inherently low 'use frequently' ceiling of an event registration page, and the limited confidence a first-time visitor might have in the registration flow which isn't visible above the fold.",
    "suggestions": [
      "Resolve all 5 color-contrast violations flagged by axe-core \u2014 audit amber text on dark-green backgrounds and light-gray body text on the cream background, as these are the most likely offenders.",
      "The 'SCROLL' cue is tiny and low-contrast against the hero background; increase its size or replace with a more prominent animated chevron so users know substantial content awaits below the fold.",
      "Add a brief difficulty or fitness-level indicator near the '62 mi per day / 4,800 ft climb' stats \u2014 prospective riders need to self-qualify before registering, and omitting this creates uncertainty.",
      "The 'Register Now' nav CTA and 'Reserve Your Spot' hero CTA both lead to the same action but use different labels \u2014 unify the wording to reduce cognitive friction.",
      "Include a visible price anchor (even a 'From $X' teaser) near the primary CTA; without it, users may hesitate to click because they don't know what commitment they're walking into.",
      "The desktop nav has five links plus a CTA \u2014 ensure a hamburger or equivalent exists for mobile viewports, as the current layout will overflow on small screens."
    ],
    "cost_usd": 1.557
  },
  {
    "iter": 3,
    "sus_score": 87.5,
    "axe_penalty": 12.0,
    "composite_score": 75.5,
    "sus_answers": [
      2,
      1,
      5,
      1,
      4,
      1,
      5,
      1,
      4,
      1
    ],
    "feedback": "Green Mountain Gravel is a well-crafted, single-purpose landing page with strong visual hierarchy, a cohesive green-and-amber design system, and clear CTAs that guide visitors straight to registration. The fixed nav with descriptive section labels, an upfront price anchor, and an urgency signal all reduce friction effectively. Four serious color-contrast violations (flagged by axe-core) affect legibility for some users, and the hero headline 'Two Hundred Miles' appears to conflict with the stats bar showing '62 mi,' which could momentarily undermine trust. Overall this is an above-average marketing page that would perform well in a usability study, but small inconsistencies and accessibility gaps keep it from a top score.",
    "suggestions": [
      "Fix all 4 axe-core color-contrast violations \u2014 likely the small eyebrow/label text and the price-anchor line against the dark hero background; bump type size or darken/lighten the foreground color to achieve 4.5:1 minimum.",
      "Reconcile the headline claim 'Two Hundred Miles' with the stats bar value of '62 mi' \u2014 if 62 mi is per-day average, add a '/day' label; otherwise correct the headline to avoid eroding trust for detail-oriented cyclists.",
      "The stats bar is cut off at the bottom of the viewport in the initial view \u2014 either reduce hero padding so the stats row is fully visible above the fold, or surface the most critical stats (distance, elevation, dates) in the hero body copy itself.",
      "Add social proof (past-rider photos, 1\u20132 short testimonials, or a press mention badge) in the hero or immediately below the fold \u2014 for a ~$449 purchase decision, trust signals near the primary CTA materially increase conversion.",
      "The mobile hamburger menu is present in CSS but hidden on desktop; verify it is fully keyboard- and screen-reader-accessible on small viewports, as the nav links carry the site's entire wayfinding.",
      "Consider adding a sticky 'spots remaining' micro-banner or progress bar once users scroll past the hero, so the scarcity signal stays visible throughout the longer scroll journey to the registration form."
    ],
    "cost_usd": 1.096
  },
  {
    "iter": 4,
    "sus_score": 80.0,
    "axe_penalty": 24.0,
    "composite_score": 56.0,
    "sus_answers": [
      3,
      2,
      4,
      1,
      4,
      2,
      5,
      2,
      4,
      1
    ],
    "feedback": "The landing page is visually polished and well-structured for its narrow purpose \u2014 a single-weekend cycling tour \u2014 with a clear hero, stat bar, and prominent CTA. Navigation is minimal and purposeful, making the site very easy to learn. The primary usability concern is the 8 color-contrast violations (axe-core), most visibly the small amber-on-dark and medium-gray-on-cream text combinations, which reduce readability for users with low vision. Overall this is a competent, above-average marketing page that would benefit from contrast fixes and slightly larger body text before it clears a formal accessibility audit.",
    "suggestions": [
      "Fix the 8 color-contrast violations: the slate-600 (#4b5563) text on cream (#faf8f3) background falls below 4.5:1 \u2014 darken to slate-700 (#374151) or increase font size to \u226518px to meet WCAG AA.",
      "Add a mobile hamburger menu \u2014 the nav-links are only visible on wide viewports; at the screenshot's apparent ~400px width they are hidden with no visible toggle shown.",
      "Increase the stats bar stat-label font size from 0.75rem to at least 0.8125rem (13px) to improve legibility of the uppercase tracking labels ('DAYS', 'ELEV CLIMB', etc.).",
      "The urgency dot animation ('12 spots remaining') draws the eye but the sentence is easy to miss at 0.88rem in the hero. Consider bumping to 0.95rem and adding a subtle highlight background to make the scarcity signal more scannable.",
      "Add visible section landmarks (e.g., <main>, landmark roles) so screen-reader users and keyboard navigators can jump between sections without relying solely on the skip link and nav anchors.",
      "The hero price anchor ('From $449') uses a very small 0.88rem font below the CTA buttons; consider surfacing it more prominently or moving it closer to the primary 'Reserve Your Spot' button to reduce the distance between the value proposition and the price."
    ],
    "cost_usd": 1.583
  }
]
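
The numbers in these records can be re-derived by hand. Each `sus_score` matches standard System Usability Scale scoring of its `sus_answers`: odd-numbered items contribute (answer - 1), even-numbered items contribute (5 - answer), and the sum is multiplied by 2.5. The sketch below reproduces that arithmetic in Python. The function names are hypothetical, and treating the composite as the SUS score minus `axe_penalty` is an inference from iterations 2 to 4 above, not a confirmed description of the project's code. How `axe_penalty` itself is derived from axe-core output is not modeled here.

```python
def sus_score(answers):
    """Standard SUS scoring for ten 1-5 Likert answers, in questionnaire order."""
    if len(answers) != 10:
        raise ValueError("SUS expects exactly 10 answers")
    total = 0
    for item_number, answer in enumerate(answers, start=1):
        if item_number % 2 == 1:
            # Odd items are positively worded: higher answers are better.
            total += answer - 1
        else:
            # Even items are negatively worded: lower answers are better.
            total += 5 - answer
    return total * 2.5


def composite_score(answers, axe_penalty):
    """Assumed composite: SUS score minus the accessibility penalty."""
    return sus_score(answers) - axe_penalty


# Answers and penalties copied from the iteration records above.
records = [
    {"iter": 2, "sus_answers": [2, 2, 4, 1, 4, 2, 5, 1, 4, 1], "axe_penalty": 15.0},
    {"iter": 3, "sus_answers": [2, 1, 5, 1, 4, 1, 5, 1, 4, 1], "axe_penalty": 12.0},
    {"iter": 4, "sus_answers": [3, 2, 4, 1, 4, 2, 5, 2, 4, 1], "axe_penalty": 24.0},
]

for record in records:
    sus = sus_score(record["sus_answers"])
    comp = composite_score(record["sus_answers"], record["axe_penalty"])
    print(record["iter"], sus, comp)
```

Read this way, iteration 4 regresses mostly because of the penalty: its SUS score equals iteration 2's, but the penalty rises from 12.0 to 24.0 relative to iteration 3, which pulls the composite well below the iteration 3 peak.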