AI-generated React code needs an inverted test pyramid: strict type contracts on top, behaviour tests in the middle, and a thin end-to-end pass at the point. The failure mode that breaks teams is letting the same prompt write both the implementation and the tests, which produces code that agrees with itself but not with reality. The stack assumed throughout is React 19, Vitest 4.x, React Testing Library 16.x, Playwright 1.6x, TypeScript 6.x, and Zod 4.x.
TL;DR, the inverted testing pyramid for AI-generated code
Flip the classic pyramid. Put TypeScript and Zod contracts at the wide top, behaviour-level Testing Library tests in the middle, and a small number of Playwright happy-paths at the point. Write the contracts and the end-to-end flow yourself before prompting, then ask the AI to make those failing tests pass. Stop tracking coverage percentage and start tracking defect-escape rate per release. The order matters because the AI has a verifiable target instead of inventing both the code and its own definition of success.
diagram: the inverted pyramid for AI-generated code
[ TypeScript strict + Zod contracts ] <- you write these first
[ RTL behaviour tests ] <- you review every assertion
[ Playwright E2E ] <- one happy-path per critical flow
v
AI fills in the implementation underneath
The single line to remember: contracts before implementation, never both in the same prompt.
Why AI-generated tests do not verify AI-generated code
A test only verifies code when it was written against a different mental model than the code under test. AI prompts that produce both pieces in one shot break that property. The model hallucinates a hook signature, then hallucinates a mock that matches the hallucination, and both halves of the conversation agree with each other rather than with the real API. The test passes. The code ships. Production breaks.
Here is the failure mode in two snippets. Suppose we ask an AI to "write a useUserProfile hook and a test for it". The hook the AI produces looks reasonable:
src/features/users/useUserProfile.ts
import { useEffect, useState } from 'react'
type Profile = {
  id: string
  displayName: string
  email: string
}
export function useUserProfile(userId: string) {
  const [profile, setProfile] = useState<Profile | null>(null)
  useEffect(() => {
    fetch(`/api/users/${userId}`)
      .then((response) => response.json())
      .then((payload) => setProfile(payload.user))
  }, [userId])
  return profile
}
The hook reads payload.user. Its matching test mocks the same shape:
src/features/users/useUserProfile.test.ts
import { renderHook, waitFor } from '@testing-library/react'
import { afterEach, beforeEach, expect, test, vi } from 'vitest'
import { useUserProfile } from './useUserProfile'
beforeEach(() => {
  vi.stubGlobal(
    'fetch',
    vi.fn().mockResolvedValue({
      json: async () => ({ user: { id: '1', displayName: 'Ada', email: 'ada@example.com' } }),
    }),
  )
})
afterEach(() => {
  vi.unstubAllGlobals()
})
test('returns the user profile', async () => {
  const { result } = renderHook(() => useUserProfile('1'))
  await waitFor(() => expect(result.current?.displayName).toBe('Ada'))
})
Green tick. Ship it. Now look at the real response from the API:
actual response body from /api/users/:id
{ "data": { "user": { "id": "1", "displayName": "Ada", "email": "ada@example.com" } } }
The server returns data.user, not user. Our hook reads the wrong key, so payload.user is undefined and the profile never arrives. A component using it renders its empty state forever. No test caught the bug because the test was written by the same prompt that invented the wrong shape. Two wrongs agreed and the suite reported success.
The extractable rule is short. Tests written by the same context as the code do not verify the code. They verify the consistency of the context.
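The same failure can be reproduced without React or a test runner. The sketch below is a dependency-free illustration rather than the hook itself: readProfile, mockPayload, and realPayload are hypothetical names, and the two payloads copy the shapes shown above.

```typescript
type Profile = { id: string; displayName: string; email: string }

// Implementation as the AI wrote it: it assumes the profile lives at payload.user.
function readProfile(payload: unknown): Profile | undefined {
  return (payload as { user?: Profile }).user
}

// Mock invented in the same prompt: it repeats the implementation's assumption.
const mockPayload = {
  user: { id: '1', displayName: 'Ada', email: 'ada@example.com' },
}

// Shape the server actually sends: the profile is nested under data.user.
const realPayload = {
  data: { user: { id: '1', displayName: 'Ada', email: 'ada@example.com' } },
}

// The self-consistent check passes; the check against reality does not.
const passesAgainstMock = readProfile(mockPayload)?.displayName === 'Ada' // true
const passesAgainstReal = readProfile(realPayload)?.displayName === 'Ada' // false
```

Only the second check carries information about production, and it can only be written by someone who has looked at the real response.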
Layer 1, the type contract
Shape drift is the cheapest bug to catch and the most expensive to debug later. A misshapen object that slips past the compiler ends up as an "undefined is not a function" error three layers deep at runtime, often weeks after the AI introduced it. The first layer of the inverted pyramid stops shape drift before tests even run, by making the compiler and a runtime validator do work that no behaviour test should have to repeat.
TypeScript strict mode as a first-line filter
The TypeScript flags that matter for AI-generated code live in tsconfig.json. Three of them make the biggest difference: strict, noUncheckedIndexedAccess, and exactOptionalPropertyTypes. Without them, the AI will happily index into possibly-undefined arrays, pass through partial objects as if they were complete, and mark optional fields in ways the runtime does not agree with.
tsconfig.json
{
  "compilerOptions": {
    "target": "ES2022",
    "module": "ESNext",
    "moduleResolution": "bundler",
    "jsx": "react-jsx",
    "strict": true,
    "noUncheckedIndexedAccess": true,
    "exactOptionalPropertyTypes": true,
    "noImplicitOverride": true,
    "noFallthroughCasesInSwitch": true,
    "isolatedModules": true,
    "skipLibCheck": true
  }
}
The full list of strict flags is documented on the TypeScript strict mode page. Why these matter for AI-generated code is specific. AI tends to write code that "looks plausible" by averaging across patterns it has seen. Averaged code is exactly what loose TypeScript lets through. With noUncheckedIndexedAccess, the AI cannot write users[0].email without acknowledging that users[0] might be undefined, because the compiler will flag it. Turn on exactOptionalPropertyTypes and the AI can no longer pass { email: undefined } to a function that expects { email?: string }, because those are no longer the same type. The compiler becomes a second reviewer that does not get tired.
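A minimal sketch of what the two flags force the code to look like. The helpers firstEmail and buildContact are hypothetical, not part of any codebase referenced here. The code compiles with or without the flags, but with them enabled the shortcuts named in the comments are rejected.

```typescript
type Contact = { id: string; email?: string }

// With noUncheckedIndexedAccess, contacts[0] has type Contact | undefined,
// so writing contacts[0].email directly is a compile error.
function firstEmail(contacts: Contact[]): string | null {
  const first = contacts[0]
  return first?.email ?? null
}

// With exactOptionalPropertyTypes, { id, email: undefined } is not a Contact.
// The key has to be omitted when there is no value.
function buildContact(id: string, email?: string): Contact {
  return email === undefined ? { id } : { id, email }
}
```

Both functions are slightly longer than the version an assistant would write under loose settings, and that extra line is the acknowledgement the compiler is demanding.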
The extractable summary: strict TypeScript is the cheapest test you will ever write because you did not write it.
Zod or Valibot schemas at the API boundary
TypeScript stops at the compile step. The moment data crosses the wire, types are erased and the AI's optimistic assumption about the response shape meets reality. This is where a runtime schema catches what the compiler cannot. We use Zod 4.x to parse the response at the boundary, then hand the parsed object to the rest of the app.
src/api/users/getUser.ts
import { z } from 'zod'
const UserSchema = z.object({
  id: z.string(),
  displayName: z.string(),
  // Zod 4 top-level format; z.string().email() still works but is deprecated.
  email: z.email(),
})
// Mirrors the real response envelope: { data: { user } }.
const GetUserResponseSchema = z.object({
  data: z.object({
    user: UserSchema,
  }),
})
export type User = z.infer<typeof UserSchema>
export async function getUser(userId: string): Promise<User> {
  const response = await fetch(`/api/users/${userId}`)
  if (!response.ok) {
    throw new Error(`getUser failed: ${response.status}`)
  }
  const json = await response.json()
  // Throws a ZodError naming the offending path if the shape is wrong.
  const parsed = GetUserResponseSchema.parse(json)
  return parsed.data.user
}
This is the same failure as the hook example above, fixed at the boundary. If the server's response does not match the schema, GetUserResponseSchema.parse throws at runtime with an error that names the offending path. If the AI implementation drifts and starts reading parsed.user instead of parsed.data.user, the compiler rejects it, because the type inferred from the schema has no such key. The test for getUser does not need to know the inner shape. It only needs to assert that valid payloads round-trip and invalid ones throw. The schema is doing the work that fifty unit tests would otherwise approximate.
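The two assertions that contract test needs can be sketched without a test runner. To keep the example dependency-free, parseGetUserResponse below is a hand-written stand-in for GetUserResponseSchema.parse, not Zod itself. It checks only field presence and string types, where the real schema also validates the email format. All helper names are hypothetical.

```typescript
type User = { id: string; displayName: string; email: string }

function isRecord(value: unknown): value is Record<string, unknown> {
  return typeof value === 'object' && value !== null
}

// Stand-in for the schema: returns the user or throws with the path of the
// first field that is missing or has the wrong type.
function parseGetUserResponse(json: unknown): User {
  if (!isRecord(json)) throw new Error('Invalid input at "(root)"')
  const data = json['data']
  if (!isRecord(data)) throw new Error('Invalid input at "data"')
  const user = data['user']
  if (!isRecord(user)) throw new Error('Invalid input at "data.user"')
  const id = user['id']
  const displayName = user['displayName']
  const email = user['email']
  if (typeof id !== 'string') throw new Error('Invalid input at "data.user.id"')
  if (typeof displayName !== 'string') throw new Error('Invalid input at "data.user.displayName"')
  if (typeof email !== 'string') throw new Error('Invalid input at "data.user.email"')
  return { id, displayName, email }
}

// Small helper so both contract checks read as booleans.
function throwsOn(json: unknown): boolean {
  try {
    parseGetUserResponse(json)
    return false
  } catch {
    return true
  }
}

const ada = { id: '1', displayName: 'Ada', email: 'ada@example.com' }

// Contract check 1: a valid payload round-trips.
const validRoundTrips = parseGetUserResponse({ data: { user: ada } }).displayName === 'Ada'
// Contract check 2: the hallucinated shape (user at the top level) is rejected.
const hallucinatedShapeThrows = throwsOn({ user: ada })
```

In a real suite the same two checks would call the Zod schema directly, so the test never restates the field list.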
Teams that care about bundle size can use Valibot in the same role with a smaller footprint. The pattern is identical: parse at the boundary, infer the type from the schema, let the schema be the single source of truth for the shape.
When we are using a data-fetching library that already gives us a typed cache, the schema sits one layer below it. If you have not yet picked between Redux Toolkit Query and TanStack Query for that layer, the comparison in RTK Query vs TanStack Query walks through the trade-offs.
The extractable summary: TypeScript covers what the AI declares, Zod covers what the AI actually receives.
Layer 2, behaviour tests with Testing Library
Behaviour tests are where most AI-generated test suites fail to add value. The AI gravitates towards implementation-detail assertions because they are easy to write and they pass. Mock call counts, internal state, hook return shapes in isolation. None of these survive a refactor, and none of them catch the bug where the implementation works but the wiring is wrong. The middle layer of the pyramid uses Vitest 4.x with React Testing Library 16.x and asserts what the user sees, not what the code does.
What to assert, what to skip
Two assertions, same component, very different value. The first is what AI tends to produce when asked for "a test for the login form":
src/features/auth/LoginForm.lowsignal.test.tsx
import { render, screen } from '@testing-library/react'
import userEvent from '@testing-library/user-event'
import { expect, test, vi } from 'vitest'
import { LoginForm } from './LoginForm'
test('calls the submit handler', async () => {
  const onSubmit = vi.fn()
  const user = userEvent.setup()
  render(<LoginForm onSubmit={onSubmit} />)
  await user.type(screen.getByLabelText('Email'), 'ada@example.com')
  await user.type(screen.getByLabelText('Password'), 'password123')
  await user.click(screen.getByRole('button', { name: 'Sign in' }))
  expect(onSubmit).toHaveBeenCalledWith({
    email: 'ada@example.com',
    password: 'password123',
  })
})
This test passes whether the form is accessible or not, whether errors are shown or not, and whether the button is disabled during submission or not. It only checks that an internal function got called with the values it was given. Change the shape of the argument onSubmit receives, for example by renaming a key, and the test breaks for a reason unrelated to user experience.
The high-signal version asserts what a person interacting with the form would observe:
src/features/auth/LoginForm.test.tsx
import { render, screen } from '@testing-library/react'
import userEvent from '@testing-library/user-event'
import { expect, test } from 'vitest'
import { LoginForm } from './LoginForm'
test('shows a validation error when email is missing', async () => {
  const user = userEvent.setup()
  render(<LoginForm onSubmit={() => {}} />)
  await user.click(screen.getByRole('button', { name: 'Sign in' }))
  expect(await screen.findByRole('alert')).toHaveTextContent('Email is required')
})
test('disables the submit button while the request is in flight', async () => {
  const user = userEvent.setup()
  let resolve: (() => void) | undefined
  const onSubmit = () => new Promise<void>((r) => { resolve = r })
  render(<LoginForm onSubmit={onSubmit} />)
  await user.type(screen.getByLabelText('Email'), 'ada@example.com')
  await user.type(screen.getByLabelText('Password'), 'password123')
  await user.click(screen.getByRole('button', { name: 'Sign in' }))
  expect(screen.getByRole('button', { name: 'Sign in' })).toBeDisabled()
  resolve?.()
})
These tests survive a refactor of the submit handler, of the form library, and of the internal state representation. They break only when the user-visible behaviour changes, which is exactly when we want a test to break. The reason they catch AI drift better is that the AI tends to refactor implementation while preserving its own internal interfaces. Implementation-detail tests applaud that refactor. Behaviour tests notice when the visible error message disappears or the button stops disabling itself during submission.
The extractable summary: assert what the user can see, query, or interact with, never the internal function calls the AI invented to get there.
When component tests beat unit tests for AI code
A unit test calls a function in isolation. A component test renders the component and exercises the wiring between the hook, the view, and any context providers. For AI-generated code, the gap that matters most lives in that wiring. The AI is usually correct at the level of a single function and wrong at the level of how that function is plumbed into the rendered tree.
Consider a useUsers hook that returns { data, isLoading, error } and a UserList component that renders the data. The AI might implement both correctly in isolation and still wire them up so that isLoading is never read, leaving the user staring at an empty screen while the request is in flight. A unit test on useUsers returns the correct shape. A unit test on UserList with a fake data prop renders correctly. The bug lives in the render output during the loading state, which only a component test exercising the full render will catch.
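That gap can be shown without a renderer. In the sketch below the view is reduced to a plain function from hook state to text, an assumption made only to keep the example dependency-free. UsersState, viewIgnoringLoading, and viewWithLoading are hypothetical names. A real component test would render UserList with Testing Library and query the screen instead.

```typescript
type UsersState = {
  data: string[] | undefined
  isLoading: boolean
  error: Error | null
}

// Wiring as the AI produced it: the data branch is right, isLoading is never read.
function viewIgnoringLoading(state: UsersState): string {
  return (state.data ?? []).join(', ')
}

// Wiring the user needs: every state maps to something visible.
function viewWithLoading(state: UsersState): string {
  if (state.isLoading) return 'Loading users...'
  if (state.error) return `Could not load users: ${state.error.message}`
  return (state.data ?? []).join(', ')
}

const loading: UsersState = { data: undefined, isLoading: true, error: null }
const loaded: UsersState = { data: ['Ada', 'Grace'], isLoading: false, error: null }

// Checking only the loaded state cannot tell the two versions apart.
const sameWhenLoaded = viewIgnoringLoading(loaded) === viewWithLoading(loaded) // true
// Checking the loading state exposes the empty screen.
const blankWhileLoading = viewIgnoringLoading(loading) === '' // true
```

A test that feeds in finished data passes for both versions. Only a check made while the request is still in flight separates them.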
This is also where async state libraries earn their place. If you are using TanStack Query to manage that state and want patterns for testing components that depend on it, TanStack Query is not just for API requests covers the broader pattern of treating it as an async state container rather than only a fetcher.
The extractable summary: AI breaks the wiring more often than the units, so test the wiring.
Layer 3, the thin end-to-end pass
End-to-end tests are slow, flaky, and expensive to maintain. For AI-generated code we run very few of them, and the ones we keep are the ones we own ourselves. Playwright 1.6x is the assumed runner. The job of this layer is not coverage. The job is to catch the case where every contract test and every behaviour test passed and the user still cannot complete the flow.
One Playwright happy-path per critical flow
A critical flow is anything that touches money, identity, or data loss. Sign in, sign up, checkout, password reset, file upload, account deletion. For each one, we maintain a single Playwright spec that walks the happy path, and we write that spec before the AI writes the implementation underneath. The spec is the contract for the flow.
tests/e2e/sign-in.spec.ts
import { expect, test } from '@playwright/test'
// Tagged @critical so that a --grep @critical filter in CI selects this spec.
test('signs the user in and lands on the dashboard', { tag: '@critical' }, async ({ page }) => {
  await page.goto('/sign-in')
  await page.getByLabel('Email').fill('ada@example.com')
  await page.getByLabel('Password').fill('correct horse battery staple')
  await page.getByRole('button', { name: 'Sign in' }).click()
  await expect(page).toHaveURL('/dashboard')
  await expect(page.getByRole('heading', { name: 'Welcome, Ada' })).toBeVisible()
})
Four steps, two assertions, no implementation knowledge. The spec does not know whether the form uses controlled state or react-hook-form, whether the network call uses fetch or Axios, whether the redirect uses useNavigate or a server-side response. It knows that a real user types two fields, clicks a button, and ends up on the dashboard with their name visible. If any of those break, no amount of green unit tests will save the release.
The order matters. We write this spec first, then we hand it to the AI alongside the failing tests and ask for the implementation. The AI now has an executable definition of "done" that it did not invent. Ownership of the spec stays with us, in code review, and the AI never edits it without a human deciding to allow that edit.
The extractable summary: own the happy-path spec for every flow you cannot afford to break.
Visual regression as a hallucination detector
The other thing Playwright catches well is the AI silently moving the user interface. A common failure is the AI helpfully "improving" a layout while editing nearby code, dropping a CTA, renaming a label, or removing a section the test suite did not name. Screenshot diffing catches these because the baseline image is the ground truth.
tests/e2e/dashboard.visual.spec.ts
import { expect, test } from '@playwright/test'
test('dashboard matches the visual baseline', async ({ page }) => {
  await page.goto('/dashboard')
  await page.waitForLoadState('networkidle')
  await expect(page).toHaveScreenshot('dashboard.png', {
    maxDiffPixelRatio: 0.01,
  })
})
The trade-off is real. Every intentional UI change needs a baseline refresh, which means the visual suite has a maintenance cost that grows with how often the design changes. The way to keep the cost manageable is to limit visual specs to surfaces that should change rarely: marketing pages, sign-in, the empty state of the main app shell. Do not screenshot the whole product. Screenshot the parts you want to lock down.
The extractable summary: a screenshot baseline is the cheapest way to catch the AI rewriting parts of the UI you never asked it to touch.
The prompt pattern, test first then implementation
This is the operational core of the strategy. Most teams using AI for React code feed the AI a feature request and get back a component plus its tests in the same response. That is the failure mode the rest of the article exists to prevent. The fix is a two-step prompt where the test is written or reviewed by a human first, and the implementation is generated against the test as a target.
The prompt you paste into Cursor, Claude Code, or whichever assistant you use looks like this:
prompt template, paste into Cursor or Claude Code
Below is a Zod schema and a failing Vitest test for a new component called UserProfileCard.
Do not modify the schema. Do not modify the test.
Implement src/features/users/UserProfileCard.tsx so that the test passes.
Use only the data shapes defined by the schema.
If the test cannot be made to pass without changing it, stop and explain what you need.
--- src/features/users/userSchema.ts ---
import { z } from 'zod'
export const UserSchema = z.object({
  id: z.string(),
  displayName: z.string(),
  email: z.email(),
  avatarUrl: z.url().nullable(),
})
export type User = z.infer<typeof UserSchema>
--- src/features/users/UserProfileCard.test.tsx ---
import { render, screen } from '@testing-library/react'
import { expect, test } from 'vitest'
import { UserProfileCard } from './UserProfileCard'
const user = {
  id: '1',
  displayName: 'Ada Lovelace',
  email: 'ada@example.com',
  avatarUrl: null,
}
test('shows the display name and email', () => {
  render(<UserProfileCard user={user} />)
  expect(screen.getByRole('heading', { name: 'Ada Lovelace' })).toBeVisible()
  expect(screen.getByText('ada@example.com')).toBeVisible()
})
test('shows initials when there is no avatar', () => {
  render(<UserProfileCard user={user} />)
  expect(screen.getByLabelText('Avatar for Ada Lovelace')).toHaveTextContent('AL')
})
The structure does three things. Locking the schema means the AI cannot drift the shape. Locking the test means the AI cannot redefine success. A stop condition ("if the test cannot be made to pass without changing it, stop and explain") gives the AI an honest escape hatch instead of silently editing the test to match a broken implementation.
The same pattern works inside Claude Code or any agentic CLI by adding the test file to the context before the implementation file is touched. For onboarding a new contributor to this workflow, the broader pattern of feeding the AI verified artefacts first is covered in onboarding to a new codebase with AI tools.
The rule that compounds: never ask the AI to write the implementation and its tests in the same prompt. The test is upstream of the code, written or reviewed by a human, and the AI's job is to satisfy it.
The extractable summary: lock the contract, lock the test, let the AI fill in the middle.
CI gates and coverage numbers that actually mean something
The anti-pattern is the 95% coverage badge on a repository that still ships regressions every Friday. AI is excellent at writing tests that inflate the percentage without verifying anything meaningful. It mocks the things that should not be mocked, asserts the return value of vi.fn(), and snapshots the JSX of components whose visual output it has never run. Coverage goes up. Bugs do too.
The CI gates that earn their keep make a short list. Type check passes. Contract tests pass. Behaviour tests pass. The critical Playwright happy-path passes. Everything else is a signal, not a gate.
.github/workflows/ci.yml (excerpt)
jobs:
  verify:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: pnpm/action-setup@v4
      - uses: actions/setup-node@v4
        with:
          node-version: '22'
          cache: 'pnpm'
      - run: pnpm install --frozen-lockfile
      - run: pnpm typecheck
      - run: pnpm test:contracts
      - run: pnpm test:behaviour
      - run: pnpm exec playwright install --with-deps chromium
      - run: pnpm test:e2e --grep @critical
The split between test:contracts, test:behaviour, and test:e2e --grep @critical is deliberate. Contract tests run first because they are fastest and catch the largest class of AI drift. Behaviour tests run second because they catch wiring bugs that the compiler cannot see. Playwright runs filtered by a @critical tag so the gate only blocks on the flows you actually cannot afford to break. Everything else in the Playwright suite can run nightly.
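One way to back the test:contracts and test:behaviour scripts is with Vitest test projects. The configuration below is a sketch under assumptions, not the setup this article's repository uses: the project names, the *.contract.test.ts naming convention, and the choice of jsdom are all illustrative.

```typescript
// vitest.config.ts
import { defineConfig } from 'vitest/config'

export default defineConfig({
  test: {
    projects: [
      {
        // Inherit plugins and shared options from the root config.
        extends: true,
        test: {
          name: 'contracts',
          // Assumed naming convention for schema and boundary tests.
          include: ['src/**/*.contract.test.ts'],
          environment: 'node',
        },
      },
      {
        extends: true,
        test: {
          name: 'behaviour',
          // Assumed convention: component tests live in .test.tsx files.
          include: ['src/**/*.test.tsx'],
          environment: 'jsdom',
        },
      },
    ],
  },
})
```

With a configuration along these lines, test:contracts could map to vitest run --project contracts and test:behaviour to vitest run --project behaviour, so each CI step runs one layer and fails on its own.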
Replace the coverage badge with two metrics that map to the thing you care about. The first is defect-escape rate per release: the number of bugs reported in production within seven days of a release, divided by the number of releases. The second is the pre-release catch share: of all defects found, escaped or not, the fraction that some test layer caught before the release. A team that reduces defect-escape rate from 12% to 3%, that is from 12 escaped bugs per 100 releases to 3, has shipped a better suite. A team that lifts coverage from 70% to 95% has shipped more tests.
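Both numbers are simple ratios. The sketch below shows the arithmetic, assuming a hypothetical ReleaseRecord with per-release counts and reading the second metric as caught / (caught + escaped). The field names and sample figures are illustrative, not measurements.

```typescript
type ReleaseRecord = {
  escapedDefects: number // bugs reported in production within seven days
  caughtBeforeRelease: number // defects stopped by any test layer before shipping
}

// Escaped defects divided by the number of releases.
function defectEscapeRate(releases: ReleaseRecord[]): number {
  if (releases.length === 0) return 0
  const escaped = releases.reduce((sum, r) => sum + r.escapedDefects, 0)
  return escaped / releases.length
}

// Fraction of all known defects that were caught before release.
function caughtShare(releases: ReleaseRecord[]): number {
  const escaped = releases.reduce((sum, r) => sum + r.escapedDefects, 0)
  const caught = releases.reduce((sum, r) => sum + r.caughtBeforeRelease, 0)
  const total = escaped + caught
  // With no defects recorded at all, nothing escaped, so report a full share.
  return total === 0 ? 1 : caught / total
}

// Illustrative sample: four releases, three escaped bugs, twenty-one caught.
const sampleReleases: ReleaseRecord[] = [
  { escapedDefects: 1, caughtBeforeRelease: 7 },
  { escapedDefects: 0, caughtBeforeRelease: 5 },
  { escapedDefects: 2, caughtBeforeRelease: 9 },
  { escapedDefects: 0, caughtBeforeRelease: 0 },
]

const escapeRate = defectEscapeRate(sampleReleases) // 3 / 4 = 0.75
const catchShare = caughtShare(sampleReleases) // 21 / 24 = 0.875
```

Neither function looks at line coverage, which is the point: both inputs come from the bug tracker and the CI logs, not from the test runner's report.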
The strategy above is what an individual developer can adopt this week. Scaling it across a team takes process: who writes the contract tests, who reviews the AI's implementation, what gets generated and what stays handwritten, how the metrics flow back into prompt templates. The full team workflow, with role splits and the CI metric pipeline, is covered in the upcoming Vibe Code to Production book.
The extractable summary: the coverage number is a vanity metric for AI-generated code, the escape rate is the one to gate releases on.
The compounding change is the prompt pattern. Lock the contract, write the failing test, ask the AI for the implementation, and review what it produced against a target it did not invent. Every other layer in this pyramid gets cheaper once that habit is in place.


