What I've Learned Using Codex for Front-End Work, Acceptance Criteria, and Design

I have been using Codex since last October, starting with the CLI and staying with it through the App. I have seen it grow in real time. Most of the time I use it for iOS development, improving my own workflow, and building small tools for myself.

These are a few things I have learned so far. If you use Codex differently, I would be curious to compare notes.

A hand-drawn visual summary of Codex workflows for front-end design, acceptance criteria, and avoiding over-engineering.

Front-end: reference beats taste

As things stand, GPT models still do a bad job at making front-end work look good on their own. The taste is rough. Whether you ask for an iOS UI or a web interface, the result usually has obvious flaws.

The official front-end skill does help, but only to a point.

What matters is separating two different weaknesses. GPT’s problem is not visual understanding. Its problem is aesthetic judgment. It does not really know what looks good, but it usually does understand what it is looking at.

That means once you give it a clear reference, the odds improve a lot. If you first draft something in Stitch or Figma, or even sketch it by hand, Codex can often reproduce it surprisingly well.

The acceptance target also matters. The target should be visual consistency, not code consistency. If you define success as code consistency, GPT will focus on logic and structure. It will decide the code is “close enough” while the interface still looks wrong.

The better loop is to make it take screenshots, compare them against the draft, and then fix the gaps. You need a visual feedback loop if you want a high-quality reproduction.

Acceptance criteria: you still have to own them

You do not need to understand every implementation detail yourself. But if you do not understand the acceptance criteria, things fail very easily.

It is like renovating an apartment. You can tell the contractor you want a Scandinavian interior, but if you cannot tell what that style actually looks like, the contractor will probably improvise, and not in a way you like.

GPT already knows how to do TDD. It can write tests on its own. The real question is whether your task can be verified mechanically in the first place.

Most AI-generated acceptance criteria default to code-based checks. That is fine when the goal itself is concrete. But if the goal is abstract, then you still have to design the acceptance criteria yourself.

UI work is the obvious example. By default, GPT will write a few UI tests. It usually will not volunteer screenshot comparison as the core validation method.

The same applies to more abstract workflows. If you design a skill and want to know whether its workflow matches your intent, you can ask GPT to spin up a subagent to describe that skill back to you. If the description comes back wrong, you keep refining it. That is also a form of acceptance testing, just at a more abstract level.

Design: watch for over-engineering

Using an agent to help with design is good. But Codex sometimes over-designs the solution in a very obvious way.

Say you have a batch of customer comments and you want to identify the negative ones worth attention. You want a workflow where an agent helps you filter what actually needs follow-up.

If you leave everything to Codex, it may build something far too complete: more rules, more scripts, more layers than you actually need. And the more rules you add, the higher the maintenance cost gets. Very quickly you slide into rule hell.

In reality, the need may be much simpler. Maybe all you really need is to fan the task out to several agents and let them judge the comments in parallel.

So you still need to set the boundary yourself. Do not accept every layer of complexity just because GPT is eager to be helpful. The model’s instinct is to satisfy the request as fully as possible. Your job is to think about maintenance cost, complexity, and what the real requirement actually is.

These are the main things on my mind for now. I will probably write more once I have another round of scars from using it.