Reviewing AI-generated iOS code without lowering the bar

AI-generated iOS code is useful in the same way interns, snippets, and Stack Overflow answers are useful.

It can save time.

It can also confidently hand you a small bag of bugs shaped like productivity.

The mistake is not using AI to write code. That argument is already stale. The mistake is treating generated code as if it arrived from a better process than human code. It did not. It arrived from a faster process.

Speed is not trust.

If a teammate opens a pull request and says “I let the model write most of this,” the review should not become suspicious theater. It should also not become a ceremonial rubber stamp because the diff looks polished and the author feels emotionally attached to the robot.

The standard is simpler:

AI-generated code gets reviewed like any other code, with extra attention to the places where fluent nonsense is expensive.

Those places are not usually formatting.

They are behavior, ownership, lifecycle, tests, and production diagnosis.

1. Start with the product behavior, not the cleverness

Generated code often looks more complete than it is.

It has helpers. It has comments. It has private extension sections arranged with the confidence of someone who has never debugged a background refresh crash at 2 a.m.

That visual polish is dangerous because reviewers are human. We skim tidy code more generously than messy code. A neat implementation can smuggle in the wrong behavior with excellent indentation.

So the first review pass should ignore the implementation style and ask:

What user behavior changed?
What product rule is being encoded?
Which edge cases are supposed to be impossible?
What happens when real device conditions are rude?

For iOS work, “real device conditions” means things like:

the app is backgrounded halfway through the operation
network connectivity changes after the request starts
the user taps twice
permissions are denied, then later granted
the server returns partial data
the task is cancelled
the app launches with stale local state
an older supported OS version takes a different path

AI-generated code is especially good at the happy path because happy paths are overrepresented in examples.

Production is not an example.

If the pull request cannot explain the intended behavior under failure, the code is not ready for review. It is a draft with syntax highlighting.
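
One way to make a rule like “the user taps twice” reviewable is to write the states down. The sketch below is a hypothetical submit flow. SubmitState, SubmitEvent, and reduce are illustrative names, not taken from any real codebase. The only point is that the second tap and the cancellation path each have an answer a reviewer can read.

```swift
// Hypothetical submit flow. All names here are illustrative assumptions.
enum SubmitState: Equatable {
    case idle
    case submitting
    case succeeded
    case failed(String)
}

enum SubmitEvent {
    case tapSubmit
    case requestSucceeded
    case requestFailed(String)
    case cancelled
}

/// Pure transition function: every (state, event) pair has a defined result.
func reduce(_ state: SubmitState, _ event: SubmitEvent) -> SubmitState {
    switch (state, event) {
    case (.idle, .tapSubmit), (.failed, .tapSubmit):
        return .submitting
    case (.submitting, .requestSucceeded):
        return .succeeded
    case (.submitting, .requestFailed(let message)):
        return .failed(message)
    case (.submitting, .cancelled):
        return .idle
    default:
        // Includes the double tap: .tapSubmit while .submitting changes nothing.
        return state
    }
}
```

Whether a late failure after success should really be ignored is a product decision. The sketch just forces someone to make it.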

2. Make the author own the code

The worst phrase in an AI-assisted pull request is:

The model generated this.

That can be useful context. It is not an excuse.

The author still owns the design, the behavior, the tests, and the fallout. If a generated persistence migration corrupts data, the database will not accept “large language model” as a rollback plan.

A practical team rule:

AI may generate code.
The author must be able to explain every line they keep.
Reviewers should reject code the author cannot defend.

This sounds harsh until you consider the alternative: shipping code nobody understands because it arrived quickly and looked plausible.

That is not velocity.

That is technical debt with a better onboarding story.

During review, ask explanation questions when the shape looks borrowed:

Why is this state stored here?
Why does this run on the main actor?
What owns cancellation?
Why is this dependency introduced?
Which test proves this edge case?
What happens if this API returns an empty array instead of an error?

A good author can answer without blaming the tool.

A weak answer is not “AI code is bad.” It is “this code has not been internalized yet.” Different diagnosis, same result: keep reviewing.

3. Treat concurrency as a high-risk area

Swift concurrency is one of the easiest places for generated code to look modern and be wrong.

The model sees async APIs and reaches for Task {} the way some developers reach for Manager: instinctively, often, and with consequences left as an exercise for production.

Review generated concurrency code carefully.

Look for:

unstructured tasks started from views
missing cancellation paths
Task.detached used as compiler repellent
main actor isolation applied too broadly or not at all
mutable shared state hidden behind innocent-looking services
callback bridges that resume continuations incorrectly
async work that updates UI state after the owner should be gone

The dangerous part is not that AI invents new concurrency bugs.

It tends to reproduce the common ones neatly.
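
The callback bridge from the list above is a good example. A minimal sketch, assuming a hypothetical legacy API called legacyLoad that can report neither a value nor an error. MissingValueError and load are also made-up names. What matters is that the continuation is resumed exactly once on every path.

```swift
struct MissingValueError: Error {}

// Stand-in for a hypothetical legacy callback API. Like many such APIs,
// it can report "no value and no error" at the same time.
func legacyLoad(_ key: String, completion: @escaping (String?, Error?) -> Void) {
    if key.isEmpty {
        completion(nil, nil)
    } else {
        completion("value-for-\(key)", nil)
    }
}

/// Bridges the callback to async/await, resuming exactly once on every path.
func load(_ key: String) async throws -> String {
    try await withCheckedThrowingContinuation { continuation in
        legacyLoad(key) { value, error in
            if let error {
                continuation.resume(throwing: error)
            } else if let value {
                continuation.resume(returning: value)
            } else {
                // The branch a fluent bridge tends to forget. Without it,
                // the caller would wait forever.
                continuation.resume(throwing: MissingValueError())
            }
        }
    }
}
```

In review, trace each branch of the callback and confirm it ends in exactly one resume.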

For SwiftUI, this matters immediately. A generated view may start async work in onAppear, update @State after the user has navigated away, and run the request several times because onAppear fired again or the view's identity changed. The code compiles. The demo works. The app then behaves like it is haunted by a very punctual ghost.

Review should force an ownership story:

Who starts the work?
Who cancels it?
Which actor owns the mutable state?
Where do errors go?
What prevents duplicate work?
How is stale output ignored?

If those answers are not visible in the code or the PR description, the implementation is unfinished.

Not “AI-ish.”

Unfinished.
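
Here is what a visible ownership story can look like. This is a sketch under assumptions, not a prescribed design: SearchModel and its fetch closure are hypothetical, and the closure stands in for whatever networking the app really uses. The model starts the work, cancels the previous task, keeps state on the main actor, records errors, and ignores stale output with a generation token.

```swift
@MainActor
final class SearchModel {
    private(set) var results: [String] = []
    private(set) var lastErrorDescription: String?
    private var currentTask: Task<Void, Never>?
    private var generation = 0
    private let fetch: @Sendable (String) async throws -> [String]

    init(fetch: @escaping @Sendable (String) async throws -> [String]) {
        self.fetch = fetch
    }

    /// Starts a search. The model owns the task, so it also owns cancellation.
    func search(_ query: String) {
        currentTask?.cancel()          // prevents duplicate work
        generation += 1
        let token = generation         // identifies this request
        currentTask = Task { [weak self, fetch] in
            do {
                let items = try await fetch(query)
                // Stale output is dropped even if fetch ignored cancellation.
                guard let self, token == self.generation else { return }
                self.results = items
                self.lastErrorDescription = nil
            } catch is CancellationError {
                // Expected when a newer search replaced this one.
            } catch {
                guard let self, token == self.generation else { return }
                self.lastErrorDescription = String(describing: error)
            }
        }
    }

    /// Call when the owning screen goes away.
    func cancel() {
        currentTask?.cancel()
        currentTask = nil
        generation += 1
    }
}
```

Every question in the list above has a line a reviewer can point at. That is the bar, whatever the final design looks like.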

4. Be suspicious of generic architecture improvements

AI tools love architecture words.

They will introduce Protocol, Repository, UseCase, Coordinator, Factory, Provider, and Service with the relaxed abundance of a hotel breakfast buffet.

Some of that may be useful.

Much of it is abstraction without pressure.

Generated iOS code should be reviewed for whether the abstractions solve a real problem in this codebase, not whether they look respectable in a blog post.

Good reasons to accept an abstraction:

it isolates a third-party dependency
it separates pure logic from UI state
it makes tests cheaper and clearer
it expresses a real domain boundary
it prevents feature code from knowing too much
it lets one risky decision be changed in one place

Bad reasons:

the model generated it that way
the code “feels cleaner” after adding five files
every service now has a protocol because protocols are adult supervision
a simple feature was inflated to match an architecture diagram nobody will maintain

The review question is not “is this pattern valid?”

Most patterns are valid somewhere. So are umbrellas, but you do not bring one into the shower.

The question is:

Does this abstraction reduce current or near-future risk in this app?

If not, remove it.

Generated code often needs editing down more than building up. The useful reviewer is willing to say: this can be three functions, one type, and no ceremony.
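
As an illustration of editing down, imagine a generated change that added a clock protocol, a provider, and a factory so that one date check could be tested. A hypothetical trimmed version (TrialPolicy is an invented name) keeps the test seam and drops the rest:

```swift
import Foundation

// Hypothetical example: one type, one injected closure, no protocol or factory.
struct TrialPolicy {
    /// Test seam: production uses the real clock, tests pass a fixed date.
    var now: () -> Date = { Date() }
    let trialLength: TimeInterval

    func isTrialActive(startedAt start: Date) -> Bool {
        now().timeIntervalSince(start) < trialLength
    }
}

// Usage in a test: pin the clock instead of mocking a protocol.
let fixedNow = Date(timeIntervalSince1970: 1_000_000)
let sevenDays: TimeInterval = 7 * 24 * 60 * 60
let policy = TrialPolicy(now: { fixedNow }, trialLength: sevenDays)
```

If a second clock consumer or a real domain boundary shows up later, the protocol can be added then, under pressure, which is when abstractions earn their keep.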

5. Demand tests that match the risk

AI-generated code can produce tests quickly.

It can also produce tests that mainly prove the mock returned what the mock was told to return, which is a charming little puppet show and not much else.

Review tests by risk, not volume.

For iOS code, useful tests usually target:

state transitions
decoding and error mapping
date, locale, and currency edge cases
cancellation behavior
retry and backoff rules
persistence migrations
entitlement and subscription refresh paths
view state for important UI branches

Less useful tests:

verifying that a property assignment assigns the property
snapshotting a screen that changes every time marketing breathes
asserting implementation details that make refactoring expensive
testing generated mocks more thoroughly than production behavior

A model can generate ten tests and still miss the one failure mode that matters.

So the review question should be precise:

Which embarrassing production failure does this test prevent?

If nobody can answer, the test may still be harmless, but it is not evidence.

And if the generated code touches money, identity, data loss, notifications, sync, or app launch, “the model added tests” is not enough. Read them. Break them mentally. Check that they fail for the bug you actually fear.
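
A sketch of a test aimed at a real risk rather than at the mock. The types are hypothetical (ProfileDTO, ProfileViewState, and viewState(from:) are invented for illustration), and the checks use plain assert so the snippet stands alone. In a project they would live in whatever test framework the team already uses.

```swift
import Foundation

// Hypothetical transport type: the server may omit displayName.
struct ProfileDTO: Decodable {
    let id: String
    let displayName: String?
}

enum ProfileViewState: Equatable {
    case loaded(name: String)
    case incomplete(id: String)
    case failed
}

/// Maps raw response bytes to view state, including the partial-payload case.
func viewState(from data: Data) -> ProfileViewState {
    guard let dto = try? JSONDecoder().decode(ProfileDTO.self, from: data) else {
        return .failed
    }
    guard let name = dto.displayName, !name.isEmpty else {
        return .incomplete(id: dto.id)
    }
    return .loaded(name: name)
}

// The check that matters: a partial payload must not crash or show a blank name.
let partialPayload = Data(#"{"id":"42"}"#.utf8)
assert(viewState(from: partialPayload) == .incomplete(id: "42"))
```

The embarrassing production failure this prevents is a profile screen with an empty title. A test that only confirmed the mock returned its canned DTO would pass either way.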

The point of AI-assisted development is not to produce more green checkmarks.

It is to produce confidence faster without laundering uncertainty through automation.

6. Review API usage against real platform constraints

Generated iOS code often knows the API surface without respecting the platform shape.

That distinction matters.

It may know that background tasks exist. It may not honor their time limits. It may know that Keychain APIs exist. It may not handle migration, access groups, or item update semantics correctly. It may know how to request notification permission. It may not model the denied-then-settings-enabled path.

Review platform code with the boring questions:

Is this API available on every OS version we support?
Does this require an entitlement, capability, or Info.plist key?
Does it behave differently on device versus simulator?
Does it need main-thread access?
Does it survive app relaunch?
Does it fail silently under privacy restrictions?
Does this code assume the user grants permission?

AI code is often shaped by examples where the permission is granted, the simulator is friendly, and the app lifecycle is a straight line.

iOS is none of those things.

If generated code touches platform boundaries, reviewers should slow down. That is where plausible code becomes expensive code.
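
The denied-then-enabled path is a useful test of whether the code models the platform or just the demo. A minimal sketch, with loud assumptions: PermissionStatus is a hypothetical stand-in for the system's authorization status, and PermissionUI and PermissionModel are invented names. In a real app the status would be re-read from the system, for example from the notification settings, each time the app returns to the foreground.

```swift
// Hypothetical stand-in for the system's authorization status.
enum PermissionStatus {
    case notDetermined
    case denied
    case authorized
}

enum PermissionUI: Equatable {
    case askForPermission
    case pointToSettings
    case featureEnabled
}

func permissionUI(for status: PermissionStatus) -> PermissionUI {
    switch status {
    case .notDetermined: return .askForPermission
    case .denied: return .pointToSettings
    case .authorized: return .featureEnabled
    }
}

struct PermissionModel {
    private(set) var ui: PermissionUI = .askForPermission

    /// Call on launch and whenever the app becomes active again.
    /// The first answer is never cached: the user can change it in Settings.
    mutating func refresh(currentStatus: PermissionStatus) {
        ui = permissionUI(for: currentStatus)
    }
}
```

Generated code often reads the status once and treats the answer as permanent. The review question is where the second read happens.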

7. Watch for invented consistency

One subtle failure mode of AI-generated code is fake consistency.

The code may mimic names, folder structure, and patterns from nearby files well enough to look native to the project. That is useful when correct. It is dangerous when it copies the shape but not the intent.

Examples:

using the same suffix as existing types but changing ownership semantics
copying an error type pattern but losing important context
matching a view model structure while bypassing established state transitions
using the same dependency container but resolving work from the wrong layer
following an old pattern the team is actively migrating away from

Reviewers should not merely ask whether the code matches the surrounding style.

Ask whether it matches the surrounding contracts.

That means checking boundaries:

Does feature code still depend only on allowed modules?
Are domain models still separated from transport DTOs?
Are analytics and logging handled through the existing path?
Are errors mapped with the same user-facing semantics?
Does this respect the current migration direction?

Style mimicry is cheap.

Architectural consistency is not.

Generated code can do the first while quietly violating the second.

8. Require operational evidence for production paths

A lot of generated code stops at “works locally.”

That is not enough for paths that will fail in production and need diagnosis.

For important iOS flows, review should check observability:

Are meaningful errors logged?
Are privacy boundaries respected?
Can support distinguish user cancellation from server failure?
Are retries visible enough to debug?
Are analytics events stable and intentionally named?
Is there enough context to diagnose without dumping personal data?

AI-generated code may include print(error) and consider the matter closed.

That is not observability. That is a confession whispered into the simulator.
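
A small step up from print(error) is to classify the failure before it is logged. The sketch below is hypothetical throughout: FailureKind, OfflineError, ServerError, and diagnosticLine are invented names, and a real app would send the result through its existing logging path (for example a system logger with privacy annotations) rather than build a string. It shows cancellation kept apart from server failure, retries visible, and no personal data in the output.

```swift
enum FailureKind: String {
    case cancelled
    case offline
    case server
    case unknown
}

// Hypothetical app-level errors, standing in for whatever the networking layer throws.
struct OfflineError: Error {}
struct ServerError: Error {
    let statusCode: Int
}

func classify(_ error: Error) -> FailureKind {
    switch error {
    case is CancellationError: return .cancelled
    case is OfflineError: return .offline
    case is ServerError: return .server
    default: return .unknown
    }
}

/// Builds a diagnostic line from stable, non-personal fields only.
func diagnosticLine(flow: String, attempt: Int, error: Error) -> String {
    var fields = [
        "flow=\(flow)",
        "attempt=\(attempt)",
        "kind=\(classify(error).rawValue)"
    ]
    if let server = error as? ServerError {
        fields.append("status=\(server.statusCode)")
    }
    return fields.joined(separator: " ")
}
```

Note that a cancelled task is not always a user decision. If support needs to tell the two apart, the flow has to record who cancelled.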

The right level depends on the feature. A tiny UI tweak does not need a dashboard. A sync engine, purchase flow, onboarding funnel, or authentication path needs enough signal that the team can understand failures after release.

Reviewers should ask for evidence, not vibes.

If the PR changes a production-critical path, it should say how the team will know whether it worked.

9. Keep the review comments sharp

There is a lazy version of reviewing AI-generated code where every comment becomes moral commentary:

“This feels AI-generated.”
“Did you even write this?”
“AI code is always messy.”

That is not review.

That is a personality disorder with GitHub access.

Good review comments should name the risk and the required change:

This starts a new task every time the view appears, but nothing cancels older requests or ignores stale responses. Please move ownership into the model, cancel the previous task before starting a new one, and add a test for stale response handling.

Or:

This repository abstraction only has one implementation and mostly forwards calls to the client. It adds a layer without isolating risk. Please inline it unless there is a second caller or a specific test boundary we need.

Or:

The test asserts the mock response is returned, but the risky behavior is decoding a partial payload and mapping it to view state. Please test that path instead.

Notice what those comments do not mention: the model.

They review the code.

That keeps the bar high without turning the process into a culture war.

10. Use AI to improve the review, not replace it

AI can help reviewers too.

It can summarize a large diff, suggest edge cases, compare the PR against a checklist, or generate targeted test ideas. That is useful. Use it.

But do not outsource judgment.

A model can say “this looks good” with the same confidence it uses to recommend an API that was deprecated during the Obama administration. It does not know your production incidents, your least reliable backend endpoint, your history of flaky UI tests, or the one subscription edge case that cost the team a weekend.

A good review process uses AI as a second pass, not the accountable reviewer.

Practical use:

ask it to list lifecycle edge cases
ask it to identify unstructured concurrency
ask it to propose tests for failure paths
ask it to summarize architecture changes
ask it to find duplicated logic in the diff

Then a human decides what matters.

The reviewer owns the approval.

Not the tool.

11. The bar does not move

AI changes the cost of producing code.

It does not change the standard for accepting it.

That is the whole policy.

If generated code is correct, understandable, tested, well-owned, and consistent with the app’s architecture, ship it. Enjoy the saved time. Pretend to be surprised that a tool was useful.

If it is fluent but wrong, reject it.

If it is over-abstracted, cut it down.

If the author cannot explain it, send it back.

If the tests are decorative, ask for evidence.

If the concurrency story is vague, stop the line.

The team does not need a special lower bar for AI-generated code. It needs the normal bar applied with less politeness toward plausible nonsense.

That is not anti-AI.

That is engineering.

The robot may type faster.

It still does not get commit rights to your production standards.