Foundation Models in production: practical boundaries for on-device AI

How to decide which Foundation Models features belong on device, when to use Private Cloud Compute or a server model, and how to design reliable failure paths.

Foundation Models changes what iOS apps can do locally, but it does not remove the need for product judgment.

That is the first production boundary.

The framework gives Apple-platform apps a native Swift API for language tasks, structured output, tool calling, multimodal prompts, and access to different model profiles. On supported devices, the on-device model can handle useful work without a server round trip. In newer releases, Apple is also expanding the framework toward Private Cloud Compute, third-party model providers, dynamic profiles, evaluation tooling, and more agentic app experiences.

That is a large surface area.

The mistake is treating it as one generic “AI feature” bucket. Production apps need a sharper split: what should run on device, what should escalate to a larger model, what should stay deterministic, and what should not ship at all.

1. Start with the product boundary

Do not start with the model.

Start with the user task.

A useful Foundation Models feature should have a clear job:

  • summarize a local note
  • extract structured fields from visible content
  • classify a user-created item
  • rewrite text in a controlled tone
  • answer questions about app-owned data
  • turn a natural-language request into a typed app action

Those are bounded tasks. They have inputs, outputs, and failure modes you can reason about.

The risky features are vague from the start:

  • “make the app smarter”
  • “add an AI assistant”
  • “let users ask anything”
  • “automate workflows”

Those can be valid product directions, but they are not implementation scopes. They are invitations to build a loose chat surface that touches half the app and has no obvious correctness bar.

Before choosing on-device, Private Cloud Compute, or a server model, define four things:

  1. What is the user trying to accomplish?
  2. What data is the model allowed to see?
  3. What output shape does the app need?
  4. What happens when the model is wrong, unavailable, or slow?

If those answers are unclear, the model choice is not the problem yet.

2. Use on-device models for local, bounded work

On-device AI is strongest when the task is close to the user, close to app data, and tolerant of small variation.

Good candidates include:

  • summarizing short local content
  • extracting entities from user text
  • categorizing notes, messages, tasks, or records
  • generating draft copy the user can review
  • producing structured suggestions for app-owned workflows
  • answering constrained questions about a small local context

The main advantages are practical:

  • lower latency for small tasks
  • better privacy posture
  • offline or degraded-network usefulness
  • no per-request backend dependency
  • less infrastructure for simple intelligence features

Those advantages matter, but they do not remove the engineering constraints.

On-device models still have limits:

  • context size is finite
  • model availability depends on supported hardware, OS version, region, language, and Apple Intelligence settings
  • output quality can shift across OS updates
  • larger image or text prompts increase latency
  • reasoning depth is limited compared with larger cloud models

That makes on-device a strong default for narrow features, not a universal replacement for backend intelligence.

A useful rule:

Use on-device when the app can validate the output, recover from mistakes, and keep the interaction useful without pretending the model is authoritative.

3. Escalate when the task needs more context or stronger reasoning

Some features should not stay on device just because the API is convenient.

Escalate to Private Cloud Compute or a server model when the task needs:

  • a larger context window
  • stronger multi-step reasoning
  • access to server-side or cross-device data
  • current external knowledge
  • expensive retrieval or ranking
  • long-running analysis
  • model behavior you need to update independently of the OS

A support assistant inside a finance app is a different problem from extracting a merchant name from a receipt. A project-planning assistant over thousands of records is a different problem from summarizing the current screen.

For production apps, the escalation path should be explicit. The app should know why it is leaving the device boundary.

Examples:

  • The local model can summarize one document; a cloud model handles a workspace-wide synthesis.
  • The local model can classify a transaction; the server handles fraud review or policy decisions.
  • The local model can draft a response; the server model checks it against account-specific business rules.
  • The local model can extract fields from an image; a larger model handles ambiguous cases that require more context.

That split keeps the app honest. On-device handles fast local work. Cloud handles tasks that genuinely need cloud capability.
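
One way to make the escalation path explicit is to have the app compute a named reason before it leaves the device. The sketch below is illustrative only: `TaskProfile`, `EscalationReason`, and the default token budget are hypothetical app-level names and numbers, not part of the Foundation Models framework.

```swift
// Hypothetical app-level types; none of these come from the FoundationModels framework.
enum EscalationReason: String {
    case needsServerData
    case needsCurrentKnowledge
    case contextTooLarge
    case needsStrongerReasoning
}

struct TaskProfile {
    var estimatedInputTokens: Int
    var needsServerData: Bool
    var needsCurrentKnowledge: Bool
    var needsMultiStepReasoning: Bool
}

/// Returns nil when the task can stay on device, otherwise the first reason it has to leave.
/// `onDeviceTokenBudget` is an assumed, app-chosen number, not a documented model limit.
func escalationReason(for task: TaskProfile, onDeviceTokenBudget: Int = 3_000) -> EscalationReason? {
    if task.needsServerData { return .needsServerData }
    if task.needsCurrentKnowledge { return .needsCurrentKnowledge }
    if task.estimatedInputTokens > onDeviceTokenBudget { return .contextTooLarge }
    if task.needsMultiStepReasoning { return .needsStrongerReasoning }
    return nil
}

// Summarizing one document stays local; a workspace-wide synthesis does not.
let singleDocument = TaskProfile(estimatedInputTokens: 1_200, needsServerData: false,
                                 needsCurrentKnowledge: false, needsMultiStepReasoning: false)
let workspaceSynthesis = TaskProfile(estimatedInputTokens: 80_000, needsServerData: true,
                                     needsCurrentKnowledge: false, needsMultiStepReasoning: true)

let localReason = escalationReason(for: singleDocument)
let cloudReason = escalationReason(for: workspaceSynthesis)
```

Because the function returns a reason instead of a bare Boolean, the same value can feed logging, consent prompts, and analytics.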

4. Treat availability as runtime state

Foundation Models features cannot assume a single happy path.

A user may have an unsupported device. Apple Intelligence may be disabled. A region or language may not support the capability. A model profile may not be available. A future OS update may change behavior. Beta APIs may change again before final release.

That means availability belongs in product design, not just error handling.

For every model-backed feature, decide what the user sees in these states:

  • available and fast
  • available but slow
  • temporarily unavailable
  • unsupported on this device
  • disabled because Apple Intelligence is off
  • failed because the output could not be validated
  • failed because the request exceeded practical context limits

Do not bury those cases behind one generic error banner.

Better fallbacks are usually specific:

  • show the manual flow
  • let the user edit the draft directly
  • offer a smaller input range
  • defer the feature until the model is available
  • send the task to the server if the user has allowed that path
  • explain that the feature needs Apple Intelligence enabled

A feature that only works in the perfect configuration is not production-ready. It is a prototype with no designed failure paths.

5. Prefer structured output over prose parsing

If the app needs structure, ask for structure.

Foundation Models supports guided generation for Swift data structures. That matters because many app features do not need a paragraph. They need a typed result the app can inspect.

Examples:

  • a list of extracted dates
  • a suggested category
  • a classification with a confidence level
  • a proposed route or action
  • a summary with explicit sections
  • a set of editable fields

A weak implementation asks the model for text, then parses the text back into app state.

That is fragile. It turns natural language into an accidental API.

A stronger implementation defines the shape first:

  • what fields are required
  • which values are allowed
  • what can be empty
  • what needs user confirmation
  • what must be rejected if invalid

Then the UI can treat the model output as a proposal, not as truth.

This is especially important when model output drives app behavior. If a model suggests an action, the app should receive a typed action candidate. It should not scrape a sentence and hope the verb was clear.
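
As a rough illustration of "define the shape first", the sketch below starts from the point where a typed value already exists and shows the app-side review step. `ExpenseProposal`, `Category`, and `ProposalReview` are hypothetical names, and the validation rules are assumptions chosen for the example.

```swift
// Hypothetical app-level types; the rules below are illustrative assumptions.
enum Category: String {
    case groceries, travel, utilities, uncategorized   // which values are allowed
}

struct ExpenseProposal {
    var merchant: String      // required
    var category: Category    // constrained to the enum above
    var amountCents: Int      // must be rejected if invalid
    var note: String?         // may be empty
}

enum ProposalReview: Equatable {
    case reject(reason: String)
    case askUserToConfirm
    case prefillForEditing
}

/// Treats the model output as a proposal: reject it, confirm it, or prefill an editable form.
func review(_ proposal: ExpenseProposal) -> ProposalReview {
    // An empty or whitespace-only merchant counts as missing.
    if proposal.merchant.allSatisfy({ $0.isWhitespace }) {
        return .reject(reason: "merchant is missing")
    }
    if proposal.amountCents <= 0 {
        return .reject(reason: "amount must be positive")
    }
    // The model could not pick a category, so the user decides.
    if proposal.category == .uncategorized {
        return .askUserToConfirm
    }
    return .prefillForEditing
}

let proposal = ExpenseProposal(merchant: "Corner Cafe", category: .uncategorized,
                               amountCents: 1_840, note: nil)
let outcome = review(proposal)
```

Nothing in this path parses a sentence. The only strings involved are field values and rejection reasons.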

6. Keep tool calling narrow and auditable

Tool calling is useful because it lets the model work with app capabilities instead of only generating text.

It is also where many teams will create avoidable risk.

A tool should be small, explicit, and boring:

  • search local notes
  • fetch a specific record
  • create a draft task
  • calculate a value
  • look up visible app state
  • propose a route

A tool should not be a vague escape hatch called performUserRequest.

Each tool needs clear rules:

  • what input it accepts
  • what data it can read
  • whether it can mutate state
  • whether it requires confirmation
  • how errors are reported
  • how calls are logged for debugging

For most production apps, model-called tools should default to read-only. Mutating tools need tighter gates:

  • create drafts instead of final records
  • require explicit confirmation before destructive actions
  • show the planned action in normal UI language
  • make the result undoable when possible
  • record enough diagnostic context to debug failures

The model can propose. The app should decide.

That distinction is not philosophical. It is how you avoid a support queue full of “the assistant changed the wrong thing” reports.
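
A small sketch of "the model can propose, the app should decide": every tool declares whether it mutates state, and the app applies a gate and records the call before anything runs. `ToolDescriptor`, `ToolGate`, and `ToolAuditLog` are hypothetical names for illustration and are not the framework's tool API.

```swift
// Hypothetical app-level types; this is not the FoundationModels tool-calling API.
struct ToolDescriptor {
    let name: String
    let mutatesState: Bool
    let isDestructive: Bool
}

enum ToolGate: String {
    case runImmediately       // read-only tools
    case createDraftOnly      // mutating tools produce drafts, not final records
    case requireConfirmation  // destructive tools wait for the user
}

func gate(for tool: ToolDescriptor) -> ToolGate {
    guard tool.mutatesState else { return .runImmediately }
    return tool.isDestructive ? .requireConfirmation : .createDraftOnly
}

struct ToolAuditLog {
    private(set) var entries: [String] = []

    /// Records the requested call and returns the gate the app must honor.
    mutating func record(_ tool: ToolDescriptor, arguments: [String: String]) -> ToolGate {
        let decision = gate(for: tool)
        // Sort the arguments so log lines are stable and easy to compare.
        let args = arguments.sorted { $0.key < $1.key }
            .map { "\($0.key)=\($0.value)" }
            .joined(separator: ",")
        entries.append("\(tool.name)(\(args)) -> \(decision.rawValue)")
        return decision
    }
}

let searchNotes = ToolDescriptor(name: "searchLocalNotes", mutatesState: false, isDestructive: false)
let createTask = ToolDescriptor(name: "createDraftTask", mutatesState: true, isDestructive: false)
let deleteRecord = ToolDescriptor(name: "deleteRecord", mutatesState: true, isDestructive: true)

var auditLog = ToolAuditLog()
let deleteDecision = auditLog.record(deleteRecord, arguments: ["id": "42"])
```

In this sketch the gate depends only on the tool's declaration, never on the model's wording, which keeps the rule easy to audit.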

7. Design for latency, tokens, and battery

Model-backed features have performance budgets too.

They are just easier to ignore because the work feels abstract.

On-device requests can still affect:

  • launch responsiveness
  • scrolling smoothness
  • memory pressure
  • battery use
  • thermal state
  • perceived reliability

Multimodal prompts make this more obvious. Passing images into a model is powerful, but large images consume more context and increase latency. If the feature only needs text from a label or document, use OCR or a narrower Vision pipeline first. Do not send a full-resolution image into a language model because it made the prototype simpler.

The same applies to text.

A production feature should avoid dumping entire records into the prompt when a smaller representation is enough. Summarize, select, or retrieve only the relevant context. Measure the result.
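
Here is a minimal sketch of "select only the relevant context", assuming the app already has some relevance score per snippet. `Snippet` and `selectContext` are hypothetical names, and the character count is only a rough stand-in for a real token budget.

```swift
// Hypothetical helper; character count is a crude proxy for tokens.
struct Snippet {
    let text: String
    let relevance: Double   // assumed to come from the app's own ranking
}

/// Greedily keeps the most relevant snippets that still fit in `maxCharacters`.
func selectContext(from snippets: [Snippet], maxCharacters: Int) -> [Snippet] {
    var remaining = maxCharacters
    var chosen: [Snippet] = []
    for snippet in snippets.sorted(by: { $0.relevance > $1.relevance }) {
        let size = snippet.text.count
        if size <= remaining {
            chosen.append(snippet)
            remaining -= size
        }
    }
    return chosen
}

// Placeholder content: three snippets of 400, 300, and 250 characters.
let snippets = [
    Snippet(text: String(repeating: "a", count: 400), relevance: 0.2),
    Snippet(text: String(repeating: "b", count: 300), relevance: 0.9),
    Snippet(text: String(repeating: "c", count: 250), relevance: 0.6),
]

// With a 600-character budget, the two most relevant snippets fit and the third is dropped.
let context = selectContext(from: snippets, maxCharacters: 600)
```

The budget is an explicit parameter, so it can be measured and tuned per device class instead of being discovered in production.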

A reasonable performance checklist:

  • What is the expected response time on the slowest supported device?
  • What is the maximum input size?
  • What happens when the request is canceled?
  • Does model work keep running after the user leaves the screen that requested it?
  • Are repeated requests deduplicated or cached where appropriate?
  • Is the user blocked, or can they keep working?

If the feature makes the app feel slower, users will not care that the model ran locally. They will just think the app got worse.

8. Build evaluation into the workflow

AI features need tests, and ordinary unit tests are not enough.

You need examples.

Keep a small evaluation set for every model-backed behavior:

  • common happy paths
  • short inputs
  • long inputs
  • malformed inputs
  • sensitive edge cases
  • unsupported language or region cases
  • examples where the model should refuse or fall back
  • examples where the app must ask for confirmation

Then run that set when prompts, schemas, tools, model profiles, or OS versions change.

The goal is not to prove the model is perfect. It will not be.

The goal is to catch regressions before they become product behavior.
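
An evaluation set can start as a list of named cases and a loop. In the sketch below, `EvalCase`, `runEvaluation`, and `stubClassifier` are hypothetical names. The stub stands in for the model-backed call so the example runs anywhere.

```swift
// Hypothetical evaluation harness; the classifier is passed in as a closure.
struct EvalCase {
    let name: String
    let input: String
    let expected: String?   // nil means "the feature should fall back"
}

struct EvalReport {
    let passed: Int
    let failures: [String]
}

func runEvaluation(_ cases: [EvalCase], using classify: (String) -> String?) -> EvalReport {
    var failures: [String] = []
    for testCase in cases where classify(testCase.input) != testCase.expected {
        failures.append(testCase.name)
    }
    return EvalReport(passed: cases.count - failures.count, failures: failures)
}

// Stand-in for the model-backed classifier. A real run would call the model here.
func stubClassifier(_ input: String) -> String? {
    if input.isEmpty { return nil }
    return input.lowercased().hasPrefix("flight") ? "travel" : "other"
}

let cases = [
    EvalCase(name: "happy path", input: "Flight to Lisbon", expected: "travel"),
    EvalCase(name: "empty input falls back", input: "", expected: nil),
    EvalCase(name: "train booking", input: "Train to Porto", expected: "travel"),
]

let report = runEvaluation(cases, using: stubClassifier)
// The stub misses the train case, so it shows up by name in report.failures.
```

Reporting failures by name matters more than the pass count. When a prompt or OS version changes, the useful question is which cases moved.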

For structured outputs, evaluate:

  • validity
  • completeness
  • overconfident guesses
  • missing required fields
  • wrong category choices
  • unsafe action proposals

For prose outputs, evaluate:

  • factual grounding
  • tone
  • length
  • whether the user can act on it
  • whether it invents details outside the provided context

Do not ship model-backed behavior that nobody can evaluate. That is not innovation. It is unmeasured product behavior.

9. Use a simple decision matrix

When deciding where a Foundation Models feature belongs, use a simple matrix.

It keeps the decision grounded.

Keep it deterministic

Use normal code when:

  • the rule is known
  • correctness matters more than flexibility
  • the output must be exact
  • the task is already cheap and reliable without a model

Examples: permissions, pricing, eligibility, validation, account state, security decisions.

Use on-device Foundation Models

Use on-device when:

  • the input is local and bounded
  • privacy or offline behavior matters
  • latency needs to be low
  • the app can validate or constrain the result
  • the user remains in control

Examples: summarization, extraction, classification, local draft generation, small workflow suggestions.

Use Private Cloud Compute or a server model

Escalate when:

  • the task needs more context
  • reasoning quality matters more than local execution
  • the model needs current or server-side data
  • output quality must be tuned independently of OS releases
  • the workload is too heavy for the device

Examples: workspace-wide analysis, complex planning, policy-aware generation, support workflows, large-context reasoning.

Do not build it yet

Wait when:

  • the task is vague
  • the failure mode is unacceptable
  • there is no evaluation set
  • the UI cannot explain or recover from mistakes
  • the feature mainly exists because the framework is new

That last one will be common.

New APIs create pressure to ship something visible. Production apps should resist that pressure unless the user benefit is concrete.
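
The matrix can also be written down as a small function, which forces the team to agree on the order of the questions. This is a sketch under assumptions: `FeatureAssessment` and `Placement` are hypothetical names, the Boolean fields compress the bullets above, and the order of the checks is one reasonable choice among several.

```swift
// Hypothetical types that compress the matrix above into a few yes/no answers.
enum Placement: Equatable {
    case deterministicCode
    case onDevice
    case escalate
    case doNotBuildYet
}

struct FeatureAssessment {
    var ruleIsKnownAndExact: Bool
    var taskIsClearlyScoped: Bool
    var hasEvaluationSet: Bool
    var canRecoverFromMistakes: Bool
    var inputIsLocalAndBounded: Bool
    var needsServerDataOrLargeContext: Bool
}

func placement(for feature: FeatureAssessment) -> Placement {
    // Known, exact rules never need a model.
    if feature.ruleIsKnownAndExact { return .deterministicCode }
    // Vague, unevaluated, or unrecoverable features wait.
    guard feature.taskIsClearlyScoped,
          feature.hasEvaluationSet,
          feature.canRecoverFromMistakes else {
        return .doNotBuildYet
    }
    if feature.needsServerDataOrLargeContext || !feature.inputIsLocalAndBounded {
        return .escalate
    }
    return .onDevice
}

let noteSummary = FeatureAssessment(ruleIsKnownAndExact: false, taskIsClearlyScoped: true,
                                    hasEvaluationSet: true, canRecoverFromMistakes: true,
                                    inputIsLocalAndBounded: true, needsServerDataOrLargeContext: false)
let noteSummaryPlacement = placement(for: noteSummary)
```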

10. The production standard is trust

Foundation Models can make iOS apps more capable, especially when the work is local, structured, and integrated into existing app flows.

But the production standard is not whether the feature feels impressive in a demo.

The standard is whether users can trust it:

  • trust what data it uses
  • trust what it can and cannot do
  • trust that important actions require confirmation
  • trust that failures have a clear recovery path
  • trust that the app remains fast and predictable

That is where the architecture matters.

Keep deterministic rules deterministic. Use on-device models for bounded local intelligence. Escalate deliberately when the task needs more than the device can provide. Evaluate the behavior like it is part of the product, because it is.

The good Foundation Models features will not feel like a separate AI layer bolted onto the app. They will feel like the app got better at helping the user finish a specific job.

That is the bar.