Feature flags: safe rollouts without shipping fear
A practical feature flag setup for iOS: separate release, experiment, and kill-switch flags, keep evaluation deterministic, and avoid the cleanup debt that turns rollout safety into product entropy.
Feature flags are useful right up until they become your second configuration system.
That usually happens in the same predictable way:
- flags start as a safe rollout tool
- then product wants experiments
- then support needs kill switches
- then engineering adds local overrides
- then nobody can explain why a screen behaves differently for two users on the same build
The problem is not feature flags themselves. The problem is pretending all flags are the same.
If you want safer releases without turning the app into a haunted house, you need a small system with hard rules.
1. Start by splitting flags into categories
The first mistake is one giant FeatureFlags bag with thirty booleans in it.
Not all flags serve the same purpose. Treat them differently.
A practical split is:
- Release flags: gradually expose already-built functionality.
- Experiment flags: bucket users into variants and measure results.
- Operational flags: kill switches and degradation controls for incidents.
- Developer flags: local-only overrides for testing and QA.
Those categories should have different expectations.
For example:
- release flags should have an owner and an expiry date
- experiment flags should have variant definitions and analytics wiring
- operational flags should default to the safest behavior if config fetch fails
- developer flags should never be read from production backend config
If you mix these together, you get terrible decisions like using an A/B test flag as an emergency outage switch. That works exactly once.
2. Keep flag evaluation boring and deterministic
A flag system should not feel clever. It should be easy to answer:
- what value did this user get?
- why did they get it?
- when can this flag be deleted?
A simple model is enough:
import Foundation

enum FlagKey: String, CaseIterable, Sendable {
    case paywallV2
    case onboardingExperiment
    case disableImageUploads
}

enum FlagValue: Sendable, Equatable {
    case bool(Bool)
    case string(String)
    case int(Int)
}

struct FlagSnapshot: Sendable {
    let values: [FlagKey: FlagValue]
    let fetchedAt: Date
    let source: Source

    enum Source: Sendable {
        case bundledDefaults
        case remoteConfig
        case localOverride
    }
}
And an evaluator that is explicit about precedence:
struct FlagEvaluator: Sendable {
    let localOverrides: [FlagKey: FlagValue]
    let remoteValues: [FlagKey: FlagValue]
    let defaults: [FlagKey: FlagValue]

    func value(for key: FlagKey) -> FlagValue {
        if let local = localOverrides[key] { return local }
        if let remote = remoteValues[key] { return remote }
        return defaults[key] ?? .bool(false)
    }

    func isEnabled(_ key: FlagKey) -> Bool {
        guard case .bool(let enabled) = value(for: key) else { return false }
        return enabled
    }
}
The interesting part is not the code. The interesting part is the contract:
- local override wins for internal testing
- remote config wins over bundled defaults
- defaults are safe and shippable
- evaluation does not depend on mutable view state
If the answer can change halfway through a screen because three objects re-read config differently, you are not doing flags. You are doing runtime chaos.
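The contract can be reduced to a single expression. This sketch mirrors the FlagEvaluator precedence above in isolation, using plain optionals for brevity:

```swift
// First non-nil layer wins: local override, then remote, then bundled default.
func resolve(local: Bool?, remote: Bool?, bundledDefault: Bool) -> Bool {
    local ?? remote ?? bundledDefault
}

// A QA override of false beats a remote rollout value of true:
let qaValue = resolve(local: false, remote: true, bundledDefault: false)   // false
// With no override, remote beats the bundled default:
let liveValue = resolve(local: nil, remote: true, bundledDefault: false)   // true
```

If you can explain the whole system with one nil-coalescing chain, you are in good shape.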
3. Use typed access, not string soup
Stringly-typed flags feel fast until someone renames new_checkout_flow to checkout_v2, misses one usage, and quietly ships dead logic.
Typed keys are not glamorous, but they prevent a lot of accidental mess.
Bad:
if remoteConfig.bool(forKey: "new_checkout_flow") {
    // ...
}
Better:
if flags.isEnabled(.paywallV2) {
    // ...
}
That buys you a few important things:
- one place to inventory every active flag
- compile-time help during cleanup
- easier code search when the rollout is done
- a better chance of noticing duplicated flags with slightly different names
Feature flags already increase branching. Do not also make the names sloppy.
4. Separate fetch from evaluation
Another common mistake: UI code asks the remote config SDK directly whether something is enabled.
That leaks infrastructure decisions into product code and makes testing annoying.
Instead, split the system into three parts:
- Provider fetches raw config.
- Store persists the latest snapshot.
- Evaluator answers product-level questions.
Something like this is plenty:
protocol FlagProvider: Sendable {
    func fetch() async throws -> [FlagKey: FlagValue]
}

protocol FlagStore: Sendable {
    func load() async -> FlagSnapshot?
    func save(_ snapshot: FlagSnapshot) async
}

actor FlagsService {
    private let provider: FlagProvider
    private let store: FlagStore
    private let defaults: [FlagKey: FlagValue]
    private var snapshot: FlagSnapshot

    init(
        provider: FlagProvider,
        store: FlagStore,
        defaults: [FlagKey: FlagValue]
    ) {
        self.provider = provider
        self.store = store
        self.defaults = defaults
        self.snapshot = FlagSnapshot(values: defaults, fetchedAt: .distantPast, source: .bundledDefaults)
    }

    func refresh() async {
        do {
            let values = try await provider.fetch()
            let snapshot = FlagSnapshot(values: values, fetchedAt: .now, source: .remoteConfig)
            self.snapshot = snapshot
            await store.save(snapshot)
        } catch {
            // Keep the last known good snapshot. Reliability beats drama.
        }
    }

    func evaluator(localOverrides: [FlagKey: FlagValue] = [:]) -> FlagEvaluator {
        FlagEvaluator(
            localOverrides: localOverrides,
            remoteValues: snapshot.values,
            defaults: defaults
        )
    }
}
Now the rest of the app depends on FlagEvaluator, not on your config vendor.
That matters more than it sounds. Vendors change. Your app architecture should not.
5. Design for stale config, because it will be stale
A lot of teams talk about “real-time feature flags” as if every app session has perfect network and instant config delivery.
On mobile, assume this instead:
- the app may launch offline
- config may fail to fetch
- the user may keep the app installed for months without updating
- the last good snapshot might be hours or days old
So every important flag needs a stale-config policy.
Release flags
Usually safe to use the last known value, provided the default also produces a coherent product.
Experiment flags
Need stable bucketing. A user should not bounce between A and B because config refresh timing changed.
Operational flags
Need explicit safe behavior.
For example, if uploads are causing crashes in one subsystem, this is a good operational flag:
- remote value true → disable uploads
- default value false → feature works when there is no incident
- incident runbook explains when to flip it and what user impact to expect
What you do not want is an operational flag whose safe value is unclear because half the app assumes the opposite default.
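These policies can be encoded instead of living as tribal knowledge. A minimal sketch, with illustrative thresholds rather than recommendations:

```swift
import Foundation

// Per-category staleness policy. The 24-hour cutoff for operational flags
// is an illustrative assumption, not a recommendation.
enum FlagCategory {
    case release, experiment, operational
}

func shouldTrustSnapshot(fetchedAt: Date, category: FlagCategory, now: Date = .now) -> Bool {
    let age = now.timeIntervalSince(fetchedAt)
    switch category {
    case .release:
        return true                // last known value is acceptable
    case .experiment:
        return true                // bucketing is persisted separately, so staleness is fine
    case .operational:
        return age < 24 * 60 * 60  // after a day, fall back to safe defaults
    }
}
```

The specific numbers matter less than having the decision written down per category.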
6. Keep bucketing stable or your experiments are fiction
If you run experiments, assign variants from a stable identifier and persist the result.
Do not:
- bucket from something ephemeral like session start
- rebucket every launch
- let local time or fetch timing affect the variant
A minimal deterministic bucketing approach:
import Foundation
import CryptoKit

func bucket(userID: String, experiment: String, modulo: Int = 100) -> Int {
    let input = Data("\(experiment):\(userID)".utf8)
    let digest = SHA256.hash(data: input)
    let value = digest.prefix(8).reduce(0 as UInt64) { partial, byte in
        (partial << 8) | UInt64(byte)
    }
    return Int(value % UInt64(modulo))
}

func variant(for userID: String) -> String {
    bucket(userID: userID, experiment: "onboardingExperiment") < 50 ? "control" : "treatment"
}
It does not need to be complicated. It needs to be stable.
If your experiment assignment is unstable, your analytics are theater with charts.
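One way to guarantee stability is to persist the first computed variant so later config refreshes cannot move the user. A sketch, where the UserDefaults storage and the "experiment.variant." key prefix are assumptions:

```swift
import Foundation
import CryptoKit

// Compute the variant once, store it, and return the stored value forever after.
func persistedVariant(userID: String, experiment: String, defaults: UserDefaults = .standard) -> String {
    let key = "experiment.variant.\(experiment)"
    if let stored = defaults.string(forKey: key) {
        return stored
    }
    // Same deterministic hashing as the bucket(userID:experiment:) function above.
    let input = Data("\(experiment):\(userID)".utf8)
    let digest = SHA256.hash(data: input)
    let value = digest.prefix(8).reduce(0 as UInt64) { ($0 << 8) | UInt64($1) }
    let variant = value % 100 < 50 ? "control" : "treatment"
    defaults.set(variant, forKey: key)
    return variant
}
```

Persisting the assignment also makes it visible in diagnostics, which helps when analytics and observed behavior disagree.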
7. Put kill switches on the edges, not in every view
Operational flags should degrade systems, not infect every screen with incident logic.
For example, imagine image uploads are causing memory spikes. Do this:
import Foundation

// Minimal supporting types so the example compiles.
struct PendingUpload: Sendable {
    let id: UUID
    let fileURL: URL
}

enum UploadError: Error {
    case temporarilyDisabled
}

struct UploadPolicy: Sendable {
    let uploadsEnabled: Bool
    let maxConcurrentUploads: Int
}

final class UploadCoordinator {
    private let policy: UploadPolicy

    init(policy: UploadPolicy) {
        self.policy = policy
    }

    func enqueue(_ item: PendingUpload) throws {
        guard policy.uploadsEnabled else {
            throw UploadError.temporarilyDisabled
        }
        // normal queueing flow
    }
}
Do not do this in fifteen places:
- hide one button here
- show one warning there
- special-case retries elsewhere
- bypass the queue in another code path because “that screen is different”
If a flag changes system behavior, put it near the system boundary.
That gives you one place to test and one place to reason about during an incident.
8. Add observability or you are guessing
You need to be able to answer these questions without archaeology:
- Which flags were active for this user?
- Which snapshot version did the app evaluate?
- Was the value remote, default, or local override?
- When was the snapshot fetched?
You do not need to log every flag on every tap. You do need enough context in analytics and bug reports to reconstruct behavior.
A practical setup:
- include key rollout flags in screen-view and conversion events
- attach snapshot metadata to debug logs
- expose active flags in an internal debug screen
- include flag state in support-exported diagnostics if privacy allows it
The debug screen is boring, and that is precisely why it is valuable.
When QA says “I can’t reproduce it,” a flag panel usually explains why faster than another meeting will.
9. Write deletion into the process
Feature flags are a temporary safety tool.
Teams get into trouble when they treat them as permanent architecture.
Every new flag should have:
- an owner
- a reason it exists
- a creation date
- a delete condition
- a delete-by date if possible
I like a tiny metadata structure for this:
struct FlagMetadata: Sendable {
    let owner: String
    let category: Category
    let createdAt: Date
    let deleteWhen: String

    enum Category: String, Sendable {
        case release
        case experiment
        case operational
        case developer
    }
}
This can live alongside the typed flag definitions or in a simple internal document. The exact format matters less than forcing the question.
If nobody knows when a flag should die, it will survive long enough to distort the codebase.
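Forcing the question can even be automated. A sketch of a small audit over hypothetical FlagRecord entries, which mirror the metadata above plus an optional delete-by date:

```swift
import Foundation

// Illustrative record type: name, owner, and an optional deadline.
struct FlagRecord {
    let name: String
    let owner: String
    let deleteBy: Date?
}

// Returns human-readable descriptions of flags past their delete-by date.
func overdueFlags(_ records: [FlagRecord], now: Date = .now) -> [String] {
    records
        .filter { record in
            guard let deadline = record.deleteBy else { return false }
            return deadline < now
        }
        .map { "\($0.name) (owner: \($0.owner))" }
}
```

Run something like this in CI or a weekly job and cleanup stops depending on anyone's memory.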
The cleanup rule worth enforcing
When a rollout reaches 100% and the feature is healthy, the next task is not “celebrate.”
The next task is:
- delete the old path
- delete the flag
- delete the analytics branch that only existed for rollout comparison
- delete the local override if it is no longer useful
Otherwise “safe rollout” turns into “permanent branch complexity with a nice origin story.”
10. Keep local overrides behind a deliberate debug surface
Developers and QA need local overrides. Production users do not.
A decent setup usually means:
- debug-only flag menu
- optional persistence for local testing
- clear reset button
- obvious label showing an override is active
The important bit is social, not technical: local overrides should be visible.
Nothing wastes time like debugging a “production issue” that only exists because someone left paywallV2 = true in a hidden settings screen three days ago.
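A minimal override store with an explicit reset is enough. In this sketch the UserDefaults key is an assumption, and in a real app the whole type would sit behind #if DEBUG so it cannot ship to production:

```swift
import Foundation

// Debug-only local override storage with a single, obvious reset path.
struct DebugOverrides {
    private static let key = "debug.flag.overrides"
    let defaults: UserDefaults

    func set(_ flag: String, enabled: Bool) {
        var current = defaults.dictionary(forKey: Self.key) as? [String: Bool] ?? [:]
        current[flag] = enabled
        defaults.set(current, forKey: Self.key)
    }

    // Everything currently overridden, for display in the debug flag menu.
    func all() -> [String: Bool] {
        defaults.dictionary(forKey: Self.key) as? [String: Bool] ?? [:]
    }

    func reset() {
        defaults.removeObject(forKey: Self.key)
    }
}
```

Because every override lives under one key, the debug screen can render an "overrides active" banner whenever all() is non-empty.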
11. The architecture that usually holds up
If I had to keep it compact, I would use this shape:
- Bundled defaults shipped with the app
- Remote snapshot fetched on launch and occasionally refreshed
- Typed evaluator used by product code
- Debug overrides layered only in internal builds
- Metadata + cleanup policy to keep the system small
That is enough for most iOS teams.
You do not need a feature-management platform before you have feature-management discipline.
12. A blunt checklist before adding a new flag
Before introducing a flag, ask:
- Is this a release flag, experiment, operational switch, or developer override?
- What is the safe default if config never arrives?
- Who owns it?
- How will I debug its current value on a real device?
- When does it get deleted?
- Can the branch live at a system boundary instead of leaking through the UI?
If you cannot answer those quickly, the flag is not ready.
Final take
Feature flags are worth having because they reduce release risk.
They stop being worth having when they quietly become unmanaged branching infrastructure.
Keep the categories explicit. Keep evaluation deterministic. Put kill switches at boundaries. Delete flags aggressively.
That is how you get the upside: safer rollouts, calmer incidents, and fewer Friday-night “why is this user seeing that screen?” conversations.