System architecture design process

Iterative System Design — Thinking Framework

This framework structures system design as a series of iterations. Instead of designing for scale and complexity upfront, the process begins with the simplest possible system and progressively improves it. The approach is inspired by Google’s Non-Abstract Large Scale Design (NALSD) and extended to explicitly cover operability and system management.

At each iteration, pressure is applied, limitations are identified, and the design is adapted. Mechanisms such as caching, asynchronous processing, or partitioning are introduced only when a concrete bottleneck appears.

This keeps the design incremental, grounded in real constraints, and resistant to unnecessary complexity.

  1. Iteration 0 — Reality Anchor

    • Goal: Remove abstraction. Define a concrete minimal system.

    • Core Question: What exactly are we building, and how does a user interact with it?

    • Define explicitly:

      • 1–3 concrete user flows
      • Core data entities
      • Basic request/response flow
      • Any obvious constraints (consistency expectations, geography, etc.)
    • Design posture:

      • Simplest possible architecture
      • Prefer a single service
      • Prefer a single datastore
      • No caching
      • No async processing
      • No scaling mechanisms
    • Assumption: The system serves a small number of users at a low request rate.

    • Exit condition: A clear design exists for the simplest version of the system.
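
The Iteration 0 posture can be made concrete with a minimal sketch. The "note" entity, the `NoteService` class, and its two user flows are hypothetical illustrations, not part of the framework:

```python
# Iteration 0 posture: one service, one datastore (an in-memory dict),
# no caching, no async processing, no scaling mechanisms.
import itertools


class NoteService:
    """A single service fronting a single datastore."""

    def __init__(self):
        self._store = {}                    # the one datastore
        self._next_id = itertools.count(1)  # simple id generation

    def create_note(self, text):
        """User flow 1: create a note and return its id."""
        note_id = next(self._next_id)
        self._store[note_id] = text
        return note_id

    def get_note(self, note_id):
        """User flow 2: fetch a note by id."""
        return self._store[note_id]
```

Everything later in the framework is an adaptation of a design this simple, introduced only once a concrete bottleneck is named.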

  2. Iteration 1 — Load & Bottlenecks

    • Goal: Improve performance and scalability as system pressure increases.

    • Core Question: As usage grows, what becomes the bottleneck, and how should the system adapt?

    • 2.1 Apply pressure

      • Increase usage assumptions
      • Increase request rate
      • Increase dataset size
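
Applying pressure usually starts with back-of-envelope numbers, in the NALSD style. A sketch under hypothetical assumptions (1M daily users, 10 requests per user per day, a 3x peak factor):

```python
# NALSD-style napkin math: turn usage assumptions into a request rate.
# All numbers below are hypothetical assumptions for illustration.
DAILY_USERS = 1_000_000
REQUESTS_PER_USER_PER_DAY = 10
SECONDS_PER_DAY = 86_400
PEAK_FACTOR = 3  # peak traffic relative to the daily average

avg_qps = DAILY_USERS * REQUESTS_PER_USER_PER_DAY / SECONDS_PER_DAY
peak_qps = avg_qps * PEAK_FACTOR

print(f"average ~{avg_qps:.0f} QPS, peak ~{peak_qps:.0f} QPS")
```

Comparing estimates like these against what a single instance can serve is what turns "apply pressure" into a named bottleneck.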
    • 2.2 Identify bottleneck

      • Database load
      • CPU saturation
      • Memory pressure
      • Network overhead
      • Contention hotspots
      • Large payloads
    • 2.3 Adapt system

      • Introduce a mechanism only if it addresses the identified bottleneck.

      • Typical patterns:

        • Caching (read caching, write-through, distributed cache)
        • Edge distribution (CDN, edge caches, regional replicas, edge compute)
        • Data access optimization (read replicas, partitioning/sharding)
        • Read/write separation (CQRS, specialized read models, dedicated read services)
        • Workload decoupling (async processing jobs, message queues, event-driven flows)
        • Service decomposition (extract services, microservices, domain boundaries, independent scaling)
    • 2.4 Repeat

      • Apply more pressure
      • Identify the next bottleneck
      • Adapt again
    • Rule: No mechanism without a named bottleneck.

    • Exit condition: The design handles the projected load, and every mechanism introduced traces back to a named bottleneck.
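
The first adaptation pattern in 2.3, read caching, can be sketched as a read-through cache in front of a slower backing store. `ReadThroughCache` and the fake `db_get` backend are hypothetical names for illustration:

```python
class ReadThroughCache:
    """Serve reads from memory; go to the backend only on a cache miss."""

    def __init__(self, backend_get):
        self._backend_get = backend_get  # e.g. a database lookup
        self._cache = {}

    def get(self, key):
        if key not in self._cache:       # miss: fall through to backend
            self._cache[key] = self._backend_get(key)
        return self._cache[key]


# Hypothetical slow backend; records how often it is actually hit.
backend_calls = []

def db_get(key):
    backend_calls.append(key)
    return f"row-{key}"

cache = ReadThroughCache(db_get)
for _ in range(3):
    cache.get("a")   # three reads, but only one backend hit
```

A real cache also needs eviction and invalidation policies; that added complexity is exactly why the rule demands a named bottleneck before introducing one.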

  3. Iteration 2 — Failure Behavior

    • Goal: Define how the system behaves when parts fail.

    • Core Question: When components fail, how does the system respond?

    • Consider failures such as:

      • Service instance failure
      • Database unavailability
      • Cache failure
      • Dependency timeouts
    • Typical patterns:

      • Failure control (timeouts, retries with backoff, circuit breakers)
      • Graceful degradation (reduced functionality, stale data, partial responses)
      • Redundancy (replication, multiple instances)
      • Recovery (backup and restore strategies)
      • Safe retries (idempotent operations)
    • Rule: Every dependency must have a defined failure behavior.

    • Exit condition: Failure scenarios and system responses are clearly defined.
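
Two of the patterns above, retries with backoff and safe retries, can be sketched together. `call_with_retries` is a hypothetical helper, and it is only safe when the wrapped operation is idempotent:

```python
import time


def call_with_retries(op, attempts=3, base_delay=0.01):
    """Retry an idempotent operation with exponential backoff."""
    for attempt in range(attempts):
        try:
            return op()
        except TimeoutError:
            if attempt == attempts - 1:
                raise                           # budget exhausted: fail loudly
            time.sleep(base_delay * (2 ** attempt))


# Hypothetical flaky dependency: times out twice, then succeeds.
calls = []

def flaky_fetch():
    calls.append(1)
    if len(calls) < 3:
        raise TimeoutError("dependency timed out")
    return "ok"

result = call_with_retries(flaky_fetch)
```

Production implementations typically add jitter to the delay and cap the retry budget, so that synchronized retries do not amplify the original failure.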

  4. Iteration 3 — Operability

    • Goal: Make the system observable and manageable in production.

    • Core Question: How do we know the system is healthy and how do we intervene when it is not?

    • Define:

      • Signals that indicate system health
      • Signals that indicate degradation
    • Typical patterns:

      • Observability (metrics, logging, distributed tracing)
      • Alerting (alerts on latency, errors, saturation)
      • Health management (health checks, service monitoring)
      • Deployment safety (canary releases, blue/green deployments)
      • Recovery mechanisms (rollback strategies)
    • Rule: If you cannot detect degradation, you cannot control the system.

    • Exit condition: Operators can detect issues and respond effectively.
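
The health and degradation signals can be made concrete as a tiny classifier over a metrics sample. The signal names and thresholds here (`p99_latency_ms`, `error_rate`, a 200 ms SLO, a 1% error budget) are hypothetical examples:

```python
def health_status(metrics, latency_slo_ms=200, max_error_rate=0.01):
    """Classify a metrics sample as healthy or degraded.

    Degradation is any breach of the latency SLO or the error
    budget -- the same conditions an alert would fire on.
    """
    if metrics["p99_latency_ms"] > latency_slo_ms:
        return "degraded"
    if metrics["error_rate"] > max_error_rate:
        return "degraded"
    return "healthy"
```

In practice these thresholds come from user-facing objectives, and the same conditions drive both dashboards and alerting, so detection and intervention share one definition of "degraded".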