System architecture design process

Iterative System Design — Thinking Framework

This framework structures system design as a series of iterations. Instead of designing for scale and complexity upfront, the process begins with the simplest possible system and progressively improves it. The approach is inspired by Google’s Non-Abstract Large System Design (NALSD) and extended to explicitly cover operability and system management.

At each iteration, pressure is applied, limitations are identified, and the design is adapted. Mechanisms such as caching, asynchronous processing, or partitioning are introduced only when a concrete bottleneck appears.

This keeps the design incremental, grounded in real constraints, and resistant to unnecessary complexity.

Solution Architecture Design Process

Goal of the exercise

Output

Final C2 (container-level) diagram
Trade-offs (list)
Key decisions

The goal is to produce a clear, justified architecture. Everything else exists to derive or validate that.

Step 0 — Capture requirements

Principle: Requirements Must Be Explicit Before Designing

Output

User Journey cards
Constraints / NFR card
C1 context diagram

Discover the requirements by asking the product owner or interviewer focused questions until the main user journeys, constraints, and system scope are clear.

What follows defines the format of the requirement artifacts.

User Journey cards

Format:

Name
Goal
Steps
Critical rules
Notes

Example:

User Journey — Patient books appointment

Goal: Book an available slot with a doctor

Steps

Patient views available slots
Patient selects a slot
System checks availability
System creates appointment
System marks slot as booked
System confirms booking

Critical rules

A slot cannot be double-booked
An appointment belongs to one patient and one doctor

Notes

Booking should be fast
Notifications can be asynchronous
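
The card format above can also be kept as structured data, so that journeys are easy to review and walk through in Step 1. The following is a minimal sketch, not a prescribed tool: the class name `UserJourneyCard` and its `render` method are assumptions made for illustration, while the field names mirror the card format.

```python
from dataclasses import dataclass, field


@dataclass
class UserJourneyCard:
    """One requirement card; fields mirror the card format (hypothetical class)."""
    name: str
    goal: str
    steps: list[str]
    critical_rules: list[str] = field(default_factory=list)
    notes: list[str] = field(default_factory=list)

    def render(self) -> str:
        """Render the card as plain text, numbering the steps."""
        lines = [f"User Journey: {self.name}", f"Goal: {self.goal}", "Steps:"]
        lines += [f"  {i}. {step}" for i, step in enumerate(self.steps, start=1)]
        lines.append("Critical rules:")
        lines += [f"  - {rule}" for rule in self.critical_rules]
        lines.append("Notes:")
        lines += [f"  - {note}" for note in self.notes]
        return "\n".join(lines)


card = UserJourneyCard(
    name="Patient books appointment",
    goal="Book an available slot with a doctor",
    steps=[
        "Patient views available slots",
        "Patient selects a slot",
        "System checks availability",
        "System creates appointment",
        "System marks slot as booked",
        "System confirms booking",
    ],
    critical_rules=[
        "A slot cannot be double-booked",
        "An appointment belongs to one patient and one doctor",
    ],
    notes=["Booking should be fast", "Notifications can be asynchronous"],
)
print(card.render())
```

Numbering the steps makes it easier to refer to them later, for example when deriving which component handles step 3.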

Constraints / NFR card

Format:

one card with bullet points

Example:

Prevent double booking
Fast booking confirmation
Notifications can happen later
Handle peak booking traffic
Patients and doctors must be authenticated

C1 context diagram

Defines:

actors
system boundary
external systems

Step 1 — Derive the simplest C2 from the User Journeys

Principle: Start With The Simplest System That Supports The Journeys

Output

Initial C2 diagram

Follow the user journeys step by step and derive the minimum set of responsibilities, components, interactions, and data needed to support them.

Keep it minimal:

no premature services
no unnecessary infrastructure

Step 2 — Refine architecture boundaries

Principle: Boundaries Should Reflect Ownership, Cohesion, And Rules

Output

Refined C2 diagram
Trade-offs (list)

Review the initial architecture in both directions: merge components where separation is artificial, and extract bounded contexts where a real domain or ownership boundary exists.

Boundary checklist:

merge artificial separations
extract real bounded contexts
align responsibilities and data ownership
define source of truth per concept
ensure rules are enforced in the right place
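
The "source of truth per concept" check in the list above can be done mechanically once each component's writes are listed. The sketch below is illustrative only: the component names, the `writes` mapping, and the function `find_ownership_conflicts` are hypothetical and not part of the framework.

```python
from collections import defaultdict

# Hypothetical mapping: component -> concepts it writes (illustrative names).
writes = {
    "BookingService": ["Appointment", "Slot"],
    "NotificationService": ["Notification"],
    "SearchService": ["Slot"],  # a second writer of Slot: a boundary smell
}


def find_ownership_conflicts(writes: dict[str, list[str]]) -> dict[str, list[str]]:
    """Return every concept that is written by more than one component."""
    owners = defaultdict(list)
    for component, concepts in writes.items():
        for concept in concepts:
            owners[concept].append(component)
    return {c: sorted(o) for c, o in owners.items() if len(o) > 1}


conflicts = find_ownership_conflicts(writes)
print(conflicts)
```

A concept with two writers suggests either merging the components or making one of them a read-only consumer of the other's data.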

Trade-offs format

Each trade-off is captured as:

Decision
Alternative
Why
Trade-off

Example:

Decision: Keep booking and availability in one service
Alternative: Split into separate services
Why: Strong consistency is needed between slot availability and appointments to prevent double booking
Trade-off: Less independent scalability

Step 3 — Refine architecture under stress and failure

Principle: No Mechanism Without A Real Problem

Output

Final / near-final C2
Trade-offs (updated list)

This step is iterative:

apply pressure → identify issue → adjust → repeat

Apply concrete pressures

Load & growth

10x traffic
large data volumes

Concurrency & correctness

concurrent booking of the same slot
retries / duplicate requests

Read vs write imbalance

heavy search vs booking

Latency

response time expectations

Failures

dependency unavailable
timeouts
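
As an illustration of the concurrency pressures listed above (concurrent booking of the same slot, retries and duplicate requests), one possible adjustment is to let the datastore enforce the rule. The sketch below uses Python's standard `sqlite3` module with an in-memory database. The table layout, the `book` function, and the idea of a client-supplied idempotency key are assumptions for illustration, not something the framework prescribes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE appointments (
        slot_id         TEXT PRIMARY KEY,     -- at most one appointment per slot
        patient_id      TEXT NOT NULL,
        idempotency_key TEXT NOT NULL UNIQUE  -- at most one appointment per request
    )
    """
)


def book(slot_id: str, patient_id: str, idempotency_key: str) -> str:
    """Return 'booked', 'duplicate' (retry of a known request), or 'taken'."""
    # A retry of a request that already succeeded is recognised by its key.
    row = conn.execute(
        "SELECT slot_id FROM appointments WHERE idempotency_key = ?",
        (idempotency_key,),
    ).fetchone()
    if row is not None:
        return "duplicate"
    try:
        with conn:  # commits on success, rolls back on exception
            conn.execute(
                "INSERT INTO appointments VALUES (?, ?, ?)",
                (slot_id, patient_id, idempotency_key),
            )
        return "booked"
    except sqlite3.IntegrityError:
        # The constraints, not the SELECT above, are what guarantee correctness.
        # Simplification: a same-key insert that races in here is also
        # reported as 'taken'.
        return "taken"


results = [
    book("slot-1", "patient-a", "req-1"),  # first request wins the slot
    book("slot-1", "patient-b", "req-2"),  # competing request for the same slot
    book("slot-1", "patient-a", "req-1"),  # client retry of the first request
]
print(results)
```

The design choice here is to push the "no double booking" rule into a uniqueness constraint rather than an application-level check, so the rule holds even when two requests arrive at the same time.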

Possible outcomes

no change
caching
async flows
queue/broker
read models
service extraction
transaction boundary clarification
traffic control
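
Whether a pressure leads to "no change" or to a new mechanism can often be decided with a back-of-envelope estimate. The sketch below is a hedged example: every number in it (daily bookings, peak factor, per-instance capacity) is an assumption chosen for illustration, not a measurement, and `peak_qps` is a hypothetical helper.

```python
def peak_qps(requests_per_day: float, peak_factor: float) -> float:
    """Average requests per second, scaled by a peak-to-average factor."""
    return requests_per_day / 86_400 * peak_factor


# All numbers below are assumptions for illustration, not measurements.
BOOKINGS_PER_DAY = 200_000
PEAK_FACTOR = 5              # assumed ratio of peak load to daily average
INSTANCE_CAPACITY_QPS = 100  # assumed sustainable write rate of one instance

verdicts = {}
for growth in (1, 10):
    qps = peak_qps(BOOKINGS_PER_DAY * growth, PEAK_FACTOR)
    verdicts[growth] = (
        "no change" if qps <= INSTANCE_CAPACITY_QPS else "bottleneck: adjust design"
    )
    print(f"{growth}x traffic: {qps:.1f} QPS at peak -> {verdicts[growth]}")
```

Under these assumed numbers the current load needs no change, while the 10x scenario exceeds the assumed capacity and would justify introducing one of the mechanisms above.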

Trade-offs

Update the trade-off list as decisions evolve.

Example:

Decision: Introduce async notifications
Alternative: Keep synchronous
Why: Reduce latency on booking path
Trade-off: Notification delivery is eventual, so a confirmation may arrive after the booking response
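
The async notification decision above can be sketched with an in-process queue and a worker thread from Python's standard library. This is a simplified stand-in: a real system would more likely use a durable queue or broker, since an in-memory queue loses pending notifications if the process crashes. The names `confirm_booking`, `notification_worker`, and the `sent` list are hypothetical.

```python
import queue
import threading

notifications = queue.Queue()
sent = []  # stand-in for an email/SMS gateway


def notification_worker() -> None:
    """Drain the queue; None is the shutdown sentinel."""
    while True:
        appointment_id = notifications.get()
        if appointment_id is None:
            break
        sent.append(f"confirmation for {appointment_id}")


def confirm_booking(appointment_id: str) -> str:
    """Booking path: enqueue the notification and return without waiting."""
    notifications.put(appointment_id)
    return f"{appointment_id} confirmed"


worker = threading.Thread(target=notification_worker)
worker.start()
response = confirm_booking("appt-42")
notifications.put(None)  # shut the sketch down cleanly
worker.join()
print(response, sent)
```

The booking response does not depend on the notification being sent, which is what removes the notification latency from the booking path.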

Step 4 — Ensure operability

Principle: If a System Cannot Be Observed, It Cannot Be Operated

Output

Operability notes (bullet list)
Any required updates to C2

Review the architecture from an operational perspective and make explicit how the system will be monitored, debugged, deployed safely, and recovered when something goes wrong.

Format:

key metrics (latency, errors, throughput)
logging needs
alerting triggers
deployment strategy
rollback approach
recovery approach
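
To make the first and third items of the format above concrete (key metrics and alerting triggers), the sketch below records latency, errors, and throughput for one endpoint and evaluates two alert conditions. It is a minimal illustration under stated assumptions: the class `EndpointMetrics`, the thresholds, and the sample data are all hypothetical, and a production system would normally rely on a metrics library or platform rather than hand-rolled counters.

```python
from dataclasses import dataclass, field


@dataclass
class EndpointMetrics:
    """In-memory metrics for one endpoint (hypothetical helper)."""
    latencies_ms: list[float] = field(default_factory=list)
    errors: int = 0

    def record(self, latency_ms: float, ok: bool) -> None:
        self.latencies_ms.append(latency_ms)
        if not ok:
            self.errors += 1

    @property
    def throughput(self) -> int:
        """Number of requests recorded so far."""
        return len(self.latencies_ms)

    @property
    def error_rate(self) -> float:
        return self.errors / self.throughput if self.throughput else 0.0

    def p95_ms(self) -> float:
        """Nearest-rank 95th percentile; 0.0 when nothing was recorded."""
        if not self.latencies_ms:
            return 0.0
        ordered = sorted(self.latencies_ms)
        rank = (95 * len(ordered) + 99) // 100  # integer ceil(0.95 * n)
        return ordered[rank - 1]

    def alerts(self, max_error_rate: float, max_p95_ms: float) -> list[str]:
        """Return the alert messages whose trigger condition is met."""
        fired = []
        if self.error_rate > max_error_rate:
            fired.append(f"error rate {self.error_rate:.1%} above {max_error_rate:.1%}")
        if self.p95_ms() > max_p95_ms:
            fired.append(f"p95 latency {self.p95_ms():.0f} ms above {max_p95_ms:.0f} ms")
        return fired


booking = EndpointMetrics()
for i in range(20):  # synthetic sample: 20 requests, the last one fails
    booking.record(latency_ms=50 + i * 10, ok=(i != 19))

fired = booking.alerts(max_error_rate=0.01, max_p95_ms=300)
print(booking.throughput, booking.error_rate, booking.p95_ms(), fired)
```

Writing the triggers down as explicit thresholds, even rough ones, forces the operability notes to state what "unhealthy" means for each critical journey.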