System architecture design process
Iterative System Design — Thinking Framework
This framework structures system design as a series of iterations. Instead of designing for scale and complexity upfront, the process begins with the simplest possible system and progressively improves it. The approach is inspired by Google’s Non-Abstract Large Scale Design (NALSD) and extended to explicitly cover operability and system management.
At each iteration, pressure is applied, limitations are identified, and the design is adapted. Mechanisms such as caching, asynchronous processing, or partitioning are introduced only when a concrete bottleneck appears.
This keeps the design incremental, grounded in real constraints, and resistant to unnecessary complexity.
Solution Architecture Design Process
Goal of the exercise
Output
- Final C2 diagram
- Trade-offs (list)
- Key decisions
The goal is to produce a clear, justified architecture. Every other artifact in this process exists to derive or validate it.
Step 0 — Capture requirements
Principle: Requirements Must Be Explicit Before Designing
Output
- User Journey cards
- Constraints / NFR card
- C1 context diagram
Discover the requirements by asking the product owner or interviewer focused questions until the main user journeys, constraints, and system scope are clear.
What follows defines the format of the requirement artifacts.
User Journey cards
Format:
- Name
- Goal
- Steps
- Critical rules
- Notes
Example:
User Journey — Patient books appointment
Goal: Book an available slot with a doctor
Steps
- Patient views available slots
- Patient selects a slot
- System checks availability
- System creates appointment
- System marks slot as booked
- System confirms booking
Critical rules
- A slot cannot be double-booked
- An appointment belongs to one patient and one doctor
Notes
- Booking should be fast
- Notifications can be asynchronous
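One way to keep journey cards consistent is to hold them as structured data. The sketch below is an illustration only: the `UserJourneyCard` class, its field names, and the choice to treat Notes as optional are assumptions that mirror the card format above, not part of the framework.

```python
from dataclasses import dataclass, field


@dataclass
class UserJourneyCard:
    """Hypothetical container mirroring the card format above."""
    name: str
    goal: str = ""
    steps: list[str] = field(default_factory=list)
    critical_rules: list[str] = field(default_factory=list)
    notes: list[str] = field(default_factory=list)  # assumed optional

    def missing_sections(self) -> list[str]:
        """Return the required sections that are still empty."""
        sections = {
            "Goal": self.goal,
            "Steps": self.steps,
            "Critical rules": self.critical_rules,
        }
        return [label for label, value in sections.items() if not value]


card = UserJourneyCard(
    name="Patient books appointment",
    goal="Book an available slot with a doctor",
)
# Sections still to be clarified with the product owner or interviewer.
open_questions = card.missing_sections()
```

Here `open_questions` lists the sections that still need a focused question before design starts.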
Constraints / NFR card
Format:
- one card with bullet points
Example:
- Prevent double booking
- Fast booking confirmation
- Notifications can happen later
- Handle peak booking traffic
- Patients and doctors must be authenticated
C1 context diagram
Defines:
- actors
- system boundary
- external systems
Step 1 — Derive the simplest C2 from the User Journeys
Principle: Start With The Simplest System That Supports The Journeys
Output
- Initial C2 diagram
Follow the user journeys step by step and derive the minimum set of responsibilities, components, interactions, and data needed to support them.
Keep it minimal:
- no premature services
- no unnecessary infrastructure
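As a rough illustration of "simplest system", the booking journey could start as a single in-process component with no extra services or infrastructure. This is a minimal sketch under stated assumptions: `BookingService`, `Appointment`, and the in-memory dictionaries are hypothetical names standing in for one service and its database.

```python
from dataclasses import dataclass
from itertools import count


@dataclass(frozen=True)
class Appointment:
    appointment_id: int
    slot_id: str
    patient_id: str
    doctor_id: str


class BookingService:
    """One component that covers every step of the booking journey."""

    def __init__(self, free_slots: dict[str, str]) -> None:
        # slot_id -> doctor_id, for slots that are still free
        self._free_slots = dict(free_slots)
        self._appointments: list[Appointment] = []
        self._ids = count(1)

    def available_slots(self) -> list[str]:
        # "Patient views available slots"
        return sorted(self._free_slots)

    def book(self, patient_id: str, slot_id: str) -> Appointment:
        # "System checks availability"
        if slot_id not in self._free_slots:
            raise ValueError(f"slot {slot_id} is not available")
        doctor_id = self._free_slots[slot_id]
        # "System creates appointment"
        appointment = Appointment(next(self._ids), slot_id, patient_id, doctor_id)
        self._appointments.append(appointment)
        # "System marks slot as booked"
        del self._free_slots[slot_id]
        # "System confirms booking": the returned appointment is the confirmation
        return appointment
```

Each method maps to a journey step, which makes it easy to see that nothing has been added beyond what the journey requires. Concurrency and failure handling are deliberately left out at this stage.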
Step 2 — Refine architecture boundaries
Principle: Boundaries Should Reflect Ownership, Cohesion, And Rules
Output
- Refined C2 diagram
- Trade-offs (list)
Review the initial architecture in both directions: merge components where separation is artificial, and extract bounded contexts where a real domain or ownership boundary exists.
Use this checklist when adjusting boundaries:
- merge artificial separations
- extract real bounded contexts
- align responsibilities and data ownership
- define source of truth per concept
- ensure rules are enforced in the right place
Trade-offs format
Each trade-off is captured as:
- Decision
- Alternative
- Why
- Trade-off
Example:
- Decision: Keep booking and availability in one service
- Alternative: Split into separate services
- Why: Booking and availability must stay strongly consistent to prevent double booking
- Trade-off: Less independent scalability
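If the trade-off list is kept alongside code or generated into documentation, the four-field format can be captured as a small record. The `TradeOff` class and its `as_bullets` method are hypothetical names used here for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TradeOff:
    """Hypothetical record matching the Decision / Alternative / Why / Trade-off format."""
    decision: str
    alternative: str
    why: str
    trade_off: str

    def as_bullets(self) -> str:
        """Render the record in the same bullet layout used in this document."""
        return "\n".join([
            f"- Decision: {self.decision}",
            f"- Alternative: {self.alternative}",
            f"- Why: {self.why}",
            f"- Trade-off: {self.trade_off}",
        ])


entry = TradeOff(
    decision="Keep booking and availability in one service",
    alternative="Split into separate services",
    why="Strong consistency needed",
    trade_off="Less independent scalability",
)
rendered = entry.as_bullets()
```

Making all four fields mandatory means a decision cannot be recorded without naming the alternative it was weighed against.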
Step 3 — Refine architecture under stress and failure
Principle: No Mechanism Without A Real Problem
Output
- Final / near-final C2
- Trade-offs (updated list)
This step is iterative:
apply pressure → identify issue → adjust → repeat
Apply concrete pressures
Load & growth
- 10x traffic
- large data volumes
Concurrency & correctness
- concurrent booking of the same slot
- retries / duplicate requests
Read vs write imbalance
- search traffic far heavier than booking traffic
Latency
- response time expectations
Failures
- dependency unavailable
- timeouts
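A pressure such as "10x traffic" is easier to reason about once it is turned into numbers. The back-of-envelope sketch below shows the arithmetic; every input (daily bookings, peak factor, per-instance capacity) is an invented assumption for illustration and should be replaced with figures from the actual requirements.

```python
import math


def peak_rps(daily_requests: int, peak_factor: float) -> float:
    """Average requests per second over a day, scaled by a peak multiplier."""
    seconds_per_day = 24 * 60 * 60
    return daily_requests / seconds_per_day * peak_factor


def instances_needed(rps: float, capacity_rps: float) -> int:
    """Instances required to serve `rps`, never fewer than one."""
    return max(1, math.ceil(rps / capacity_rps))


# Illustrative assumptions, not measured values.
DAILY_BOOKINGS = 86_400        # assumed: one booking per second on average
PEAK_FACTOR = 5.0              # assumed: peak is 5x the daily average
INSTANCE_CAPACITY_RPS = 20.0   # assumed: sustained load one instance handles

current_rps = peak_rps(DAILY_BOOKINGS, PEAK_FACTOR)
stressed_rps = peak_rps(DAILY_BOOKINGS * 10, PEAK_FACTOR)

current_instances = instances_needed(current_rps, INSTANCE_CAPACITY_RPS)
stressed_instances = instances_needed(stressed_rps, INSTANCE_CAPACITY_RPS)
```

With these assumed inputs, today's peak is 5 requests per second and fits on one instance, while the 10x case reaches 50 requests per second and needs three. A result like this can legitimately lead to the "no change" outcome if the existing design already scales horizontally.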
Possible outcomes
- no change
- caching
- async flows
- queue/broker
- read models
- service extraction
- transaction boundary clarification
- traffic control
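As an example of how a pressure maps to an outcome, the "concurrent booking of the same slot" and "retries / duplicate requests" pressures often lead to transaction boundary clarification. The sketch below is one possible shape, under assumptions: `SafeBookingService` and the client-supplied `request_id` are hypothetical, and the in-process lock stands in for whatever a real system would use (for instance a database transaction or uniqueness constraint).

```python
import threading


class SlotAlreadyBooked(Exception):
    """Raised when a different request already holds the slot."""


class SafeBookingService:
    def __init__(self, slot_ids: list[str]) -> None:
        self._lock = threading.Lock()
        self._slots = set(slot_ids)
        self._booked: dict[str, str] = {}         # slot_id -> patient_id
        self._confirmations: dict[str, str] = {}  # request_id -> confirmation

    def book(self, request_id: str, patient_id: str, slot_id: str) -> str:
        # The check and the write happen inside one critical section,
        # so two requests can never both see the slot as free.
        with self._lock:
            # A retry with the same request_id gets the original result back.
            if request_id in self._confirmations:
                return self._confirmations[request_id]
            if slot_id not in self._slots:
                raise KeyError(f"unknown slot {slot_id}")
            if slot_id in self._booked:
                raise SlotAlreadyBooked(slot_id)
            self._booked[slot_id] = patient_id
            confirmation = f"appt-{len(self._booked)}"
            self._confirmations[request_id] = confirmation
            return confirmation
```

The key design point is that "check availability" and "mark slot as booked" share one boundary, and that retries are recognised by an identifier rather than treated as new bookings.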
Trade-offs
Update the trade-off list as decisions evolve.
Example:
- Decision: Introduce async notifications
- Alternative: Keep synchronous
- Why: Reduce latency on booking path
- Trade-off: Eventual delivery
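The async notification decision can be sketched with an in-process queue and a background worker. This is a simplified stand-in for a real queue or broker: `NotificationWorker`, `confirm_booking`, and the `send` callback are hypothetical names, and a production system would add retries or a dead-letter path where the sketch simply drops a failed message.

```python
import queue
import threading
from typing import Callable


class NotificationWorker:
    """Delivers notifications off the booking path."""

    def __init__(self, send: Callable[[str], None]) -> None:
        self._queue = queue.Queue()
        self._send = send
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def enqueue(self, message: str) -> None:
        # Returns immediately; delivery happens later on the worker thread.
        self._queue.put(message)

    def _run(self) -> None:
        while True:
            message = self._queue.get()
            try:
                self._send(message)
            except Exception:
                # Sketch only: a real system would retry or dead-letter here.
                pass
            finally:
                self._queue.task_done()

    def drain(self) -> None:
        """Block until every queued message has been processed."""
        self._queue.join()


def confirm_booking(patient_id: str, slot_id: str,
                    notifications: NotificationWorker) -> str:
    # The booking path only enqueues; it does not wait for delivery.
    notifications.enqueue(f"Booking confirmed for {patient_id} in slot {slot_id}")
    return "confirmed"
```

The booking call returns before the notification is sent, which is exactly the eventual-delivery trade-off recorded above.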
Step 4 — Ensure operability
Principle: If A System Cannot Be Observed, It Cannot Be Operated
Output
- Operability notes (bullet list)
- Any required updates to C2
Review the architecture from an operational perspective and make explicit how the system will be monitored, debugged, deployed safely, and recovered when something goes wrong.
Format:
- key metrics (latency, errors, throughput)
- logging needs
- alerting triggers
- deployment strategy
- rollback approach
- recovery approach
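To make the metrics and alerting items concrete, the sketch below records latency and errors for one endpoint and evaluates a simple alert rule. It is an assumption-laden illustration: `EndpointMetrics`, `should_alert`, and the threshold values are hypothetical, and a real system would normally rely on its monitoring stack instead of in-process lists.

```python
from dataclasses import dataclass, field


@dataclass
class EndpointMetrics:
    """Hypothetical in-memory recorder for one endpoint."""
    latencies_ms: list[float] = field(default_factory=list)
    errors: int = 0

    def record(self, latency_ms: float, ok: bool) -> None:
        self.latencies_ms.append(latency_ms)
        if not ok:
            self.errors += 1

    @property
    def requests(self) -> int:
        # Throughput is this count divided by the observation window.
        return len(self.latencies_ms)

    def error_rate(self) -> float:
        return self.errors / self.requests if self.requests else 0.0

    def p95_latency_ms(self) -> float:
        """95th percentile using the nearest-rank method."""
        if not self.latencies_ms:
            return 0.0
        ordered = sorted(self.latencies_ms)
        rank = (95 * len(ordered) + 99) // 100  # integer ceil(0.95 * n)
        return ordered[rank - 1]


def should_alert(metrics: EndpointMetrics,
                 max_error_rate: float = 0.05,
                 max_p95_ms: float = 500.0) -> bool:
    # Thresholds are assumed defaults; derive real ones from the NFR card.
    return (metrics.error_rate() > max_error_rate
            or metrics.p95_latency_ms() > max_p95_ms)
```

Tying the thresholds back to the Constraints / NFR card keeps alerting triggers aligned with what the system actually promised.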