Cloud, distributed system, service, state, replica, failure domain.
Only the OS questions we need for cloud: running work, resources, state, communication, isolation, failure.
Create a Linux VM, place compute and data, then read modern cloud-native architecture patterns.
Read a large system as compute, state, traffic, and boundaries.
Cloud means renting computing resources through APIs.
Example: instead of buying a server, you create a VM, database, load balancer, or storage bucket from a cloud console or script.
Key idea: infrastructure becomes programmable.
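A small sketch of what "programmable" means here: the VM is described as data and handed to an API. The endpoint, project, and field names below are made up for illustration and are not any real provider's API.

```python
import json
import urllib.request

# Illustrative only: the endpoint URL, project name, and field names
# are placeholders, not a real cloud provider's API.
API = "https://cloud.example.com/v1/projects/demo-project/instances"

vm_spec = {
    "name": "web-1",
    "machine_type": "small-2cpu-4gb",
    "zone": "zone-a",
    "boot_image": "debian-12",
}

request = urllib.request.Request(
    API,
    data=json.dumps(vm_spec).encode(),
    headers={"Content-Type": "application/json"},
    method="POST",
)
# A real call would attach credentials and then send the request:
# urllib.request.urlopen(request)
print("would create", vm_spec["name"], "in", vm_spec["zone"])
```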
A distributed system uses multiple machines that cooperate over a network.
Example: one request may touch a web server, auth service, database primary, cache, and queue.
Key idea: the system should feel like one service, even though it is many nodes.
A service is a deployable unit with a clear API.
Example: auth service checks login; order service creates orders; storage service saves files.
Key idea: services split a large system into smaller responsibilities.
State is data the system must remember and protect.
Example: account balance, order status, file contents, shopping cart, session token.
Key idea: stateless compute is easy to replace; stateful data needs a clear owner and location.
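A minimal sketch of that difference, using a made-up in-memory store: the first handler keeps the cart inside one process and loses it when that copy is replaced; the second writes to a shared store that any replica can read.

```python
# In-process state: gone when this copy is replaced.
local_carts = {}

def add_item_in_process(user_id, item):
    local_carts.setdefault(user_id, []).append(item)

# External state: a stand-in for a shared database or cache,
# so any replica can serve the next request.
class FakeStore:
    def __init__(self):
        self.data = {}

    def append(self, key, value):
        self.data.setdefault(key, []).append(value)

store = FakeStore()

def add_item_external(user_id, item):
    store.append(f"cart:{user_id}", item)

add_item_external("u42", "book")
print(store.data)  # {'cart:u42': ['book']}
```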
A replica is a copy placed on another node or location.
Example: one database primary in zone A, read replicas in zone B and zone C.
Key idea: replicas help availability and read scale, but they can disagree for a while.
A failure domain is a boundary where things can fail together.
Example: one machine, one rack, one availability zone, one region, or one network dependency.
Key idea: reliable systems spread replicas across different failure domains.
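A toy placement rule, assuming three made-up zone names, showing what "spread across failure domains" looks like as code.

```python
from itertools import cycle

# Toy placement rule: walk through the failure domains in order,
# so no two replicas share a zone until every zone is used once.
def place_replicas(replica_count, zones):
    zone_cycle = cycle(zones)
    return [(f"replica-{i}", next(zone_cycle)) for i in range(replica_count)]

print(place_replicas(3, ["zone-a", "zone-b", "zone-c"]))
# [('replica-0', 'zone-a'), ('replica-1', 'zone-b'), ('replica-2', 'zone-c')]
```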
Not a full review. Just the few OS questions that help us read cloud systems.
Use OS as a lens, not as a lookup table.
Processes give us the basic question: what is running, and who manages it? In the cloud, the running thing may be a VM, container, function, or service instance.
CPU, memory, disk, and network are still limited. The cloud has to place, resize, and isolate them.
Files become persistent disks, object storage, databases, caches, queues, and logs.
Sockets become APIs, service calls, load balancers, queues, and event streams.
The cloud must separate users, projects, networks, permissions, and machines.
Instead of one machine failing, now a disk, VM, zone, region, dependency, or network path can fail.
Start with one Linux VM, then decide where compute, data, and traffic should live.
A VM is a software-defined computer.
It looks like one machine to the operating system: CPU, memory, disk, and network card.
But those parts are virtual. A hypervisor maps them onto real hardware and keeps different VMs isolated.
Example: UTM or VMware running Kali Linux on your laptop.
Simple picture: one physical server can safely host many smaller "computers."
You are not installing Linux by hand; the VM boots from a prebuilt image.
Instance means a cloud VM: a virtual computer you can SSH into.
Machine type means how much virtual CPU and memory this VM gets.
Control plane is Google's management system. It receives your API request and finds a real server with capacity.
Metadata server is a small cloud endpoint inside the VM. Linux reads it for SSH keys, hostname, network setup, and startup scripts.
Guest environment is Google-provided software inside Linux that helps the VM talk to GCP.
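A quick way to see the metadata server from inside a Compute Engine VM: read one of its standard paths over HTTP with the required header. Run anywhere else, the request simply fails.

```python
import urllib.request

# Only works from inside a Compute Engine VM: the metadata server
# lives at a fixed internal name and requires this exact header.
URL = "http://metadata.google.internal/computeMetadata/v1/instance/hostname"
request = urllib.request.Request(URL, headers={"Metadata-Flavor": "Google"})

try:
    with urllib.request.urlopen(request, timeout=2) as response:
        print("instance hostname:", response.read().decode())
except OSError:
    print("no metadata server here (not running on a GCE VM?)")
```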
A checkout service copy should be replaceable.
If it crashes, the load balancer can send the next request to another copy.
That only works if important state is not trapped inside that one VM.
So the service keeps durable facts in Cloud SQL, files in Cloud Storage, and background tasks in Pub/Sub.
Two service copies run in different zones. The load balancer checks health before sending traffic.
One zone stops responding. That copy fails its health checks and is removed from routing.
Healthy copies keep serving. Data survives because orders and files are outside the failed VM.
Design principle: spread replaceable compute across failure domains; keep important state in managed, replicated storage.
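A toy sketch of that routing behavior, with made-up replica names: the load balancer tracks one health flag per copy and only routes to copies that passed their last check.

```python
import random

# Toy load balancer: keep a health flag per copy and route only to
# copies that passed their last health check.
replicas = {
    "checkout-zone-a": {"healthy": True},
    "checkout-zone-b": {"healthy": True},
}

def route_request():
    healthy = [name for name, state in replicas.items() if state["healthy"]]
    if not healthy:
        raise RuntimeError("no healthy replicas left")
    return random.choice(healthy)

print(route_request())                              # either copy
replicas["checkout-zone-a"]["healthy"] = False      # zone A stops responding
print(route_request())                              # always checkout-zone-b
```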
The newest architecture idea is not just "more servers." It is: describe what you want, let the platform keep it true.
Modern cloud work is often declarative.
You say: "run 3 copies, expose this service, attach this policy, store this secret."
The control plane keeps reconciling reality toward that desired state.
This is why Kubernetes, Terraform, Cloud Run, and managed databases feel less like manual server setup and more like system programming.
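A minimal sketch of that reconciliation idea, with a single made-up field for replica count: each pass compares desired state to observed state and nudges reality one step closer.

```python
# Desired state, as declared by the user.
desired = {"replicas": 3}

# Observed state, as the platform currently sees it.
observed = {"replicas": 1}

def reconcile(desired, observed):
    """One pass of a reconciliation loop: compare, then act on the diff."""
    if observed["replicas"] < desired["replicas"]:
        observed["replicas"] += 1        # start one more copy
        return "started a copy"
    if observed["replicas"] > desired["replicas"]:
        observed["replicas"] -= 1        # stop one extra copy
        return "stopped a copy"
    return "in sync"

# A real control plane repeats this forever; three passes shown here.
for _ in range(3):
    print(reconcile(desired, observed), "->", observed["replicas"], "running")
```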
In modern systems, service-to-service traffic is managed deliberately.
An API gateway handles the front door: auth, rate limits, routing, and public APIs.
A service mesh or platform proxy can handle internal calls: mTLS, retries, timeouts, tracing, and policy.
This does not remove distributed-system problems. It gives teams one place to control the traffic rules.
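A sketch of what "one place for the traffic rules" can mean, with made-up services: every internal call goes through one wrapper that attaches a trace ID and checks a deadline, instead of each caller reinventing those rules. A real mesh or platform proxy enforces this outside the application; this only shows the rules living in one place.

```python
import time
import uuid

# One shared policy for internal calls: a deadline check and a trace ID.
DEADLINE_SECONDS = 0.5

def call_service(name, handler, payload):
    trace_id = str(uuid.uuid4())           # would be propagated to the callee
    start = time.monotonic()
    result = handler(payload)              # stands in for a network call
    elapsed = time.monotonic() - start
    if elapsed > DEADLINE_SECONDS:
        raise TimeoutError(f"{name} missed its deadline (trace {trace_id})")
    print(f"{name} ok in {elapsed:.3f}s, trace {trace_id}")
    return result

# A made-up internal service.
def inventory(payload):
    return {"reserved": payload["item"]}

print(call_service("inventory", inventory, {"item": "book"}))
```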
Now read the same checkout path as a distributed system: independent services, shared user experience.
The work runs on many machines: web servers, workers, VMs, containers, or functions.
The data may live in databases, caches, object stores, queues, or replicated copies.
Requests move through DNS, CDN, load balancers, gateways, and internal service calls.
High-level reading habit: ask where the work runs, where the data lives, and how requests move between them.
More machines can serve more users and more data than one machine can.
Place work or content closer to users so requests travel less far.
Keep copies or capacity in more than one place so the service can continue.
Different services can own different jobs: login, payment, search, storage, analytics.
Teams can deploy, monitor, and scale parts of the system separately.
Put expensive resources only where they are needed, and use managed services when useful.
The browser shows one action. Internally, different teams or services own different facts, and those facts may live on different machines.
The cart service checks item IDs, prices, quantities, coupons, and whether the cart is still valid.
The payment service asks the payment provider for authorization. It should not create an order by itself.
The inventory service reserves stock, often with its own database because product availability changes fast.
The order service writes the final order record. This is usually the durable source of truth.
The system then publishes "order created" so email, shipping, analytics, and search can catch up.
Distributed-system design asks: how do these owners agree on one result when messages are slow, duplicated, or fail?
The system must validate the cart, authorize payment, reserve stock, and commit the order.
If this part fails, the user should not see "order placed."
The system can send email, create a shipping task, update analytics, and refresh search indexes.
A queue stores this follow-up work so checkout does not wait for every slow consumer.
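A small sketch of that decoupling, using Python's standard queue as a stand-in for Pub/Sub: checkout publishes one event and returns, and the email worker drains the queue on its own schedule.

```python
import queue

# Stand-in for Pub/Sub or any message queue: checkout publishes one
# event and returns; slow consumers read it whenever they are ready.
events = queue.Queue()

def checkout(order_id):
    # ... validate cart, authorize payment, commit the order ...
    events.put({"type": "order_created", "order_id": order_id})
    return "order placed"              # the user does not wait for consumers

def email_worker():
    while not events.empty():
        event = events.get()
        print("sending confirmation email for", event["order_id"])

print(checkout("o-1001"))
email_worker()                         # runs later, at its own pace
```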
The caller cannot wait forever. It needs a clear timeout so one slow service does not freeze the whole user request.
Retry only when the operation is safe to repeat, or when the request has an idempotency key.
If payment succeeds but inventory fails, the system may cancel the order or refund later.
Principle: distributed design is often about choosing what must be immediate, what can be delayed, and what must be undone safely.
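A toy sketch of those retry rules, with a made-up payment function: one idempotency key is shared across retries so a duplicated request cannot charge twice, and when retries run out the caller compensates or reports an error instead of guessing.

```python
import time
import uuid

# Toy payment provider: slow twice, then succeeds. The idempotency
# key lets it deduplicate retries of the same charge.
seen_keys = {}
attempt_counter = {"count": 0}

def charge(amount, idempotency_key):
    if idempotency_key in seen_keys:
        return seen_keys[idempotency_key]          # duplicate: same result back
    attempt_counter["count"] += 1
    if attempt_counter["count"] < 3:
        raise TimeoutError("payment provider too slow")
    result = {"status": "authorized", "amount": amount}
    seen_keys[idempotency_key] = result
    return result

def charge_with_retry(amount, max_attempts=3):
    key = str(uuid.uuid4())                        # one key shared by all retries
    for attempt in range(max_attempts):
        try:
            return charge(amount, idempotency_key=key)
        except TimeoutError:
            time.sleep(0.1 * (attempt + 1))        # brief backoff, then retry
    # Out of attempts: undo or surface the failure instead of guessing.
    raise RuntimeError("payment still failing; compensate or report an error")

print(charge_with_retry(42.00))
```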
Which machines, services, containers, or functions run the work?
Where do accounts, orders, files, messages, and logs live?
How do users and services reach the right place?
Which parts are separate services, teams, zones, regions, or providers?
Which APIs create, move, scale, and monitor the resources?
What simple service does the user see on top of all that distribution?
Cloud is not just "someone else's computer."
Cloud turns compute, storage, networking, security, and operations into distributed, programmable services.
Why study OS? To understand resources, abstraction, isolation, concurrency, and failure.
Why study cloud and distributed systems? To understand what happens when those OS problems are split across machines, zones, and regions.
The core job of systems engineering is still the same: turn complex resources into reliable, controllable, understandable abstractions.