Cloud Architecture,
Distributed Systems,
and OS

Today’s Roadmap

1. Vocabulary

Cloud, distributed system, service, state, replica, failure domain.

2. Short OS Reminder

Only the OS questions we need for cloud: running work, resources, state, communication, isolation, failure.

3. Cloud Example

Create a Linux VM, place compute and data, then read modern cloud-native architecture patterns.

4. Distributed Overview

Read a large system as compute, state, traffic, and boundaries.

Concept 1

Cloud

Cloud means renting computing resources through APIs.

Example: instead of buying a server, you create a VM, database, load balancer, or storage bucket from a cloud console or script.

Key idea: infrastructure becomes programmable.

Diagram: Developer → API call → Cloud Platform → VM + DB + Storage, Network + IAM + Logs
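The "infrastructure becomes programmable" idea can be sketched as a resource described in code and handed to a provider API. This is a minimal sketch; the field names are illustrative, not the exact GCP request schema:

```python
# A minimal sketch: infrastructure described as plain data, then submitted
# to a cloud API. Field names here are illustrative, not the real schema.

def vm_request(name: str, zone: str, machine_type: str, image: str) -> dict:
    """Build a VM-creation request as data instead of manual setup."""
    return {
        "name": name,
        "zone": zone,
        "machineType": machine_type,
        "bootDisk": {"sourceImage": image},
    }

req = vm_request("web-1", "us-east1-b", "e2-medium", "ubuntu-2204-lts")
print(req["zone"])  # the same dict could be POSTed to a cloud control plane
```

Because the request is data, it can be versioned, reviewed, and replayed like any other program input.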
Concept 2

Distributed System

A distributed system uses multiple machines that cooperate over a network.

Example: one request may touch a web server, auth service, database primary, cache, and queue.

Key idea: the system should feel like one service, even though it is many nodes.

Diagram: Node A ↔ Node B ↔ Node C, exchanging messages
Concept 3

Service

A service is a deployable unit with a clear API.

Example: auth service checks login; order service creates orders; storage service saves files.

Key idea: services split a large system into smaller responsibilities.

Diagram: Client → Auth Service, Order Service, Storage Service
Concept 4

State

State is data the system must remember and protect.

Example: account balance, order status, file contents, shopping cart, session token.

Key idea: stateless compute is easy to replace; stateful data needs a clear owner and location.

Diagram: two replaceable services → shared state that must survive
Concept 5

Replica

A replica is a copy placed on another node or location.

Example: one database primary in zone A, read replicas in zone B and zone C.

Key idea: replicas improve availability and read scaling, but they can disagree for a while.

Diagram: Primary (writes) → Replica (reads), Replica (backup)
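The "can disagree for a while" point can be sketched as a primary with an asynchronously updated replica. This is a toy model, not a real replication protocol:

```python
# Toy model of asynchronous replication: the replica lags until it applies
# the primary's log, so reads from it can be briefly stale.

class Primary:
    def __init__(self):
        self.data = {}
        self.log = []            # changes not yet shipped to replicas

    def write(self, key, value):
        self.data[key] = value
        self.log.append((key, value))

class Replica:
    def __init__(self):
        self.data = {}

    def apply(self, log):        # replication ships the primary's log
        for key, value in log:
            self.data[key] = value

primary, replica = Primary(), Replica()
primary.write("balance", 100)
print(replica.data.get("balance"))  # None: the replica has not caught up yet
replica.apply(primary.log)
print(replica.data.get("balance"))  # 100: replicas converge eventually
```

The window between the two reads is exactly the "disagree for a while" period.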
Concept 6

Failure Domain

A failure domain is a boundary where things can fail together.

Example: one machine, one rack, one availability zone, one region, or one network dependency.

Key idea: reliable systems spread replicas across different failure domains.

Diagram: Zone A (machine, rack, power) | Zone B (separate failure) | Zone C (another boundary)
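Spreading replicas across failure domains can be sketched as a placement function that round-robins copies over zones. This is a simplification of what real schedulers do:

```python
# Minimal sketch: place N replicas so copies are spread across failure domains.

def place_replicas(zones: list[str], count: int) -> dict[str, int]:
    """Round-robin `count` replicas across the given failure domains."""
    placement = {zone: 0 for zone in zones}
    for i in range(count):
        placement[zones[i % len(zones)]] += 1
    return placement

print(place_replicas(["zone-a", "zone-b", "zone-c"], 3))
# With one replica per zone, losing a single zone loses one copy, not all.
```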
Overview Dictionary

Cloud Dictionary

cloud words for placement
what goes where?
Region
noun
A geographic cloud area.
Example: deploy close to users in New York or California.
Zone
noun
An isolated data-center group inside a region.
Example: run one copy in zone A and another in zone B.
Edge
noun
A cloud location close to users.
Example: CDN cache serves images before traffic reaches the main region.
Control Plane
noun phrase
The API layer that creates and manages resources.
Example: Google Cloud APIs create databases, services, and load balancers.
Overview Dictionary

Distributed Dictionary

distributed words at a high level
how to read the architecture
Node
noun
One machine, process, or service instance in the system.
Example: one web server, worker, VM, or container.
Message
noun
A network communication between parts of the system.
Example: a frontend asks a backend service for data.
Boundary
noun
A line between parts of the system.
Example: service boundary, zone boundary, region boundary, team boundary.
Coordination
noun
The way distributed parts act like one organized system.
Example: one request may touch several services behind the scenes.

Today’s Themes

Operating System — resource manager: processes / memory / files
Distributed System — coordination across nodes: messages / replicas / boundaries
Cloud Platform — operational product layer: APIs / billing / IAM / monitoring
Shared concerns: resources, isolation, failures, state

These are the themes we will use today. Then we apply them to one cloud example and one distributed-system overview.

A Very Short OS Reminder

Not a full review. Just the few OS questions that help us read cloud systems.

OS to Cloud

OS Gives Us the Right Questions

Use OS as a lens, not as a lookup table.

What is running?

Processes give us the basic question. In cloud, the running thing may be a VM, container, function, or service instance.

Who gets resources?

CPU, memory, disk, and network are still limited. The cloud has to place, resize, and isolate them.

Where is state?

Files become persistent disks, object storage, databases, caches, queues, and logs.

How do parts talk?

Sockets become APIs, service calls, load balancers, queues, and event streams.

What is protected?

The cloud must separate users, projects, networks, permissions, and machines.

What can fail?

Instead of one machine failing, now a disk, VM, zone, region, dependency, or network path can fail.

Cloud Example
Online Store on Google Cloud Platform

Start with one Linux VM, then decide where compute, data, and traffic should live.

Local VM First

First: VM on Your Own Laptop

A VM is a software-defined computer.

It looks like one machine to the operating system: CPU, memory, disk, and network card.

But those parts are virtual. A hypervisor maps them onto real hardware and keeps different VMs isolated.

Example: UTM or VMware running Kali Linux on your laptop.

Simple picture: one physical server can safely host many smaller "computers."

Diagram: Physical server (real CPU, RAM, SSD, network cards) → Hypervisor → VM A (Kali Linux), VM B (Windows + API), VM C (database test). The guest OS thinks it owns a computer; the hypervisor is sharing the real one.
VM Vocabulary

ISO, Image, Boot Disk: Where They Appear

Local laptop VM: you do the installation
1. Download ubuntu.iso — the installer file
2. Attach the ISO — UTM / VMware pretend it is a CD/DVD
3. Boot — the installer runs; choose disk and user
4. Install — Linux writes files into the virtual disk
5. Boot disk — your installed VM now acts like a hard drive

Cloud VM: the provider already prepared the installer's result
1. Pick an OS — Ubuntu / Debian, no ISO upload
2. OS image — a pre-installed template: Linux + boot files
3. Boot disk — copied from the image onto a persistent disk
4. Linux VM — boots from that disk, SSH ready

Key difference — Laptop: the ISO is the installer you run. Cloud: the OS image is the already-installed template used to create your boot disk.
Cloud VM

Create a Linux Instance on GCP

You are not installing Linux by hand.

Instance means a cloud VM: a virtual computer you can SSH into.

Machine type means how much virtual CPU and memory this VM gets.

Control plane is Google's management system. It receives your API request and finds a real server with capacity.

Metadata server is a small cloud endpoint inside the VM. Linux reads it for SSH keys, hostname, network setup, and startup scripts.

Guest environment is Google-provided software inside Linux that helps the VM talk to GCP.

Diagram: Create (zone, type, OS) → OS image (Ubuntu / Debian) → Boot disk (persistent disk) → Physical host + hypervisor (virtual CPU / RAM / network presented to the VM; boot from the attached disk) → Linux boots (kernel + services) → Metadata (SSH keys / hostname) → Ready VM (SSH into Linux). The OS is already inside the image; GCP turns that image into a bootable, networked Linux machine.
Physical Distribution

Cloud Placement: Region and Zone

Diagram: us-east1 (New York — checkout), europe-west2 (London — front door), asia-southeast1 (Singapore — reads); traffic routes to nearby regions.
Region: country, latency, law, disaster. Zone: same region, separate failure. Design choice: serve local, replicate carefully.
Real Company Placement

Real Example: Netflix Open Connect

Publicly documented pattern: keep control in the cloud, but move heavy video bytes near viewers.
Step 1: control plane — the Netflix cloud (login, catalog, control) manages users, titles, sessions, analytics, and placement decisions.
Step 2: content placement — Open Connect, the Netflix-owned CDN, copies popular movies and episodes outward before thousands of viewers request them.
Step 3: edge caches — the heavy video bytes sit closer to viewers, often inside or near ISP networks, with a cache near each viewer group.
Step 4: when the viewer presses play — the app still uses cloud control logic, but the video stream is served from the closest useful cache instead of crossing the globe.
Logical Distribution

Cloud Placement: Edge, Services, Data

Diagram: Users (web / mobile / API) → Edge (Cloud DNS, Cloud CDN, Cloud Armor) → Compute (Compute Engine, GKE / Cloud Run: service A, service B, worker — easy to scale and replace) → Stateful data tier (Cloud SQL, Cloud Storage, Memorystore, Pub/Sub — harder to move safely)
Reference Architecture

GCP Example: Main Service Structure

Diagram: Client → Cloud Load Balancing → Compute Engine / GKE / Cloud Run (auth service, business service, worker service — stateless, replaceable) → Cloud SQL (writes) + read replica (reads), Pub/Sub (async work), Cloud Storage (files / logs)
Cloud Case

Checkout Request: Step by Step

One click becomes a short synchronous request path plus slower background work.

Before the user sees success:
1. User clicks checkout
2. Edge: DNS, CDN, WAF
3. Load balancer picks a healthy copy
4. Checkout service validates the cart and coordinates the next calls
5. Payment: authorize the charge
6. Inventory: reserve stock
7. Orders DB: commit the order (the durable truth)

After the order is committed:
8. Queue: email, shipping, logs

The user waits for payment, inventory, and the order commit. The queue is for work that can happen later.
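The synchronous path plus queued follow-up can be sketched as one handler. All the service calls below are stand-ins with hypothetical names:

```python
# Sketch of the checkout path: synchronous steps the user waits for,
# then one enqueue for work that can happen later. All services are fakes.

queue = []  # stands in for Pub/Sub

def validate_cart(cart):      return bool(cart["items"])
def authorize_payment(cart):  return True   # stand-in for the payment provider
def reserve_stock(cart):      return True   # stand-in for the inventory service
def commit_order(cart):       return {"order_id": 42, "items": cart["items"]}

def checkout(cart):
    # The user waits for these steps; any failure aborts the request.
    if not validate_cart(cart):      return "empty cart"
    if not authorize_payment(cart):  return "payment declined"
    if not reserve_stock(cart):      return "out of stock"
    order = commit_order(cart)       # durable truth lives in the orders DB
    # Follow-up work goes to the queue; the user does not wait for it.
    queue.append(("order_created", order["order_id"]))
    return f"order {order['order_id']} placed"

print(checkout({"items": ["book"]}))   # order 42 placed
print(queue)                           # [('order_created', 42)]
```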
Cloud Design Principle

Why Stateless Compute Helps

A checkout service copy should be replaceable.

If it crashes, the load balancer can send the next request to another copy.

That only works if important state is not trapped inside that one VM.

So the service keeps durable facts in Cloud SQL, files in Cloud Storage, and background tasks in Pub/Sub.

Diagram: service A (replaceable), service B (replaceable) → managed state (Cloud SQL: orders; Cloud Storage: files; Pub/Sub: background work) — harder to move, easier to protect centrally
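Replaceability can be sketched as two interchangeable service copies that keep nothing locally. The `store` dict stands in for managed storage such as Cloud SQL:

```python
# Sketch: any copy of a stateless service can do the work, because durable
# facts live in shared managed storage instead of inside one VM.

store = {}  # stands in for Cloud SQL / Cloud Storage

def handle_order(instance_name: str, order_id: int, item: str) -> str:
    store[order_id] = item                   # state goes to managed storage
    return f"{instance_name} accepted order {order_id}"

print(handle_order("copy-a", 1, "book"))     # copy-a accepted order 1
# copy-a crashes; the load balancer sends the next request to copy-b,
# which sees the same durable state and can simply continue.
print(store[1])                              # book
print(handle_order("copy-b", 2, "lamp"))     # copy-b accepted order 2
```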
Cloud Failure Case

If One Zone Has Trouble

Before failure

Two service copies run in different zones. The load balancer checks health before sending traffic.

During failure

One zone stops responding. Requests to that copy fail health checks and are removed from routing.

After reroute

Healthy copies keep serving. Data survives because orders and files are outside the failed VM.

Design principle: spread replaceable compute across failure domains; keep important state in managed, replicated storage.
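The reroute behavior can be sketched as a load balancer that only sends traffic to instances passing their health checks. This is a toy, not a real load-balancing algorithm:

```python
# Toy load balancer: route requests only to instances whose health check passes.

import itertools

def route(instances: dict[str, bool], requests: int) -> list[str]:
    """instances maps name -> healthy?; returns which copy serves each request."""
    healthy = [name for name, ok in instances.items() if ok]
    if not healthy:
        raise RuntimeError("no healthy instances")
    rr = itertools.cycle(healthy)        # round-robin over the healthy copies
    return [next(rr) for _ in range(requests)]

# Before failure: both zones serve traffic.
print(route({"zone-a": True, "zone-b": True}, 4))
# During failure: zone-a fails its health check and is removed from routing.
print(route({"zone-a": False, "zone-b": True}, 4))
```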

Modern Cloud Architecture

Modern Cloud Systems
Are Controlled Systems

The newest architecture idea is not just "more servers." It is: describe what you want, let the platform keep it true.

Modern Pattern 1

Declare the Desired System

Modern cloud work is often declarative.

You say: "run 3 copies, expose this service, attach this policy, store this secret."

The control plane keeps reconciling reality toward that desired state.

This is why Kubernetes, Terraform, Cloud Run, and managed databases feel less like manual server setup and more like system programming.

Diagram: Desired state (3 service copies; policy + image + region) → Control plane (compare, create, repair) → actual resources
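The compare-create-repair loop can be sketched as a reconciler that converges the actual replicas toward the declared count. This is the heart of the Kubernetes-style pattern, heavily simplified:

```python
# Minimal reconciliation loop: compare desired state with actual state,
# then create or delete until they match.

def reconcile(desired: int, actual: list[str]) -> list[str]:
    actual = list(actual)
    while len(actual) < desired:                  # too few copies: create
        actual.append(f"copy-{len(actual) + 1}")
    while len(actual) > desired:                  # too many copies: delete
        actual.pop()
    return actual

state = ["copy-1"]              # only one copy survived a crash
state = reconcile(3, state)     # the control plane repairs toward desired
print(state)                    # ['copy-1', 'copy-2', 'copy-3']
state = reconcile(3, state)     # already converged: nothing changes
print(state)
```

Running the loop again is harmless, which is why the platform can keep reconciling forever.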
Modern Pattern 2

Package Services, Not Servers

Diagram: VM image (whole machine template: OS + packages + config) vs. container image (one service packaged, e.g. checkout:v24, portable across machines) → runtime platform (Kubernetes / Cloud Run / functions) running copy 1, copy 2 — scale up, replace, roll back.
Design principle: ship a repeatable unit; let the platform decide where and how many copies run.
Modern Pattern 3

Traffic Becomes an Architecture Layer

In modern systems, service-to-service traffic is managed deliberately.

An API gateway handles the front door: auth, rate limits, routing, and public APIs.

A service mesh or platform proxy can handle internal calls: mTLS, retries, timeouts, tracing, and policy.

This does not remove distributed-system problems. It gives teams one place to control the traffic rules.

Diagram: Client → Gateway (auth / rate limit, public traffic) → checkout, payment, inventory. Traffic rules: timeout, retry, encryption, trace ID, policy.
Modern Pattern 4

Observe and Roll Out Safely

Diagram: Production service — v23 handles most traffic, v24 gets 5% (canary release) → Telemetry (metrics: latency, errors; logs: what happened; traces: request path) → Decision (healthy: increase rollout; bad: roll back quickly) — operate with evidence.
Design principle: in a distributed system, you cannot inspect one machine and understand the whole story.
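The canary decision can be sketched as a weighted traffic split plus a rule on the observed error rate. The 5% share and 2% threshold are illustrative:

```python
# Sketch of a canary rollout: send a small share of traffic to v24,
# then use telemetry (the canary's error rate) to decide what to do next.

import random

def pick_version(canary_share: float) -> str:
    return "v24" if random.random() < canary_share else "v23"

def decide(canary_errors: int, canary_requests: int, threshold: float = 0.02) -> str:
    """Widen the rollout only if the canary error rate stays under threshold."""
    rate = canary_errors / canary_requests
    return "increase rollout" if rate < threshold else "roll back"

random.seed(0)
served = [pick_version(0.05) for _ in range(1000)]  # roughly 5% hit the canary
print(served.count("v24"), "requests hit v24")
print(decide(canary_errors=0, canary_requests=50))  # increase rollout
print(decide(canary_errors=5, canary_requests=50))  # roll back
```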

Distributed Systems
High-Level View

Now read the same checkout path as a distributed system: independent services, shared user experience.

Distributed Overview

One Service, Many Machines

Diagram: User (sees one service) → Front Door (routes traffic) → service instance A, B, C. Distributed means the inside is split, even when the outside looks like one system.
Distributed Overview

What Gets Distributed?

Compute

The work runs on many machines: web servers, workers, VMs, containers, or functions.

State

The data may live in databases, caches, object stores, queues, or replicated copies.

Traffic

Requests move through DNS, CDN, load balancers, gateways, and internal service calls.

High-level reading habit: ask where the work runs, where the data lives, and how requests move between them.

Distributed Overview

Why Split a System?

Scale

More machines can serve more users and more data than one machine can.

Distance

Place work or content closer to users so requests travel a shorter distance.

Availability

Keep copies or capacity in more than one place so the service can continue.

Specialization

Different services can own different jobs: login, payment, search, storage, analytics.

Operations

Teams can deploy, monitor, and scale parts of the system separately.

Cost

Put expensive resources only where they are needed, and use managed services when useful.

Distributed Overview

A Simple Mental Model

Diagram: One API (what users call: login, checkout, play, search) → Many nodes (machines or processes, each owning part of the work) → Shared view (what the system presents: one account, one order, one video). The job of distributed-system design is to make the split feel organized.
Distributed Case

Checkout Is One Button, Several Owners

The browser shows one action. Internally, different teams or services own different facts, and those facts may live on different machines.

1

Cart service

Checks item IDs, prices, quantities, coupons, and whether the cart is still valid.

2

Payment service

Asks the payment provider for authorization. It should not create an order by itself.

3

Inventory service

Reserves stock, often with its own database because product availability changes fast.

4

Order service

Writes the final order record. This is usually the durable source of truth.

5

Event stream

Publishes "order created" so email, shipping, analytics, and search can catch up.

Distributed-system design asks: how do these owners agree on one result when messages are slow, duplicated, or fail?

Distributed Case

Queue Separates Waiting from Follow-Up

Before returning success

The system must validate the cart, authorize payment, reserve stock, and commit the order.

If this part fails, the user should not see "order placed."

After returning success

The system can send email, create a shipping task, update analytics, and refresh search indexes.

A queue stores this follow-up work so checkout does not wait for every slow consumer.

Diagram: Checkout → Orders DB (committed) → order event saved in the queue → consumers (email, shipping, analytics). The queue turns one fragile long request into a short commit plus retryable background work.
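"Retryable background work" can be sketched as a consumer loop that requeues a failed event instead of dropping it. This toy shows why queue delivery is effectively at-least-once:

```python
# Toy queue consumer: a failed handler leaves the event on the queue, so the
# follow-up work is retried rather than lost. Delivery is at-least-once.

from collections import deque

queue = deque([("order_created", 42)])
attempts = {"email": 0}

def send_email(order_id: int) -> bool:
    attempts["email"] += 1
    return attempts["email"] >= 2        # simulated: fails once, then succeeds

while queue:
    event, order_id = queue.popleft()
    if not send_email(order_id):
        queue.append((event, order_id))  # requeue: checkout already succeeded
print(attempts["email"], "delivery attempts")  # 2
```

Because events can be delivered more than once, consumers should tolerate duplicates.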
Distributed Case

When One Service Is Slow

Timeout

The caller cannot wait forever. It needs a clear timeout so one slow service does not freeze the whole user request.

Retry

Retry only when the operation is safe to repeat, or when the request has an idempotency key.

Compensate

If payment succeeds but inventory fails, the system may cancel the order or refund later.

Principle: distributed design is often about choosing what must be immediate, what can be delayed, and what must be undone safely.
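Safe retries can be sketched with an idempotency key: the callee remembers keys it has already processed, so a resent request does not repeat the side effect. All names here are illustrative:

```python
# Sketch: retries are safe when the operation is idempotent. The payment
# service remembers idempotency keys, so a retried charge is applied once.

charged = {}   # idempotency_key -> amount, kept by the payment service

def charge(key: str, amount: int) -> str:
    if key in charged:            # duplicate request: return the prior result
        return "already charged"
    charged[key] = amount
    return "charged"

def call_with_retry(key: str, amount: int, max_attempts: int = 3) -> str:
    result = "timeout"
    for _ in range(max_attempts):  # simulate the caller resending the same key
        result = charge(key, amount)
    return result

print(call_with_retry("order-42", 100))  # the duplicate sends were no-ops
print(sum(charged.values()))             # 100: the user was charged once
```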

Read Any Cloud Architecture This Way

Compute

Which machines, services, containers, or functions run the work?

State

Where do accounts, orders, files, messages, and logs live?

Traffic

How do users and services reach the right place?

Boundary

Which parts are separate services, teams, zones, regions, or providers?

Control

Which APIs create, move, scale, and monitor the resources?

Experience

What simple service does the user see on top of all that distribution?

The Final Takeaway

Cloud is not just "someone else's computer."

Cloud turns compute, storage, networking, security, and operations into distributed, programmable services.

Why study OS? To understand resources, abstraction, isolation, concurrency, and failure.

Why study cloud and distributed systems? To understand what happens when those OS problems are split across machines, zones, and regions.

From One Machine
to a Whole Cloud

The core job of systems engineering is still the same: turn complex resources into reliable, controllable, understandable abstractions.
