Cloud, distributed system, service, state, replica, failure domain.
Only the OS questions we need for cloud: running work, resources, state, communication, isolation, failure.
Create a Linux VM, place compute and data, then read modern cloud-native architecture patterns.
Read a large system as compute, state, traffic, and boundaries.
Cloud means renting computing resources through APIs.
Example: instead of buying a server, you create a VM, database, load balancer, or storage bucket from a cloud console or script.
Key idea: infrastructure becomes programmable.
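A small sketch of what "programmable" means here: the VM is described as data and handed to an API. The endpoint, project, and field names below are made up for illustration and are not any real provider's API.

```python
import json
import urllib.request

# Illustrative only: the endpoint URL, project name, and field names
# are placeholders, not a real cloud provider's API.
API = "https://cloud.example.com/v1/projects/demo-project/instances"

vm_spec = {
    "name": "web-1",
    "machine_type": "small-2cpu-4gb",
    "zone": "zone-a",
    "boot_image": "debian-12",
}

request = urllib.request.Request(
    API,
    data=json.dumps(vm_spec).encode(),
    headers={"Content-Type": "application/json"},
    method="POST",
)
# A real call would attach credentials and then send the request:
# urllib.request.urlopen(request)
print("would create", vm_spec["name"], "in", vm_spec["zone"])
```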
A distributed system uses multiple machines that cooperate over a network.
Example: one request may touch a web server, auth service, database primary, cache, and queue.
Key idea: the system should feel like one service, even though it is many nodes.
A service is a deployable unit with a clear API.
Example: auth service checks login; order service creates orders; storage service saves files.
Key idea: services split a large system into smaller responsibilities.
State is data the system must remember and protect.
Example: account balance, order status, file contents, shopping cart, session token.
Key idea: stateless compute is easy to replace; stateful data needs a clear owner and location.
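A minimal sketch of that difference, using a made-up in-memory store: the first handler keeps the cart inside one process and loses it when that copy is replaced; the second writes to a shared store that any replica can read.

```python
# In-process state: gone when this copy is replaced.
local_carts = {}

def add_item_in_process(user_id, item):
    local_carts.setdefault(user_id, []).append(item)

# External state: a stand-in for a shared database or cache,
# so any replica can serve the next request.
class FakeStore:
    def __init__(self):
        self.data = {}

    def append(self, key, value):
        self.data.setdefault(key, []).append(value)

store = FakeStore()

def add_item_external(user_id, item):
    store.append(f"cart:{user_id}", item)

add_item_external("u42", "book")
print(store.data)  # {'cart:u42': ['book']}
```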
A replica is a copy placed on another node or location.
Example: one database primary in zone A, read replicas in zone B and zone C.
Key idea: replicas help availability and read scale, but they can disagree for a while.
A failure domain is a boundary where things can fail together.
Example: one machine, one rack, one availability zone, one region, or one network dependency.
Key idea: reliable systems spread replicas across different failure domains.
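A toy placement rule, assuming three made-up zone names, showing what "spread across failure domains" looks like as code.

```python
from itertools import cycle

# Toy placement rule: walk through the failure domains in order,
# so no two replicas share a zone until every zone is used once.
def place_replicas(replica_count, zones):
    zone_cycle = cycle(zones)
    return [(f"replica-{i}", next(zone_cycle)) for i in range(replica_count)]

print(place_replicas(3, ["zone-a", "zone-b", "zone-c"]))
# [('replica-0', 'zone-a'), ('replica-1', 'zone-b'), ('replica-2', 'zone-c')]
```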
Not a full review. Just the few OS questions that help us read cloud systems.
Use OS as a lens, not as a lookup table.
Processes give us the basic question: what is running, and who manages it? In the cloud, the running thing may be a VM, container, function, or service instance.
CPU, memory, disk, and network are still limited. The cloud has to place, resize, and isolate them.
Files become persistent disks, object storage, databases, caches, queues, and logs.
Sockets become APIs, service calls, load balancers, queues, and event streams.
The cloud must separate users, projects, networks, permissions, and machines.
Instead of one machine failing, now a disk, VM, zone, region, dependency, or network path can fail.
Start with one Linux VM, then decide where compute, data, and traffic should live.
A VM is a software-defined computer.
It looks like one machine to the operating system: CPU, memory, disk, and network card.
But those parts are virtual. A hypervisor maps them onto real hardware and keeps different VMs isolated.
Example: UTM or VMware running Kali Linux on your laptop.
Simple picture: one physical server can safely host many smaller "computers."
You are not installing Linux by hand; the VM boots from a prebuilt image.
Instance means a cloud VM: a virtual computer you can SSH into.
Machine type means how much virtual CPU and memory this VM gets.
Control plane is Google's management system. It receives your API request and finds a real server with capacity.
Metadata server is a small cloud endpoint inside the VM. Linux reads it for SSH keys, hostname, network setup, and startup scripts.
Guest environment is Google-provided software inside Linux that helps the VM talk to GCP.
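A quick way to see the metadata server from inside a Compute Engine VM: read one of its standard paths over HTTP with the required header. Run anywhere else, the request simply fails.

```python
import urllib.request

# Only works from inside a Compute Engine VM: the metadata server
# lives at a fixed internal name and requires this exact header.
URL = "http://metadata.google.internal/computeMetadata/v1/instance/hostname"
request = urllib.request.Request(URL, headers={"Metadata-Flavor": "Google"})

try:
    with urllib.request.urlopen(request, timeout=2) as response:
        print("instance hostname:", response.read().decode())
except OSError:
    print("no metadata server here (not running on a GCE VM?)")
```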
A checkout service copy should be replaceable.
If it crashes, the load balancer can send the next request to another copy.
That only works if important state is not trapped inside that one VM.
So the service keeps durable facts in Cloud SQL, files in Cloud Storage, and background tasks in Pub/Sub.
Two service copies run in different zones. The load balancer checks health before sending traffic.
One zone stops responding. That copy fails its health checks and is removed from routing.
Healthy copies keep serving. Data survives because orders and files are outside the failed VM.
Design principle: spread replaceable compute across failure domains; keep important state in managed, replicated storage.
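A toy sketch of that routing behavior, with made-up replica names: the load balancer tracks one health flag per copy and only routes to copies that passed their last check.

```python
import random

# Toy load balancer: keep a health flag per copy and route only to
# copies that passed their last health check.
replicas = {
    "checkout-zone-a": {"healthy": True},
    "checkout-zone-b": {"healthy": True},
}

def route_request():
    healthy = [name for name, state in replicas.items() if state["healthy"]]
    if not healthy:
        raise RuntimeError("no healthy replicas left")
    return random.choice(healthy)

print(route_request())                              # either copy
replicas["checkout-zone-a"]["healthy"] = False      # zone A stops responding
print(route_request())                              # always checkout-zone-b
```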
The newest architecture idea is not just "more servers." It is: describe what you want, let the platform keep it true.
Modern cloud work is often declarative.
You say: "run 3 copies, expose this service, attach this policy, store this secret."
The control plane keeps reconciling reality toward that desired state.
This is why Kubernetes, Terraform, Cloud Run, and managed databases feel less like manual server setup and more like system programming.
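A minimal sketch of that reconciliation idea, with a single made-up field for replica count: each pass compares desired state to observed state and nudges reality one step closer.

```python
# Desired state, as declared by the user.
desired = {"replicas": 3}

# Observed state, as the platform currently sees it.
observed = {"replicas": 1}

def reconcile(desired, observed):
    """One pass of a reconciliation loop: compare, then act on the diff."""
    if observed["replicas"] < desired["replicas"]:
        observed["replicas"] += 1        # start one more copy
        return "started a copy"
    if observed["replicas"] > desired["replicas"]:
        observed["replicas"] -= 1        # stop one extra copy
        return "stopped a copy"
    return "in sync"

# A real control plane repeats this forever; three passes shown here.
for _ in range(3):
    print(reconcile(desired, observed), "->", observed["replicas"], "running")
```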
In modern systems, service-to-service traffic is managed deliberately.
An API gateway handles the front door: auth, rate limits, routing, and public APIs.
A service mesh or platform proxy can handle internal calls: mTLS, retries, timeouts, tracing, and policy.
This does not remove distributed-system problems. It gives teams one place to control the traffic rules.
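A sketch of what "one place for the traffic rules" can mean, with made-up services: every internal call goes through one wrapper that attaches a trace ID and checks a deadline, instead of each caller reinventing those rules. A real mesh or platform proxy enforces this outside the application; this only shows the rules living in one place.

```python
import time
import uuid

# One shared policy for internal calls: a deadline check and a trace ID.
DEADLINE_SECONDS = 0.5

def call_service(name, handler, payload):
    trace_id = str(uuid.uuid4())           # would be propagated to the callee
    start = time.monotonic()
    result = handler(payload)              # stands in for a network call
    elapsed = time.monotonic() - start
    if elapsed > DEADLINE_SECONDS:
        raise TimeoutError(f"{name} missed its deadline (trace {trace_id})")
    print(f"{name} ok in {elapsed:.3f}s, trace {trace_id}")
    return result

# A made-up internal service.
def inventory(payload):
    return {"reserved": payload["item"]}

print(call_service("inventory", inventory, {"item": "book"}))
```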
Now read the same checkout path as a distributed system: independent services, shared user experience.
The work runs on many machines: web servers, workers, VMs, containers, or functions.
The data may live in databases, caches, object stores, queues, or replicated copies.
Requests move through DNS, CDN, load balancers, gateways, and internal service calls.
High-level reading habit: ask where the work runs, where the data lives, and how requests move between them.
More machines can serve more users and more data than one machine can.
Place work or content closer to users so requests travel less far.
Keep copies or capacity in more than one place so the service can continue.
Different services can own different jobs: login, payment, search, storage, analytics.
Teams can deploy, monitor, and scale parts of the system separately.
Put expensive resources only where they are needed, and use managed services when useful.
The browser shows one action. Internally, different teams or services own different facts, and those facts may live on different machines.
The cart service checks item IDs, prices, quantities, coupons, and whether the cart is still valid.
The payment service asks the payment provider for authorization. It should not create an order by itself.
The inventory service reserves stock, often with its own database because product availability changes fast.
The order service writes the final order record. This is usually the durable source of truth.
The system then publishes "order created" so email, shipping, analytics, and search can catch up.
Distributed-system design asks: how do these owners agree on one result when messages are slow, duplicated, or fail?
The system must validate the cart, authorize payment, reserve stock, and commit the order.
If this part fails, the user should not see "order placed."
The system can send email, create a shipping task, update analytics, and refresh search indexes.
A queue stores this follow-up work so checkout does not wait for every slow consumer.
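A small sketch of that decoupling, using Python's standard queue as a stand-in for Pub/Sub: checkout publishes one event and returns, and the email worker drains the queue on its own schedule.

```python
import queue

# Stand-in for Pub/Sub or any message queue: checkout publishes one
# event and returns; slow consumers read it whenever they are ready.
events = queue.Queue()

def checkout(order_id):
    # ... validate cart, authorize payment, commit the order ...
    events.put({"type": "order_created", "order_id": order_id})
    return "order placed"              # the user does not wait for consumers

def email_worker():
    while not events.empty():
        event = events.get()
        print("sending confirmation email for", event["order_id"])

print(checkout("o-1001"))
email_worker()                         # runs later, at its own pace
```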
The caller cannot wait forever. It needs a clear timeout so one slow service does not freeze the whole user request.
Retry only when the operation is safe to repeat, or when the request has an idempotency key.
If payment succeeds but inventory fails, the system may cancel the order or refund later.
Principle: distributed design is often about choosing what must be immediate, what can be delayed, and what must be undone safely.
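A toy sketch of those retry rules, with a made-up payment function: one idempotency key is shared across retries so a duplicated request cannot charge twice, and when retries run out the caller compensates or reports an error instead of guessing.

```python
import time
import uuid

# Toy payment provider: slow twice, then succeeds. The idempotency
# key lets it deduplicate retries of the same charge.
seen_keys = {}
attempt_counter = {"count": 0}

def charge(amount, idempotency_key):
    if idempotency_key in seen_keys:
        return seen_keys[idempotency_key]          # duplicate: same result back
    attempt_counter["count"] += 1
    if attempt_counter["count"] < 3:
        raise TimeoutError("payment provider too slow")
    result = {"status": "authorized", "amount": amount}
    seen_keys[idempotency_key] = result
    return result

def charge_with_retry(amount, max_attempts=3):
    key = str(uuid.uuid4())                        # one key shared by all retries
    for attempt in range(max_attempts):
        try:
            return charge(amount, idempotency_key=key)
        except TimeoutError:
            time.sleep(0.1 * (attempt + 1))        # brief backoff, then retry
    # Out of attempts: undo or surface the failure instead of guessing.
    raise RuntimeError("payment still failing; compensate or report an error")

print(charge_with_retry(42.00))
```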
Which machines, services, containers, or functions run the work?
Where do accounts, orders, files, messages, and logs live?
How do users and services reach the right place?
Which parts are separate services, teams, zones, regions, or providers?
Which APIs create, move, scale, and monitor the resources?
What simple service does the user see on top of all that distribution?
Cloud is not just "someone else's computer."
Cloud turns compute, storage, networking, security, and operations into distributed, programmable services.
Why study OS? To understand resources, abstraction, isolation, concurrency, and failure.
Why study cloud and distributed systems? To understand what happens when those OS problems are split across machines, zones, and regions.
The core job of systems engineering is still the same: turn complex resources into reliable, controllable, understandable abstractions.