Advanced Operating Systems
Abstraction, resource management, scheduling, protection, and memory
Lecture 2 draft | Reading-driven slides
The real question is: where do OS ideas reappear in the AI era?
Classic OS manages processes / pages / files / devices
AI systems manage requests / agents / context / KV cache / tools
Training: update model weights. Offline, expensive, and often multi-day or multi-month.
Goal: learn the parameters.
Inference: keep weights fixed. Given a prompt, generate output one token at a time.
Goal: low latency, high throughput, controlled cost.
OS perspective: inference is an online serving system, not a single neural-network call.
Example: when 500 students ask ChatGPT questions during office hours, the system must decide which requests run together, how much GPU memory each gets, and how to keep latency acceptable.
A model does not write the whole answer at once; it repeatedly predicts the next token.
User input:
Explain virtual memory.
The model does not directly process strings; it processes token IDs.
Each token ID becomes a vector that enters the Transformer.
Citation: Vaswani et al., Attention Is All You Need, Sections 1, 3, and 3.2.
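A minimal sketch of this pipeline, string → token IDs → vectors. The toy vocabulary, the IDs, and the hidden size are all invented for illustration; real systems use learned subword tokenizers and much larger embedding tables.

```python
import numpy as np

# Toy vocabulary; real tokenizers (e.g., BPE) learn subword pieces.
vocab = {"Explain": 0, " virtual": 1, " memory": 2, ".": 3}
token_ids = [vocab[t] for t in ["Explain", " virtual", " memory", "."]]

d_model = 8                                # hidden size (toy value)
embedding_table = np.random.randn(len(vocab), d_model)

x = embedding_table[token_ids]             # shape: (4, d_model)
# x, not the string, is what enters the Transformer.
```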
Without a cache: generating token 1001 recomputes K/V for the first 1000 tokens in every layer.
That is a lot of repeated work.
With a cache: compute each token's K/V once, store it, and reuse it in later decoding steps.
Trade memory for time.
KV cache is the working set of inference: longer context and more concurrent requests require more GPU memory.
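A toy count of K/V computations per layer makes the tradeoff concrete; the numbers below are illustrative arithmetic, not measurements.

```python
# Count K/V computations needed to generate n tokens, per layer.
def kv_computations(n, cache):
    total = 0
    for step in range(1, n + 1):
        if cache:
            total += 1       # only the newest token's K/V is computed
        else:
            total += step    # recompute K/V for the entire history
    return total

print(kv_computations(1000, cache=False))  # 500500: quadratic in length
print(kv_computations(1000, cache=True))   # 1000: linear in length
```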
Input: "Explain virtual memory."
1. predict "Virtual"
2. context += "Virtual"; predict " memory"
3. context += " memory"; predict " is"
4. context += " is"; predict " an"
5. ... repeat until stop token
Every step reads the historical KV cache. Longer answers make the cache larger; more users make it harder to manage.
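A minimal greedy decode loop showing where the cache lives. The `model.prefill` and `model.forward` methods are hypothetical stand-ins for a real runtime, not an actual serving API.

```python
def generate(model, prompt_ids, stop_id, max_new_tokens=256):
    kv_cache = []                        # grows by one entry per token
    # Prefill: process the whole prompt once, filling the cache.
    logits = model.prefill(prompt_ids, kv_cache)
    output = []
    for _ in range(max_new_tokens):
        next_id = int(logits.argmax())   # greedy next-token choice
        if next_id == stop_id:
            break
        output.append(next_id)
        # Decode: one new token in; all cached history is reused.
        logits = model.forward(next_id, kv_cache)
    return output
```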
Citation: Stoica et al., A Berkeley View of Systems Challenges for AI, Sections 3-4.
Turn complex, heterogeneous, changing machinery into stable interfaces
| Classic OS | Abstraction | AI-era equivalent |
|---|---|---|
| CPU | Process / thread | LLM request / agent task |
| Physical memory | Virtual memory | Context window / KV cache blocks |
| Disk / devices | File descriptor | Tool handle / API capability |
| Network | Socket | Model endpoint / MCP server / remote tool |
| Kernel services | Syscalls | Agent kernel calls: memory, storage, tool, LLM |
Imagine a travel agent that can read your calendar, book flights, pay with a card, and email receipts.
Should the OS manage it like an ordinary app?
Or do we need a new agent OS abstraction?
Citation: Mei et al., AIOS: LLM Agent Operating System, Figure 1 and Introduction.
A single task touches preferences, APIs, disk upload, payment software, calendar software, and text generation.
Some services are LLM-managed; others are OS-managed. The agent must coordinate both safely.
This needs resource management, scheduling, access control, and auditability, not only prompting.
Citation: AIOS GitHub README, Architecture Overview; Mei et al., AIOS, Sections 2-3.
Agents are no longer passive apps: they call LLMs, tools, files, memory, and external APIs.
AIOS inserts an agent kernel between applications and resources, exposing agent-level system calls.
Scheduling, context switching, memory, storage, tools, and access control move into one control plane.
README: AIOS “embeds large language model (LLM) into the operating system.”
README: the AIOS kernel is an “abstraction layer over the operating system kernel.”
Agents are treated like applications; the AIOS Kernel exposes services for LLM, memory, storage, tools, and access control.
Citation: AIOS GitHub README, “Modules and Connections”; Mei et al., AIOS, Sections 3.2-3.8.
Agent app calls SDK modules.
Calls become AIOS syscalls.
Queues and managers control resources.
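A rough sketch of that flow; the names here (`AgentSyscall`, `AgentKernel`, `submit`) are invented for illustration and are not the actual AIOS SDK API.

```python
import queue

class AgentSyscall:
    def __init__(self, agent_id, kind, payload):
        self.agent_id = agent_id   # which agent issued the call
        self.kind = kind           # "llm" | "memory" | "storage" | "tool"
        self.payload = payload

class AgentKernel:
    def __init__(self):
        # One queue per resource class, drained by per-resource managers.
        self.queues = {k: queue.Queue()
                       for k in ("llm", "memory", "storage", "tool")}

    def submit(self, call):
        # The kernel, not the agent, decides when the call actually runs.
        self.queues[call.kind].put(call)

kernel = AgentKernel()
kernel.submit(AgentSyscall("travel-agent", "tool", {"name": "book_flight"}))
```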
Scarce resources + many competitors + dynamic demand
GPU memory: model weights, activations, and KV cache.
GPU compute: prefill is compute-heavy; decode is often memory-bound.
Context window: prompt, history, retrieved documents, and tool outputs.
External tools: API calls, cost, and rate limits.
Classic OS problems reappear: allocation, fragmentation, sharing, reclamation, accounting.
Citation: Kwon et al., Efficient Memory Management for Large Language Model Serving with PagedAttention, Sections 3-4.
When a token passes through each Transformer layer, that layer produces the token's Key and Value.
Example: after reading “The OS manages memory,” layer 1 stores K(The), V(The), K(OS), V(OS), K(manages), V(manages), K(memory), V(memory).
Each one is a numeric fingerprint, such as K(The) = [0.12, -0.48, ...], not the word itself.
Prefill: compute K/V for all prompt tokens and write them into cache.
Decode: for every new token, read all historical K/V and append the new token's K/V.
The next step reuses cached history instead of recomputing it.
This is a classic space-time tradeoff.
KV bytes per request = 2 (K and V) × n_layers × n_heads × head_dim × seq_len × bytes per element
Output length is unknown, so KV cache grows dynamically during decode.
Words are shown in examples only as labels; the real cache stores numeric hidden vectors.
Citation: Kwon et al., PagedAttention, Section 4.1.
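Plugging in a Llama-2-7B-like shape shows the scale; the model dimensions below are assumptions for illustration, not measured values.

```python
# Back-of-the-envelope KV cache size using the formula above.
n_layers, n_heads, head_dim = 32, 32, 128   # assumed 7B-class shape
seq_len = 4096
bytes_per_element = 2                        # fp16

kv_bytes = 2 * n_layers * n_heads * head_dim * seq_len * bytes_per_element
print(f"{kv_bytes / 2**30:.1f} GiB per request")   # 2.0 GiB
```

At roughly 2 GiB per 4K-token request, an 80 GiB GPU holds only a few dozen full-length caches alongside the weights, which is why allocation policy matters.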
Each request has unknown output length; preallocating max length wastes GPU memory, while dynamic growth fragments it.
Split KV cache into fixed-size blocks and use a block table to map logical token history to physical GPU blocks.
Requests can grow on demand, physical blocks need not be contiguous, and shared prefixes can reuse blocks.
| Virtual Memory | PagedAttention / vLLM | Why it matters |
|---|---|---|
| Virtual page | Logical KV block | The request sees a continuous token history |
| Physical frame | Physical KV block | GPU memory can be non-contiguous |
| Page table | Block table | Indirection from logical to physical blocks |
| Demand paging | Allocate KV blocks on demand | Reduce reserved waste |
| Copy-on-write | Prefix sharing | Beam search / parallel samples share prompt KV |
Before PagedAttention: reserve maximum output length, or grow dynamically and suffer fragmentation.
Result: low batch size, wasted GPU memory.
With PagedAttention: KV cache becomes fixed-size blocks mapped on demand.
The problem becomes block allocation plus scheduling.
Core idea: indirection buys flexibility.
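A simplified Python model of the block table and on-demand allocation, tracking only the logical-to-physical mapping, not the actual GPU memory (after Kwon et al., Sections 3-4).

```python
BLOCK_SIZE = 16                        # tokens per KV block (illustrative)

class BlockAllocator:
    def __init__(self, num_physical_blocks):
        self.free = list(range(num_physical_blocks))

    def alloc(self):
        if not self.free:
            raise MemoryError("out of KV blocks: preempt or swap a request")
        return self.free.pop()

class Request:
    def __init__(self, allocator):
        self.allocator = allocator
        self.block_table = []          # logical block index -> physical id
        self.num_tokens = 0

    def append_token(self):
        # A new physical block is allocated only when the last one fills.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(self.allocator.alloc())
        self.num_tokens += 1

alloc = BlockAllocator(num_physical_blocks=1024)
req = Request(alloc)
for _ in range(40):
    req.append_token()
print(req.block_table)   # 3 blocks for 40 tokens; ids need not be contiguous
```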
Who runs first? For how long? When do we preempt?
Interactive request: “Translate this sentence.”
Latency matters most.
Batch request: “Summarize this 200-page PDF.”
Throughput and memory footprint matter.
Agent task: browse the web, write code, run tests, then respond.
Includes LLM calls, tools, and state.
Larger batches improve throughput, but latency and memory can explode.
System point: LLM serving has phase-specific resource behavior; schedulers can exploit that difference.
Prefill and decode interfere when they share the same GPU pool: one is bulk prompt processing, the other is token-by-token streaming.
Separate queues, batch sizes, or GPU pools can be used for prefill and decode; some systems disaggregate the phases.
Better time-to-first-token for prompts, steadier token streaming for decode, and fewer long requests blocking short ones.
Prefill: process the prompt and build the initial KV cache.
Usually more compute-heavy; affects time-to-first-token.
Decode: generate one new token at a time while repeatedly reading and writing the KV cache.
Usually more memory/bandwidth-sensitive; affects token streaming.
Advanced OS question: should these phases share the same GPUs, or should they be scheduled like different job classes?
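One way to make the question concrete: a toy scheduler with separate queues per phase. The admission policy is invented for illustration; production schedulers are far more involved.

```python
from collections import deque

prefill_q, decode_q = deque(), deque()   # requests waiting in each phase

def schedule_step(max_decode_batch=8):
    # Decode first: keep token streams moving for admitted requests.
    n = min(max_decode_batch, len(decode_q))
    batch = [decode_q.popleft() for _ in range(n)]
    # Admit at most one new prompt per step so prefill bursts
    # cannot starve decode (a crude anti-head-of-line-blocking rule).
    if prefill_q:
        batch.append(prefill_q.popleft())
    return batch
```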
A normal program context switch saves registers, PC, and address-space state.
An LLM agent context switch may need to save: the conversation so far, the KV cache (or handles to it), in-flight tool calls and their results, and the agent's working memory.
If multiple agents share one LLM runtime, the system needs scheduling, suspension, resumption, fairness, and access control.
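One hypothetical snapshot structure for such a context switch; every field name below is invented for illustration (the AIOS paper's context manager, for comparison, snapshots intermediate LLM generation state).

```python
from dataclasses import dataclass, field

@dataclass
class AgentContext:
    agent_id: str
    conversation: list        # prompt plus generated text so far
    kv_cache_blocks: list     # handles to cached K/V blocks, not copies
    pending_tool_calls: list  # in-flight tool invocations to resume
    memory_refs: list = field(default_factory=list)  # external memory keys

# Suspending an agent persists one of these; resuming restores it and
# continues decoding mid-generation.
saved = AgentContext("travel-agent", ["Book me a flight"], [12, 47], [])
```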
AI agents make security boundaries semantic
Prompt injection hides commands inside normal data, so the dangerous part is semantic, not just a syscall pattern.
Route tool calls through capability checks, data-flow checks, sandboxing, and human approval for high-risk actions.
The agent can still use tools, but the runtime enforces least privilege and blocks unsafe data movement.
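A sketch of that mediation layer; the capability table, the risk list, and the `run_sandboxed` stub are all invented for illustration.

```python
CAPABILITIES = {"travel-agent": {"read_calendar", "search_flights", "book_flight"}}
HIGH_RISK = {"book_flight", "send_email"}

def run_sandboxed(tool, args):
    # Stand-in for real sandboxed execution (VM, container, etc.).
    return f"executed {tool}({args})"

def mediate(agent_id, tool, args, approve_fn):
    # Capability check: least privilege, enforced outside the prompt.
    if tool not in CAPABILITIES.get(agent_id, set()):
        raise PermissionError(f"{agent_id} lacks capability: {tool}")
    # Human approval gate for high-risk actions.
    if tool in HIGH_RISK and not approve_fn(agent_id, tool, args):
        raise PermissionError(f"approval denied for {tool}")
    return run_sandboxed(tool, args)

mediate("travel-agent", "read_calendar", {}, approve_fn=lambda *a: True)
```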
Citation: AIOS GitHub README, “Computer-use Specialized Architecture.”
README: the Tool Manager is redesigned to include a VM Controller and MCP Server.
README: this creates a sandboxed environment for computer interaction.
For computer-use agents, tool access becomes device access. The runtime needs isolation, mediation, and auditability.
The context window is the new scarce memory
The context window is small. A book, old notes, and a long conversation cannot all stay in the prompt.
Keep active information in main context; summarize or store old information externally; retrieve relevant facts when needed.
The agent behaves as if it has long-term memory while still using a fixed-context model.
Citation: Packer et al., MemGPT: Towards LLMs as Operating Systems, Sections 1-2.
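A minimal sketch of that eviction loop, assuming caller-supplied `count_tokens` and `summarize` functions; the FIFO policy here is a placeholder, not MemGPT's actual policy.

```python
MAX_CONTEXT_TOKENS = 8192   # illustrative budget

def maybe_evict(context, archive, count_tokens, summarize):
    # context: messages currently in the prompt; archive: external store.
    # Assumes summarize() returns something much shorter than its input.
    while count_tokens(context) > MAX_CONTEXT_TOKENS:
        oldest = context.pop(0)               # evict the oldest message
        archive.append(oldest)                # full text leaves the prompt
        context.insert(0, summarize(oldest))  # a short summary stays behind
```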
| Paper | Role in lecture | Suggested section |
|---|---|---|
| Berkeley View | Why AI is a systems problem | Abstract, Sections 3-4 |
| Attention Is All You Need | Q/K/V and Transformer background | Abstract, Sections 1, 3.2 |
| PagedAttention | Virtual memory / paging applied to KV cache | Sections 3-4 |
| MemGPT | Context window as memory hierarchy | Sections 1-2 |
| AIOS | Agent kernel: scheduling, context, memory, access | Abstract, Sections 2-3 |
Q1. Is PagedAttention a successful transfer of an OS idea, or an application-specific hack?
Q2. Should agent tool permissions be controlled by prompts, or by a kernel/control plane?
Q3. Who should decide the context-window replacement policy: the LLM, the runtime, or the user?
Q4. Is AIOS closer to a monolithic kernel or a microkernel?
AI does not make OS obsolete; it makes OS ideas central again.
Next: deeper dive into one case study
Option A: PagedAttention / KV cache memory manager
Option B: Agent OS / sandboxing / access control
Option C: MemGPT / long-term agent memory