Part 4: Storage Management

Operating System Concepts — Silberschatz, Galvin, Gagne

Ch 10 Mass Storage • Ch 11 File-System Interface
Ch 12 File-System Implementation • Ch 13 I/O Systems

The Storage Hierarchy

Registers: < 1 ns
Cache (L1/L2/L3): 1 - 10 ns
Main Memory (RAM): 50 - 100 ns • volatile
SSD / Flash: 25 - 100 μs • non-volatile
Hard Disk (HDD): 5 - 10 ms • non-volatile
Tape / Cloud Archive: seconds to minutes

Toward the top: faster, smaller, costlier. Toward the bottom: slower, larger, cheaper.

Mass-Storage Systems

HDD • SSD • Disk Scheduling • RAID

Ch 10

Hard Disk Drive — Anatomy

A drive is built from platters divided into tracks and sectors, read by heads on an arm, spinning on a spindle at 5,400 - 15,000 RPM. A cylinder is the same track position across all platters.

Access Time

Typical breakdown: seek ≈ 5 ms, rotation ≈ 4 ms, transfer ≈ 0.03 ms. Mechanical delays account for 99%+ of the time, so random I/O is extremely expensive on an HDD.
Ch 10

Disk Access Time Breakdown

Access Time = Seek + Rotation + Transfer

Seek Time: move the arm to the target track, ~3-9 ms. Dominant cost! This is what scheduling algorithms optimize.
Rotational Latency: wait for the sector to spin under the head, on average ½ rotation. 7200 RPM → ~4.2 ms avg; 15K RPM → 2 ms avg.
Transfer Time: read/write the actual data, ~0.05 ms per sector for a sequential read. Negligible! SATA III tops out at 600 MB/s.

Seek + Rotation = 99% of access time — that's why disk scheduling matters
Ch 10

Practice: Disk Access Time

Practice: A 15,000 RPM hard disk has an average seek time of 4 ms. Each track has 500 sectors of 512 bytes.

(a) Avg rotational latency?
    Rotation period = 60 s / 15,000 = 4 ms per rotation → avg latency = 4 / 2 = 2 ms
(b) Transfer time per sector?
    Time per track = 4 ms, sectors per track = 500 → transfer = 4 / 500 = 0.008 ms
(c) Total avg access time?
    Seek + rotation + transfer = 4 + 2 + 0.008 = 6.008 ms ≈ 6 ms

⚡ Bonus: the same operation on an NVMe SSD takes ~0.1 ms. That's about 60× faster!
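The same arithmetic as a small C sketch; the drive parameters are the ones from this practice problem, purely illustrative:

#include <stdio.h>

/* Average access-time estimate for a hard disk, following the
 * seek + rotational-latency + transfer breakdown above. */
int main(void) {
    double rpm = 15000.0;            /* spindle speed     */
    double seek_ms = 4.0;            /* average seek time */
    int sectors_per_track = 500;

    double rotation_ms = 60000.0 / rpm;         /* one full rotation      */
    double rot_latency_ms = rotation_ms / 2.0;  /* average = half a turn  */
    double transfer_ms = rotation_ms / sectors_per_track;

    printf("rotational latency: %.3f ms\n", rot_latency_ms);   /* 2.000 */
    printf("transfer/sector:    %.3f ms\n", transfer_ms);      /* 0.008 */
    printf("total access time:  %.3f ms\n",
           seek_ms + rot_latency_ms + transfer_ms);            /* 6.008 */
    return 0;
}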
Ch 10

SSD vs HDD

Metric       | HDD        | SSD (NVMe)
Random IOPS  | ~100       | ~100,000
Seq. read    | 200 MB/s   | 3,500 MB/s
Latency      | ~10 ms     | ~0.1 ms
Cost         | $0.02/GB   | $0.08/GB
Caveats      | Moving parts, vibration sensitive | No moving parts, wear leveling needed

SSDs are roughly 1000× faster for random I/O.
Ch 10

Disk Scheduling — Setup

Goal

Minimize total seek distance when servicing queued I/O requests

Request queue: 98, 183, 37, 122, 14, 124, 65, 67

Head starts at: cylinder 53

Range: 0 – 199

Cylinders 0 - 199, head at 53. Pending requests: 14, 37, 65, 67, 98, 122, 124, 183.

Algorithms we'll compare:
FCFS — First Come First Served
SSTF — Shortest Seek Time First
SCAN — Elevator Algorithm
C-SCAN — Circular SCAN
LOOK / C-LOOK — Practical variants
Ch 10

FCFS vs SSTF

FCFS — 640 cylinders

Service order: 53 → 98 → 183 → 37 → 122 → 14 → 124 → 65 → 67. A wild zigzag across the disk!

SSTF — 236 cylinders

Service order: 53 → 65 → 67 → 37 → 14 → 98 → 122 → 124 → 183. Much smoother, but distant requests may starve.
Ch 10

SCAN & C-SCAN

SCAN (Elevator)

Starting at 53, the head sweeps in one direction servicing requests, reaches the end of the disk, reverses, and continues. No starvation • bounded wait.

C-SCAN (Circular)

The head services requests in one direction only; at the end it jumps back to the start without servicing anything on the way. More uniform wait times than SCAN.
Ch 10

LOOK & C-LOOK (Practical Variants)

LOOK

SCAN travels all the way to the end of the disk (cylinder 0 or 199); LOOK reverses at the last pending request in each direction.

C-LOOK

C-SCAN goes to the end of the disk and wraps to cylinder 0; C-LOOK wraps back to the first (lowest) pending request instead.

LOOK/C-LOOK are the algorithms actually used in real operating systems — more efficient than pure SCAN/C-SCAN

Ch 10

Scheduling Comparison

Total head movement (cylinders) for the example queue:
FCFS 640 • C-SCAN 382 • C-LOOK 322 • SCAN 236 • SSTF 236 • LOOK 208

SSTF/LOOK best for light loads • SCAN/C-SCAN best for heavy loads • SSDs need no seek scheduling!
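A minimal SSTF sketch in C, run against the example queue above; the greedy choose-the-nearest loop is the whole algorithm:

#include <stdio.h>
#include <stdlib.h>

/* SSTF: repeatedly service the pending request closest to the current
 * head position and accumulate total head movement. */
int main(void) {
    int req[] = {98, 183, 37, 122, 14, 124, 65, 67};
    int n = sizeof req / sizeof req[0];
    int head = 53, total = 0;

    for (int served = 0; served < n; served++) {
        int best = -1, best_dist = 1 << 30;
        for (int i = 0; i < n; i++) {
            if (req[i] < 0) continue;              /* already serviced */
            int d = abs(req[i] - head);
            if (d < best_dist) { best_dist = d; best = i; }
        }
        total += best_dist;
        head = req[best];
        req[best] = -1;                            /* mark as serviced */
    }
    printf("SSTF total head movement: %d cylinders\n", total);  /* 236 */
    return 0;
}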
Ch 10

Practice: Disk Scheduling

Practice: Request queue [23, 89, 132, 42, 187, 64, 157, 98], head at 100, disk 0-199. Calculate total head movement for each algorithm.

FCFS:  100→23→89→132→42→187→64→157→98 = 77+66+43+90+145+123+93+59 = 696
SSTF:  100→98→89→64→42→23→132→157→187 = 2+9+25+22+19+109+25+30 = 241
SCAN (toward 0):  100→98→89→64→42→23→0→132→157→187 = 100 (down to 0) + 187 (up to 187) = 287
C-LOOK (toward 199):  100→132→157→187 → jump → 23→42→64→89→98 = 87 (up) + 164 (jump) + 75 (up) = 326

Comparison: FCFS 696 • SSTF 241 • SCAN 287 • C-LOOK 326
Ch 10

RAID Levels

RAID 0: striping (A1 A2 A3 A4 spread across disks). No redundancy!
RAID 1: mirroring (A/A', B/B'). 50% overhead, full redundancy.
RAID 5: distributed parity (e.g. A1 B1 Cp / A2 Bp C2 / Ap B2 C1 across 3 disks). Parity rotates; the array survives the loss of 1 disk.
RAID 6: dual parity (P + Q). Tolerates 2 failures.
RAID 10 (1+0), the production favorite: mirror first, then stripe across the mirror sets (A/A', B/B', C/C').
MTTDL of a mirrored pair ≈ MTTF² / (2 × MTTR) = 100,000² / (2 × 10) hours ≈ 57,000 years.
Ch 10

Practice: RAID Capacity & Fault Tolerance

✎ Practice — You have 8 disks, each 2 TB. Raw total = 16 TB.

Calculate usable capacity and fault tolerance for each RAID level:

RAID Level | Formula        | Usable Capacity | Fault Tolerance
RAID 0     | 8 × 2 TB       | 16 TB (100%)    | None — any disk fails, all data lost
RAID 1     | (8 / 2) × 2 TB | 8 TB (50%)      | 1 failure per mirror pair
RAID 5     | (8 − 1) × 2 TB | 14 TB (87.5%)   | 1 disk failure
RAID 6     | (8 − 2) × 2 TB | 12 TB (75%)     | 2 disk failures
RAID 10    | (8 / 2) × 2 TB | 8 TB (50%)      | 1 per mirror pair (up to 4)
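The same capacity formulas as a quick C sketch; disk count and size are the values from this exercise:

#include <stdio.h>

/* Usable capacity for common RAID levels (8 disks × 2 TB). */
int main(void) {
    int n = 8;          /* number of disks   */
    double tb = 2.0;    /* capacity per disk */

    printf("RAID 0 : %.1f TB\n", n * tb);        /* striping, no redundancy */
    printf("RAID 1 : %.1f TB\n", n / 2 * tb);    /* mirroring               */
    printf("RAID 5 : %.1f TB\n", (n - 1) * tb);  /* one parity disk's worth */
    printf("RAID 6 : %.1f TB\n", (n - 2) * tb);  /* two parity disks' worth */
    printf("RAID 10: %.1f TB\n", n / 2 * tb);    /* mirror pairs, striped   */
    return 0;
}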

For a database server, which RAID? RAID 10.

Best random write performance + good redundancy

Ch 10

Swap Space Management

Physical memory (RAM) holds the pages of active processes (P1, P2, P3) but has limited capacity; pages are swapped out to swap space on disk (P4, P5, P1 pages) and swapped back in on demand, extending virtual memory beyond RAM.

Swap Space Location
• Dedicated partition: faster, no FS overhead
• Swap file: flexible size, easier to manage
• Linux: both supported, managed with swapon / swapoff
• Windows: pagefile.sys (a swap file)

How Much Swap?
• Traditional rule: 2× RAM
• Modern systems with lots of RAM: equal to RAM or less
• Too much swapping = thrashing (the system spends all its time swapping, doing no real work)
• SSD swap >> HDD swap (random access matters most)
Ch 10

NVM & Flash Storage

NAND flash architecture: the SSD controller fronts several NAND chips and includes a DRAM cache, the FTL, and a wear-leveling engine.

Write limitation: flash cannot be overwritten in place; a block must be erased first. The erase unit is a block (128-512 pages); a page is 4-16 KB, a block 256 KB - 4 MB.
FTL (Flash Translation Layer): maps logical blocks → physical pages, like a page table for the SSD.
Wear leveling: spreads writes evenly across all cells (~3000 P/E cycles for TLC).
Garbage collection: reclaims invalid pages by copying valid pages out and erasing the block; the TRIM command helps.

NVMe: PCIe-attached SSD • 64K queues × 64K depth • ~3500 MB/s read • ~500K IOPS

File-System Interface

Files • Directories • Access Methods • Protection

Ch 11

File = Named Collection on Disk

Name: report.pdf • Size: 2.4 MB • Owner: alice • Perms: rwxr-x--- • Time: Apr 14 2026 • Location: inode #4872
Operations: create • open • read • write • seek • close • delete

Open-File Table

OS caches metadata of open files in memory to avoid repeated disk lookups

System-wide open-file table:
fd | file pointer | open count | inode
 3 | offset 1024  | 2          | #4872
 5 | offset 0     | 1          | #9201
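A minimal C sketch of the file operations listed above: each open() creates an entry in these tables, and read()/lseek() move the file offset (the filename is illustrative):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void) {
    char buf[64];

    int fd = open("report.pdf", O_RDONLY);   /* adds an open-file table entry */
    if (fd < 0) { perror("open"); return 1; }

    ssize_t n = read(fd, buf, sizeof buf);   /* advances the file offset      */
    printf("read %zd bytes, offset now %lld\n",
           n, (long long)lseek(fd, 0, SEEK_CUR));

    lseek(fd, 1024, SEEK_SET);               /* "seek": jump to byte 1024     */
    close(fd);                               /* drops the open count          */
    return 0;
}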
Ch 11

Access Methods

Sequential

Read/write records one after another, in order: like a tape.

Direct (Random)

Jump straight to block n (e.g. read(4)); works with fixed-length records.

Indexed

Look up a key in an in-memory index to find the block (e.g. "foo" → block 4), then read that block.

Ch 11

Directory Structures

Single-level: all files in one directory; name conflicts are inevitable.
Two-level: a root directory with one directory per user (User1, User2); no grouping within a user.
Tree (most common): / with bin, home, etc; home holds alice and bob; supports absolute and relative paths.
Acyclic graph: a file can be reached from several directories via hard / soft links.
Ch 11

Hard Links vs Symbolic Links

Hard Link

fileA.txt and fileB.txt are two names for the same inode #4872 (link count = 2) and share the same data blocks.
+ Same inode, same data
+ Delete one name, the other still works
- Cannot cross filesystems
- Cannot link to directories

Symbolic (Soft) Link

shortcut.txt has its own inode (a symlink) whose contents are the path "/path/original.txt"; following that path leads to original.txt (inode #4872, link count = 1) and its data blocks.
+ Can cross filesystems
+ Can link to directories
- Dangling if the target is deleted
- Extra indirection (slower)
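A short C sketch of creating both link types with the POSIX calls; the filenames are the illustrative ones from this slide:

#include <stdio.h>
#include <unistd.h>

int main(void) {
    /* Hard link: a second directory entry for the same inode. */
    if (link("fileA.txt", "fileB.txt") != 0)
        perror("link");

    /* Symbolic link: a new inode whose contents are a path string. */
    if (symlink("/path/original.txt", "shortcut.txt") != 0)
        perror("symlink");

    /* Removing fileA.txt only drops the link count; fileB.txt still works.
     * Removing original.txt leaves shortcut.txt dangling. */
    return 0;
}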
Ch 11

Unix File Protection

-rwxr-x--x  1  alice  students  2.4M  report.pdf
File type: the first character (- = file, d = dir).
Permission bits: r = 4 (read), w = 2 (write), x = 1 (execute).
Owner rwx = 7 • Group r-x = 5 • Other --x = 1 → octal notation 751

chmod 751 report.pdf
chown alice:students report.pdf
Ch 11

Practice: File Permissions

Practice (r = 4, w = 2, x = 1, 0 = none)

Q1: Convert -rw-r----- to octal.
    rw- = 6, r-- = 4, --- = 0 → 640
Q2: What does chmod 755 mean in rwx?
    7 = rwx, 5 = r-x, 5 = r-x → rwxr-xr-x
Q3: File permissions are 644, owner = alice. Can user bob (group = staff, file group = staff) write?
    6 = rw- (owner) • 4 = r-- (group) • 4 = r-- (other). Bob is in group staff, so his permission is r-- (read only). Answer: no, bob cannot write.
Q4: You run chmod 4755 script.sh. What does the 4 do?
    It sets the SUID bit (Set User ID): when executed, the program runs with the file owner's privileges, not the caller's. Shown as -rwsr-xr-x (note the s in the owner execute position).
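The same octal bits seen through the POSIX API; a small sketch using chmod(2) and stat(2), with an illustrative filename:

#include <stdio.h>
#include <sys/stat.h>

int main(void) {
    if (chmod("report.pdf", 0640) != 0)            /* rw-r----- */
        perror("chmod");

    struct stat st;
    if (stat("report.pdf", &st) == 0)
        printf("mode = %o\n", st.st_mode & 07777); /* prints 640 */
    return 0;
}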
Ch 11

Mounting & File Sharing

Mount Point

mount /dev/sdb1 /mnt attaches a USB drive (FAT32, containing photos/ and docs/) into the existing tree at /mnt. Any filesystem can be attached at any directory, transparently to applications; /etc/fstab lists filesystems to auto-mount at boot.

File Sharing (NFS)

The server exports a directory; the client mounts the remote directory and accesses it over RPC.

Consistency semantics:
• Unix semantics: writes are visible immediately to all users
• Session semantics (AFS): changes become visible on close

Sharing challenges: concurrent access needs locking, user IDs differ across machines, and network failures motivate a stateless protocol (NFSv3).
Ch 11

Access Control Lists (ACL)

Traditional rwx limitation: -rwxr-x--- alice staff file.txt. What if Bob (not in staff) needs read access? With only 3 categories (owner / group / other), rwx is not fine-grained enough.

ACL solution:
$ getfacl file.txt
user::rwx
user:bob:r--
group::r-x
other::---

Feature     | Traditional rwx    | ACL
Granularity | 3 categories only  | Per-user, per-group
Complexity  | Simple             | More complex
Storage     | 9 bits in inode    | Extended attributes
Used in     | All Unix/Linux     | NTFS, ext4, macOS, NFSv4

File-System Implementation

Layered Structure • Allocation • Free Space • Journaling

Ch 12

Layered File System Architecture

Application Programs
Logical File System: metadata, directories, protection, FCB/inode
File-Organization Module: logical → physical blocks, free-space management
Basic File System: generic block I/O, buffer cache
I/O Control (Device Drivers): translates requests into hardware commands
Hardware: Disk / SSD / RAID
Ch 12

Virtual File System (VFS)

Processes call open() / read() / write() / close() through the system-call interface. The VFS (Virtual File System) layer underneath provides a uniform API, the vnode interface of filesystem-independent operations, dispatching to concrete filesystems such as ext4 (Linux default), XFS (high performance), FAT32 (USB drives), and NFS (network), which in turn sit on devices like /dev/sda (SSD), /dev/sdb (HDD), /dev/sdc (USB), or a remote server.

The same open/read/write works on any filesystem; applications never know which FS they're using.
Ch 12

Directory Implementation

Linear List

The directory file is a list of (filename, inode #) pairs: readme.md → 4201, main.c → 4205, data.csv → 4210, test.py → 4218.
+ Simple to implement
- Linear search: O(n), slow for large directories
Keeping the entries sorted allows O(log n) binary search.

Hash Table

hash("test.py") → slot 2; each slot of the hash table holds one directory entry (readme.md, main.c, test.py, data.csv, plus empty slots).
+ O(1) average lookup!
- Hash collisions require chaining
- Fixed table size is awkward
ext4 uses HTree (a B-tree combined with hashing).
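A toy hash-table directory in C to make the O(1)-average lookup concrete; it uses linear probing instead of chaining, and real filesystems (ext4's HTree) are far more elaborate:

#include <stdio.h>
#include <string.h>

#define SLOTS 8

struct dirent_ { const char *name; int inode; };
static struct dirent_ table[SLOTS];

static unsigned hash(const char *s) {
    unsigned h = 5381;
    while (*s) h = h * 33 + (unsigned char)*s++;
    return h % SLOTS;
}

static void dir_add(const char *name, int inode) {
    unsigned i = hash(name);
    while (table[i].name) i = (i + 1) % SLOTS;   /* linear probing */
    table[i].name = name; table[i].inode = inode;
}

static int dir_lookup(const char *name) {
    unsigned i = hash(name);
    while (table[i].name) {
        if (strcmp(table[i].name, name) == 0) return table[i].inode;
        i = (i + 1) % SLOTS;
    }
    return -1;                                   /* not found */
}

int main(void) {
    dir_add("readme.md", 4201); dir_add("main.c", 4205); dir_add("test.py", 4218);
    printf("test.py -> inode %d\n", dir_lookup("test.py"));
    return 0;
}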
Ch 12

Allocation: Contiguous

Each file occupies a run of consecutive blocks recorded as (start, length): File A: start=2, len=4; File B: start=8, len=3.
+ Best sequential & random performance
+ Simple metadata: just (start, length)
- External fragmentation
- File size must be known at creation
- Files can't grow easily

Fragmentation problem: after deletions the free space is scattered into holes, so a new file needing 5 blocks won't fit contiguously even when enough total space is free.
Modern solution: extents. A file is stored as one or more contiguous chunks (extents). Used by ext4, NTFS, XFS, Btrfs.
Ch 12

Allocation: Linked & FAT

Linked allocation: each block stores a pointer to the next (block 9 → 16 → 1 → nil).
+ No external fragmentation
- No random access (must traverse the chain)
- A broken pointer means data loss
FAT (File Allocation Table): the next-pointers are moved out of the data blocks into one table (entry 9 → 16, entry 16 → 1, entry 1 → EOF), so chains can be followed without touching the data blocks.

Indexed (inode): the inode holds metadata (mode, uid, size, ...) plus 12 direct block pointers, one single indirect, one double indirect, and one triple indirect pointer. Direct covers 12 blocks; single indirect ~1K blocks; double ~1M blocks; triple ~1G blocks. Max file size ≈ 4 TB with 4 KB blocks.
+ Random access
+ No external fragmentation
Small files stay fast (direct blocks); large files scale (indirect blocks).
Ch 12

Practice: inode Max File Size

Practice: block size = 4 KB, pointer size = 4 bytes. The inode has 12 direct pointers, 1 single indirect, 1 double indirect, and 1 triple indirect. Calculate the maximum file size.

Pointers per block = 4 KB / 4 B = 1024 = 1K
Direct: 12 × 4 KB = 48 KB
Single: 1024 × 4 KB = 4 MB
Double: 1024² × 4 KB = 4 GB
Triple: 1024³ × 4 KB = 4 TB
Total max = 48 KB + 4 MB + 4 GB + 4 TB ≈ 4 TB — the triple-indirect level dominates.
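The same calculation, one line per level, in C (block size and pointer size taken from the exercise):

#include <stdio.h>

int main(void) {
    unsigned long long block = 4096;        /* block size in bytes    */
    unsigned long long ptrs  = block / 4;   /* 4-byte pointers → 1024 */

    unsigned long long direct = 12ULL * block;
    unsigned long long single = ptrs * block;
    unsigned long long dbl    = ptrs * ptrs * block;
    unsigned long long triple = ptrs * ptrs * ptrs * block;

    unsigned long long total = direct + single + dbl + triple;
    printf("max file size = %llu bytes (~%.2f TB)\n",
           total, total / 1099511627776.0);  /* 2^40 bytes per TB */
    return 0;
}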
Ch 12

Free Space & Crash Recovery

Bitmap (most common)

One bit per block (0 = free, 1 = occupied); in the 16-block example, blocks 2-4 and 8-9 are occupied.
For a 1 TB disk with 4 KB blocks, the bitmap itself is 32 MB.

Also: Linked list (no waste), Grouping, Counting (start+count for contiguous runs)
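A minimal bitmap allocator sketch in C (16 blocks, first-fit search; the pre-marked blocks match the example above):

#include <stdio.h>

#define NBLOCKS 16
static unsigned char bitmap[(NBLOCKS + 7) / 8];   /* one bit per block */

static void set_used(int b)  { bitmap[b / 8] |=  (1u << (b % 8)); }
static void set_free(int b)  { bitmap[b / 8] &= ~(1u << (b % 8)); }
static int  is_used(int b)   { return bitmap[b / 8] & (1u << (b % 8)); }

/* First-fit search for a free block; returns -1 if the disk is full. */
static int alloc_block(void) {
    for (int b = 0; b < NBLOCKS; b++)
        if (!is_used(b)) { set_used(b); return b; }
    return -1;
}

int main(void) {
    set_used(2); set_used(3); set_used(4); set_used(8); set_used(9);
    int b = alloc_block();
    printf("allocated block %d\n", b);   /* 0: the first free block */
    set_free(b);
    return 0;
}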

Journaling (Write-Ahead Log)

1. Begin transaction → 2. Write the changes to the journal → 3. Commit → 4. Apply to the on-disk structures → 5. Free the log entry.
On a crash, replay the journal: recovery takes seconds instead of the hours a full fsck can take. Used by ext3/ext4, NTFS, XFS, HFS+.
Ch 12

Log-Structured File System (LFS)

Traditional FS problem: small writes cause many random seeks. Updating inode + data + bitmap + directory can take 4 seeks, and an HDD manages only ~200 random IOPS.
LFS idea: buffer all writes and flush them as one sequential log, so every write becomes sequential (~100 MB/s sequential on an HDD).
The disk becomes one big log of segments (each holding inode + data + directory updates), with new segments appended into free space.
Inode map: maps inode # → current location in the log (and is itself stored in the log!).
Garbage collection (cleaner): compacts live data and reclaims stale segments.
+ Write throughput: ~10× improvement
+ Crash recovery: just replay the log tail
- Random reads must go through the inode map
Used by: F2FS (flash), WAFL (NetApp), ZFS (copy-on-write).
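A toy append-only log with an in-memory inode map, just to make the idea concrete; real LFS segments, checkpoints, and cleaning are far more involved:

#include <stdio.h>
#include <string.h>

#define MAXFILES 16

static char log_area[4096];     /* the "disk": one big append-only log */
static int  log_end = 0;
static int  inode_map[MAXFILES];

static void lfs_write(int ino, const char *data) {
    int len = (int)strlen(data) + 1;
    memcpy(log_area + log_end, data, len);   /* sequential append        */
    inode_map[ino] = log_end;                /* point inode at new copy  */
    log_end += len;
}

static const char *lfs_read(int ino) {
    return log_area + inode_map[ino];        /* follow the inode map     */
}

int main(void) {
    lfs_write(1, "version 1 of file 1");
    lfs_write(1, "version 2 of file 1");     /* old copy becomes garbage  */
    printf("%s\n", lfs_read(1));             /* prints the latest version */
    printf("log bytes used: %d\n", log_end);
    return 0;
}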

I/O Systems

Hardware • Polling • Interrupts • DMA

Ch 13

I/O Hardware Architecture

The CPU executes I/O instructions; memory and the device controllers sit on the system bus (PCIe): a disk controller (SATA / NVMe) for HDD and SSD, a USB controller for keyboard and mouse, a network controller (NIC) for Ethernet / WiFi, and a GPU for the display.
Device registers (data-in, data-out, status, control) are accessed via I/O ports or memory-mapped I/O.
Ch 13

Polling vs Interrupts vs DMA

Polling: the CPU busy-waits, repeatedly checking the device status register, then transfers one byte and waits again. The CPU does nothing but loop and check.
+ Simple, low latency • OK for fast devices
- Wastes CPU cycles • terrible for slow devices

Interrupts: the CPU does other work until the device raises an IRQ; the handler processes the data and returns, and the CPU resumes its other work.
+ CPU efficient • good for slow devices
- Context-switch overhead • the CPU still moves each byte

DMA: the CPU sets up a DMA command and goes on with other work; the DMA controller transfers data directly between device and memory and raises an IRQ when the transfer completes.
+ CPU barely involved • best for bulk data • essential for disk & network
- Setup overhead
Ch 13

Kernel I/O Subsystem

Scheduling: per-device request queue, reordered for efficiency (disk scheduling algorithms); priority • fairness • QoS.
Buffering: copes with speed and size mismatches, provides copy semantics; double buffering.
Caching: keeps copies on faster storage, key to performance; a unified buffer cache merges the buffer cache and page cache.
Spooling: queues output for exclusive devices (e.g. the printer queue).
Error handling: retries transient failures, returns error codes, writes system error logs.
I/O protection: all I/O instructions are privileged and must go through syscalls.

Blocking I/O (process waits) • Non-blocking (returns immediately) • Async (signal on complete)
Ch 13

Life Cycle of an I/O Request

1. The user process calls read(fd, buf, n): a syscall trap into the kernel.
2. The kernel checks the buffer cache; on a cache hit it returns immediately.
3. On a miss, the device driver builds the I/O command (the process blocks here).
4. The controller performs the DMA transfer.
5. The disk / SSD seeks and reads the sectors.
6. The IRQ handler moves the data into the kernel buffer.
7. The process is woken up and the data is returned.
The process blocks at step 3 and wakes at step 7; a cache hit skips steps 3-7 entirely.
Ch 13

Blocking vs Non-blocking vs Async I/O

Blocking (synchronous): read() blocks the process until the data is ready, then execution resumes. Simple to program.
Non-blocking (returns immediately): read() returns EAGAIN when no data is available; the process does other work and must poll repeatedly. Used for UIs and network code.
Asynchronous (signal when done): aio_read() returns right away; the process works freely with no polling, and a signal or callback announces completion. Best CPU utilization; used by high-performance servers.

Blocking = simple • Non-blocking = responsive • Async = scalable
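A small non-blocking sketch in C: with O_NONBLOCK set, read() returns -1 with errno == EAGAIN instead of blocking, and the process keeps polling between bursts of other work:

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void) {
    int flags = fcntl(STDIN_FILENO, F_GETFL, 0);
    fcntl(STDIN_FILENO, F_SETFL, flags | O_NONBLOCK);

    char buf[128];
    for (int attempt = 0; attempt < 5; attempt++) {
        ssize_t n = read(STDIN_FILENO, buf, sizeof buf);
        if (n >= 0) {
            printf("got %zd bytes\n", n);
            break;
        }
        if (errno == EAGAIN || errno == EWOULDBLOCK) {
            printf("no data yet, doing other work...\n");
            sleep(1);                 /* stand-in for useful work */
        } else {
            perror("read");
            break;
        }
    }
    return 0;
}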
Ch 13

STREAMS Architecture

A full-duplex communication channel (System V UNIX): user process ↔ stream head ↔ module A (e.g. a protocol) ↔ module B (e.g. a filter) ↔ device driver end ↔ hardware device. Writes flow downstream, reads flow upstream.

Key concepts: messages flow through a pipeline; each module has a read queue and a write queue; modules are stackable at runtime.
Advantages: modular (add/remove processing layers) and reusable modules across drivers.
Example use: networking, where an IP module feeds a TCP module feeding the NIC driver.
Linux uses a different approach (the socket layer), not STREAMS.
Ch 13

Improving I/O Performance

I/O is the major bottleneck in most systems.

Reduce CPU load: use DMA over polling, offload to smart controllers, reduce interrupt frequency (coalesced interrupts), NIC offload (checksum, TCP).
Reduce copies: a traditional read() goes device → kernel buffer → user buffer; zero-copy skips the user buffer (device → kernel buffer → socket) via sendfile(), splice(), or mmap().
Smart scheduling: buffer cache / page cache, read-ahead (prefetch), I/O scheduler reordering, async I/O (io_uring), overlap compute with I/O.

Device performance spectrum: Keyboard 10 B/s • WiFi 100 MB/s • SATA SSD 550 MB/s • NVMe SSD 3,500 MB/s • PCIe 5 SSD 12,000 MB/s • RAM 50 GB/s
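A zero-copy sketch using Linux sendfile(): the kernel moves data from one descriptor to another with no round trip through a user buffer. Here it just copies a regular file for simplicity; a server would pass a socket as the destination (filenames are illustrative):

#include <fcntl.h>
#include <stdio.h>
#include <sys/sendfile.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void) {
    int in_fd  = open("input.dat", O_RDONLY);
    int out_fd = open("copy.dat", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (in_fd < 0 || out_fd < 0) { perror("open"); return 1; }

    struct stat st;
    fstat(in_fd, &st);

    off_t offset = 0;
    ssize_t sent = sendfile(out_fd, in_fd, &offset, st.st_size);
    printf("copied %zd bytes without a user-space buffer\n", sent);

    close(in_fd);
    close(out_fd);
    return 0;
}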

Part 4: Key Takeaways

Ch 10: Mass Storage
• HDD: seek + rotation = 99% of access time
• SSD: 1000× faster random I/O, no scheduling needed
• Scheduling: SSTF/LOOK (light load), SCAN (heavy load)
• RAID 10: best for production databases
• MTTDL dramatically improved by mirroring

Ch 11: FS Interface
• File = name + metadata + data blocks
• Access: sequential, direct, indexed
• Directories: tree structure + links
• Protection: rwx × owner/group/other
• NFS for remote file sharing

Ch 12: FS Implementation
• Layered: app → logical FS → basic FS → I/O
• Allocation: contiguous vs linked vs indexed (inode)
• Free space: bitmap most common
• Journaling = fast crash recovery (seconds)
• VFS unifies multiple FS types

Ch 13: I/O Systems
• Polling: simple but wastes CPU
• Interrupts: efficient, CPU free until signal
• DMA: essential for bulk data transfer
• All I/O instructions are privileged
• Kernel: scheduling, buffering, caching, spooling

Thank You!

Part 4: Storage Management

Chapters 10–13 • Operating System Concepts, 9th Edition