Professor: Bernard Wong | Term: Winter 2025

Lecture 1

Question

What is an operating system?

  • Layer between applications and hardware (gcc <> OS <> Hardware)
  • Makes hardware useful to the programmer
  • Usually provides abstractions for applications
    • Manages and hides details of hardware
    • Accesses hardware through low-level interfaces unavailable to applications
  • Often: Provides protection
    • Prevents one process/user from clobbering another

Primitive Operating systems are just a library of standard services (no protection).

Simplifying assumptions:

  • This system only runs one program at a time (hence no need for protection).
  • There are no bad users or programs (this is a bad assumption)

Problem: Poor utilization

  • of hardware
  • of human user

Multitasking: We can run more than one process at once. When one process blocks, run another process.

Question

What can an ill-behaved process do?

Go into infinite loop and never relinquish CPU. Scribble over other processes’ memory to make them fail

OS provides mechanisms to address these problems

  • Preemption (take CPU away from looping process)
  • Memory protection (Protect processes’ memory from one another)

Multi-user OSes

Many OSes use protection to serve distrustful users/apps

Idea: With n users, the system is not n times slower

  • User demand for CPU is bursty

Question

What can go wrong?

  • Users are gluttons, use too much CPU (need policies)
  • Total memory usage greater than the memory in the machine (must virtualize)
  • Super-linear slowdown with increasing demand (thrashing)

Protection

  • Mechanisms that isolate bad programs and people
  • Pre-emption:
    • Give application a resource, take it away if needed elsewhere
  • Interposition/mediation:
    • Place OS between application and “stuff”
    • Track all resources that the application is allowed to use
    • On every access, look in table to check that access is legal
  • Privileged & Unprivileged modes in CPUs
    • Applications unprivileged
    • OS privileged
    • Protection operations can only be done in privileged mode

Most software runs as user-level processes

OS kernel runs in privileged mode

  • Creates/deletes processes
  • Provides access to hardware

Question

So how does user “talk” to kernel?

OS provides a system call interface to safely communicate with kernel.

Applications can invoke kernel through system calls

  • Special instruction transfers control to kernel
  • Which dispatches to one of few hundred syscall handlers

Lecture 2

System calls: Do things the application can’t do in unprivileged mode.

Kernel supplies well-defined system call interface.

  • Applications set up syscall arguments and trap to kernel.
  • Kernel performs operation and returns result

Higher-level functions are built on the syscall interface (e.g., printf, scanf)
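For instance (an illustrative sketch, not from the lecture), a printf call is implemented in the C library and eventually issues the write system call on file descriptor 1:

#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void) {
	const char *msg = "hello via the write syscall\n";
	printf("hello via printf\n");   /* library call: formats, buffers, then calls write() */
	write(1, msg, strlen(msg));     /* direct system call wrapper; fd 1 is stdout */
	return 0;
}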

Processes, threads, locks, file i/o are some examples of abstractions from the operating system.

A process is an instance of a program running.

Modern operating systems can run multiple processes simultaneously.

Multiple processes can increase CPU utilization and reduce latency.

A takes 80 seconds, B takes 20 seconds.

Running A then B requires 100 seconds for B to complete, so completion time is (80 + 100) / 2 = 90 seconds on average.

If we can run A and B concurrently, we can make B finish faster.

A runs, then B runs for 10 seconds, then A runs, then B runs for 10 seconds, then B completes.

A is slower than if it had the whole machine to itself, but B finishes much earlier, so the average completion time drops below 90 seconds.

Parallelism is a fact of life. Most computers, laptops and phones are multi-core. Computer with 4 cores can run 4 processes in parallel. Ideally we get 4 times the throughput.

Each process has own view of machine. They have their own address space, own open files, own virtual CPU.

*(char *)0xc000 is different in each process

Physical Memory

  • stack: grows down
  • heap: grows up
  • data: global variables
  • text: contains actual code

Every process has something that looks like this. This simplifies the programming model. gcc does not care that firefox is running.

int fork(void);

Create new process that is an exact copy of the current one. Returns the process ID of the new process in the “parent”. Returns 0 in the child. (Ugly fork questions usually appear on the midterm).

int waitpid(int pid, int *stat, int opt);

  • Wait for a child process to terminate
  • pid - process to wait for
  • stat - will contain exit value, or signal from child
  • opt - usually 0 or WNOHANG
  • Returns the child’s process ID (a usage sketch follows below)
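A minimal sketch of combining fork and waitpid (illustrative, not from the notes):

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/wait.h>

int main(void) {
	int pid = fork();
	if (pid == 0) {
		/* child: fork() returned 0 */
		printf("child %d running\n", getpid());
		exit(7);                    /* exit status is reported to the parent */
	} else if (pid > 0) {
		/* parent: fork() returned the child's process ID */
		int stat;
		waitpid(pid, &stat, 0);     /* block until the child terminates */
		if (WIFEXITED(stat))
			printf("child exited with status %d\n", WEXITSTATUS(stat));
	} else {
		perror("fork");
	}
	return 0;
}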

Lecture 3

We run processes in unprivileged mode. Within our processes we have stacks.

We call close() in the userspace system call library. We then load the call parameters, set the syscall number, and execute the syscall instruction. We then switch to the kernel stack.

We then save the complete processor state into a trap frame.

We check whether this is an exception, interrupt, or system call (handled by mips_trap)

Once syscall returns, it modifies register values stored in the trap_frame.

if (err) {
	tf->tf_v0 = err;
	tf->tf_a3 = 1; // error
} else {
// Success
	tf->tf_v0 = retval;
	tf->tf_a3 = 0; // no error
}
 
// Advance the PC, to avoid restarting the syscall
tf->tf_epc += 4;

void exit(int status)

Question

What output will be generated by the parent and child processes for the program shown below?

int x;
int main() {
	int rc;
	x = 0;
	rc = fork();
	if (rc == 0) {
		x = 10;
		printf("A: %d\n", x);
	} else {
		printf("B: %d\n", x);
		x = 100;
	}
	printf("C: %d\n");
}

Parent

x=0
rc = pid of child
"B: 0"
x = 100
"C: 100"

Child

rc = 0
x = 10
"A: 10"
"C: 10"

Question

Consider a system with a process, ProcA, that executes the program below. Draw the tree rooted at ProcA representing the processes created by this program. Every node in the tree should represent a process, and there should be an edge from node x to node y if process x creates process y. Label each node of the tree with the label of the fork() system call that created it.

int main() {
	fork(); // Label B
	fork(); // Label C
	fork(); // Label D
	return 0;
}
flowchart TD
id1((A)) --> id2((B))
id2((B)) --> id5((C))
id5((C)) --> id7((D))
id2((B)) --> id6((D))
id1((A)) --> id3((C))
id3((C)) --> id8((D))
id1((A)) --> id4((D))
int main(int argc, char *argv[]) {
	int status; 
	int x = 1;
	printf("%d\n", x);
	status = fork();
	if (status == 0) {
		x += 1;
	} else { 
		 x *= 2;
		 waitpid(status, NULL, 0);
	}
	printf("%d\n", x);
 
	status = fork();
	if (status == 0) {
		x *= 2;
	} else {
		x *= 3;
	}
	printf("%d\n", x);
}
 

Parent

x = 1
"1"
spawn first child
x = 2
waitpid - wait for the first child to complete
"2"
spawn second child
x = 6
"6"

First child

x = 2
"2"
spawn child of child
x = 6
"6"
exit; this unblocks the parent's waitpid, which continues from that point on

Child of child

x = 4
"4"

Second child

x = 4
"4"

Lecture 4

Sketch of previous example.

One possible output: 1, 2, 4, 6, 2, 4, 6

One not possible output: 1, 6, 2, 4

We don’t expect the parent to print "2" before the first child (the one it waits on via waitpid) has printed and exited.

Another possible output: 1, 2, 6, 2, 6, 4, 4 (last 4 is grandchild)

Another possible output: 1, 2, 6, 4, 2, 6, 4

{c} void exit(int status)

  • Current process ceases to exist
  • status shows up in waitpid
  • By convention, a status of 0 is success, non-zero is an error

{c} int kill(int pid, int sig)

  • Sends signal sig to process pid
  • SIGTERM most common value, kills process by default
  • SIGKILL stronger, kills process always

{c} int execve(char *prog, char **argv, char **envp);

  • Execute a new program
  • prog - a full pathname of program to run
  • argv - argument vector that gets passed to main
  • envp environment variables

It is generally called through wrapper functions.

  • {c} int execvp(char *prog, char **argv); Search PATH for prog, use current environment
  • {c} int execlp(char *prog, char *arg, ...); List arguments one at a time, finish with NULL.
pid_t pid; char **av;

void doexec() {
	execvp(av[0], av);
	perror(av[0]);
	exit(1);
}

/* ... main loop: */
for (;;) {
	parse_input(&av, stdin);
	switch (pid = fork()) {
	case -1:
		perror("fork"); break;
	case 0:
		doexec();
	default:
		waitpid(pid, NULL, 0); break;
	}
}

File descriptors:

Each process has its own file descriptor table. What is stored inside is a pointer to the global open file table (with an offset) which has information necessary for the operating system to operate on the specific file.

Special file descriptors

  • 0: STDIN
  • 1: STDOUT
  • 2: STDERR

{bash} cat file.txt | wc -l

  • STDOUT of {bash}cat is sent to the STDIN of {bash}wc

{c} int dup2(int oldfd, int newfd);

  • Closes {c}newfd, if it was a valid descriptor
  • Makes {c}newfd an exact copy of {c}oldfd
  • Two file descriptors will share the same offset

Example: redirsh.c

  • Loop that reads a command and executes it
void doexec (void) {
	int fd;
	if (infile) { /* non-NULL for "command < infile" */
		if ((fd = open(infile, O_RDONLY)) < 0) {
			perror(infile);
			exit(1);
		}
		if (fd != 0) {
			dup2(fd, 0);
			close(fd);
		}
	}
	
	/* ... do same for outfile→fd 1, errfile→fd 2 ... */
	execvp (av[0], av);
	perror (av[0]);
	exit (1); 
}

Pipes

{c} int pipe(int fds[2]);

  • Returns two file descriptors in {c} fds[0] and {c} fds[1]
  • Writes to {c} fds[1] will be read on {c} fds[0]
  • When last copy of {c} fds[1] is closed, {c} fds[0] will return EOF
  • Returns 0 on success, -1 on error

Operations on pipes:

  • read/write/close - as with files
  • When {c} fds[1] closed, {c} read(fds[0]) returns 0 bytes
  • When {c} fds[0] closed, {c} write(fds[1]):
    • Kills process with SIGPIPE
    • Or if signal ignored, fails with EPIPE

This is useful when forking. The parent can write to the write end and the child can later read from the read end. This is only possible because they share the same entry in the global open file table.
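A minimal sketch of that pattern, with error handling omitted (illustrative, not the course's code):

#include <stdio.h>
#include <unistd.h>
#include <sys/wait.h>

int main(void) {
	int fds[2];
	pipe(fds);                      /* fds[0] = read end, fds[1] = write end */
	if (fork() == 0) {
		/* child: close the write end and read what the parent sent */
		close(fds[1]);
		char buf[64];
		ssize_t n = read(fds[0], buf, sizeof(buf) - 1);
		buf[n > 0 ? n : 0] = '\0';
		printf("child read: %s\n", buf);
		close(fds[0]);
	} else {
		/* parent: close the read end, write into the pipe, wait for the child */
		close(fds[0]);
		write(fds[1], "hello", 5);
		close(fds[1]);              /* after this, the child's next read() returns EOF */
		wait(NULL);
	}
	return 0;
}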

Example

pipesh.c: Sets up pipeline command1 | command2 | command 3 ...

Lecture 5

Pipes aren’t very useful in just one process. Become useful once we fork and copy the file descriptors to the children.

void doexec(void) {
	while (outcmd) {
		int pipefds[2]; pipe(pipefds);
		switch (fork()) {
		case -1:
			perror("fork"); exit(1);
		case 0: // child runs here
			dup2(pipefds[1], 1); // write end of pipe replaces stdout
			// command output goes to the write side of the pipe instead of stdout
			// reference count is 3
			close(pipefds[0]); close(pipefds[1]);
			// reference count is 1
			outcmd = NULL;
			break;
		default:
			dup2(pipefds[0], 0); // replace stdin with the read side of the pipe;
			// fd 0 now points to where pipefds[0] used to point
			close(pipefds[0]); close(pipefds[1]);
			parse_input(&av, &outcmd, outcmd);
			break;
		}
	}
}

The child’s stdout is sent to the parent’s stdin.

Question

Why fork?

Most calls to fork followed by execve

Occasionally useful to fork one process

  • Creates one process per core to serve clients

Simplicity of interface

  • fork requires no arguments at all

Kernel view of process

  • OS keeps a data structure for each process
    • Process Control Block (PCB)
PCB contents:

  • Process state
  • Process ID
  • User ID
  • Program counter
  • Registers
  • Address space
  • Open files
  • Tracks state of process
    • Running, ready, blocked
  • Includes information necessary to run
    • Registers, vm mappings
    • Open files
  • Other data about the process
    • Credentials, controlling terminal, etc.
flowchart LR
new --> ready
ready --> running
waiting --> ready
running --> terminated
running --> waiting
running --> ready

Process can be in one of several states

  • New and terminated at beginning and end of the process’s life
  • Running - Currently executing
  • Ready - Can run, but kernel has chosen a different process to run
  • Waiting - Needs async event to proceed
  • Preemption (e.g., triggered by a timer interrupt) causes the transition from running to ready

Question

Which process should kernel run?

This is a scheduling problem.

Question

Scan process table for first runnable?

This is expensive with weird priorities.

Question

FIFO/Round-Robin?

When a process is ready to run, it goes at the end of a ready queue. Take the top off once the CPU has space. Runs until it no longer can.

FIFO has no preemption. Round-robin has preemption: a process can only run for a certain amount of time (a scheduling quantum), and gets added to the back of the queue once its quantum is used up.

Can preempt a process when kernel gets control.

A running process can vector control to the kernel, which may put the current process to sleep or make another process runnable.

A periodic timer interrupt lets the kernel schedule another process if the running process has used up its quantum.

Changing running process is called a context switch.

Lecture 6

Once a timer interrupt happens, we need a trap frame to save a snapshot of the current state on the kernel stack (to make sure we continue from the right spot).

We call mips_trap() (which determines that an interrupt happened), then go to the general interrupt handler, and then to the timer interrupt handler.

We then create a switch frame after further processing following mips_trap(); it stores the state of the process at that point. It is similar to a trap frame but not the same. There can only be one switch frame per kernel stack, but there can be multiple trap frames.

The Kernel stack for process 2 ended with a switch frame (we use this to continue from where we left off). Most common use of trap frame is from transition from user space to kernel (but not always). The second kernel stack “pops off” (because there is nothing left to do) to allow us to go back to user space so the thread in process 2 resumes. Process 2 keeps running (until it gets another interrupt).

If we kept getting interrupts we’d get too many trap frames. We want the interrupts to be off as long as we are handling them. The moment the interrupt handler is done running, we can turn them back on.

Context switching is very machine dependent. But usually includes:

  • Save program counter and integer registers
  • Save floating point or other special registers
  • Save condition codes
  • Change virtual address translations

Non-negligible cost

  • Save/Restore floating point registers
    • Optimization: only save if process used floating point
  • May require flushing TLB
    • Hardware Optimization 1: Don’t flush kernel’s own data from TLB
    • Hardware Optimization 2: Use tag to avoid flushing any data
  • Usually causes more cache misses

Lecture 7

argstart = DMPA2VA(PMap_Translate(...))

argstart is a kernel address.

The parameter of PMap_Translate is a user space address, and PMap_Translate translates it to a physical address.

DMPA2VA: Direct Mapped Physical Address to Virtual Address.

Threads

For each process we have code, data, files, registers, stack. A process has a single sequential flow (one thread).

A thread is a schedulable execution context.

Multithreaded processes share the address space and code, data, files. But they do not share registers and stack (execution context).

Question

Why threads?

Most popular abstraction for concurrency. Lighter-weight abstraction than processes. All threads in one process share memory, file descriptors etc.

Allows one process to use multiple CPUs or cores.

Allows program to overlap I/O and computation

  • Same benefit as OS running emacs and gcc simultaneously

Example

Threaded web server services clients simultaneously

Thread creation is cheaper than process creation.

Most kernels have threads.

Example

In Google Chrome, each tab is a separate process (not thread), allowing for isolated environments.

{cpp} int pthread_create(pthread_t *thr, pthread_attr_t *attr, void *(*fn) (void *), void *arg);

  • Creates a new thread identified by thr with optional attributes, run fn with arg

{cpp} void pthread_exit(void *return_value);

  • Destroy current thread and return a pointer

{cpp} int pthread_join(pthread_t thread, void **return_value);

  • Wait for thread to exit and receive the return value. Thread version of waitpid

{cpp} void pthread_yield();

  • Tell the OS scheduler to run another thread or process

Plus lots of support for synchronization.
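A minimal sketch of creating and joining a thread with this API (compile with -pthread; illustrative):

#include <stdio.h>
#include <pthread.h>

/* thread body: receives the void * argument passed to pthread_create */
static void *worker(void *arg) {
	int id = *(int *)arg;
	printf("thread %d running\n", id);
	return (void *)(long)(id * 10);        /* value handed back through pthread_join */
}

int main(void) {
	pthread_t thr;
	int id = 1;
	void *ret;
	pthread_create(&thr, NULL, worker, &id);   /* NULL = default attributes */
	pthread_join(thr, &ret);                   /* wait for the thread to exit */
	printf("thread returned %ld\n", (long)ret);
	return 0;
}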

One-to-one model: kernel threads are one-to-one with user threads. Every thread has a user space component and a kernel component. The same user thread is always mapped to the same kernel thread.

Can implement pthread_create as a system call.

To add pthread_create to an OS:

  • Start with process abstraction in kernel
  • pthread_create like process creation with features stripped out
    • Keep same address space, file table, etc., in new process
    • rfork/clone syscalls actually allow individual control (creates kernel thread)

This is faster than a process, but still heavy. We still need syscalls, these are expensive.

Limitations of kernel-level threads

Every thread operation must go through kernel

  • Syscall takes 100 cycles, function call 2 cycles
  • Result: Threads slower when implemented in kernel

One-size fits all thread implementation

  • Kernel threads must please all ‘people’

General heavy-weight memory requirements

  • Other data structures designed for heavier-weight processes

Many-to-one model: One kernel thread per process, meaning multiple user threads per kernel thread.

Implementing user-level threads

  • Allocate a new stack for each pthread_create
  • Keep a queue of runnable threads
  • Replace blocking system calls (read/write)
    • If operation would block, switch and run different thread
  • Schedule periodic timer signal (setitimer)
    • Switch to another thread on timer signals
  • Multi-threaded web server
    • Thread calls read to get data from remote web browser

Lecture 8

Limitations of user-level threads

  • Can’t take advantage of multiple CPUs or cores
  • Blocking system call blocks all threads
  • A page fault blocks all threads
  • Possible deadlock if one thread blocks another

The many-to-one model isn’t entirely pointless. It is useful when a program is naturally multi-threaded: we don’t need a lot of cores but still need many threads. With one-to-one this results in many context switches.

User threads implemented on kernel threads

  • Multiple kernel-level threads per process
  • thread_create, thread_exit still library functions as before

Sometimes called n:m threading

  • Have n user threads per m kernel threads

Many of the same problems as user-level (n:1) threads

  • Blocked threads, deadlock

Hard to keep same number of kernel threads as available CPUs

  • Kernel knows how many CPUs available
  • Kernel knows which kernel-level threads are blocked
  • Tries to hide these things from applications for transparency

Kernel doesn’t know relative importance of threads

  • Might preempt kernel thread in which library holds important lock

Lesson

  • Threads are best implemented as a library
  • Shouldn’t be dissuaded from using threads

The part of a concurrent program in which a shared object is accessed is called a critical section

Question

What happens if several threads try to access the same global variable or heap object at the same time?

int total = 0;
void add() {
	for (int i = 0; i < N; i++) {
		total++;	
	}
}
 
void sub() {
	for (int i = 0; i < N; i++) {
		total--;
	}
}
int total = 0;
void add() {
	for (int i = 0; i < N; i++) {
		// total++
	lw r9, 0(r8)
	add r9, 1
	sw r9, 0(r8)
	}
}
 
void sub() {
	for (int i = 0; i < N; i++) {
		//total--;
		lw r9, 0(r8)
		sub r9, 1
		sw r9, 0(r8)
	}
}

Schedule 1:

Thread 1 runs all of add() first, then thread 2 runs all of sub(). Increment completed then decrement total = 0

Schedule 2:

Commands from add() and sub() execute back and forth.

Thread 1          Thread 2
lw r9, 0(r8)
                  lw r9, 0(r8)
add r9, 1
                  sub r9, 1
sw r9, 0(r8)
                  sw r9, 0(r8)

Each thread has its own context, so these are different r9’s. Thread 1 writes 1, then thread 2 writes -1. The final value of total is -1.

Schedule 3: (executing concurrently more than one core)

Thread 1          Thread 2
lw r9, 0(r8)      lw r9, 0(r8)
add r9, 1         sub r9, 1
                  sw r9, 0(r8)
sw r9, 0(r8)

Thread 1’s result is kept, overwriting the result from thread 2, so total = 1.

These are race conditions.

To prevent race conditions, we can enforce mutual exclusion on critical sections in the code. The critical section in the code above is total++ and total--.

First solution attempt

int total = 0;
bool lock = false;
void add() {
	for (int i = 0; i < N; i++) {
		while (lock == true);
		lock = true;
 
		total++;
		lock = false;
	}
}
 
void sub() {
	for (int i = 0; i < N; i++) {
		while (lock == true);
		lock = true;
	
		total--;
		lock = false;
	}
}

This does not work. A context switch can occur between testing whether the lock is available and setting the lock, so both threads can end up inside the critical section at the same time.

Lecture 9

Desired properties of solution

  • Mutual exclusion
  • Progress (if nothing is in the critical section, process trying to enter will get in)
  • Bounded waiting
  • Note progress vs. bounded waiting

Peterson’s solution

// Only two threads, t is current thread id (0 or 1)
 
int n_turn = 0;
bool w[2] = {false,false};
int total = 0;
 
void add() {
	for (int i = 0; i < N; i++) {
		w[t] = true;
		n_turn = t;
		while(w[1-t] && n_turn == t);
		total++;
		w[t] = false;
	}
}
 
// sub is the same except total--

Question

Does Peterson’s solution work?

  • Mutual exclusion - can’t both be in critical section
  • Progress - If not in critical section, can’t block
  • Bounded waiting - similar argument to progress

Peterson’s solution is expensive and only works for 2 processes. It can be generalized to n processes, but only for some fixed n.

The implementation requires Sequential Consistency.

Program A

int flag1 = 0, flag2 = 0;
 
void p1(void *ignored) {
	flag1 = 1;
	if (!flag2) {
		critical_section_1();
	}
}
 
void p2(void *ignored) {
	flag2 = 1;
	if (!flag1) {
		critical_section_2();
	}
}
 
int main() {
	pthread_t tid;
	pthread_create(&tid, NULL, p1, NULL);
	p2();
	pthread_join(tid, NULL);
}

Question

Can both critical sections run?

Program B

int data = 0, ready = 0;
 
void p1(void *ignored) {
	data = 2000;
	ready = 1;
}
 
void p2(void *ignored) {
	while (!ready)
		;
	use(data);
}

Question

Can use be called with value 0?

Program C

int a = 0, b = 0;
 
void p1(void *ignored) {
	a = 1;
}
 
void p2(void *ignored) {
	if (a == 1) {
		b = 1;
	}
}
 
void p3(void *ignored) {
	if(b == 1) {
		use(a);
	}
}

Question

Can use() be called with value 0?

The answer for all above is “We don’t know”.

It depends on the machine that we use.

Definition

Sequential Consistency: The result of execution is as if all operations were executed in some sequential order, and the operations of each processor occurred in the order specified by the program.

Boils down to:

  1. Maintaining program order on individual processors
  2. Ensuring write atomicity

Without Sequential Consistency, multiple CPUs can be worse than preemptive threads.

Not all hardware supports sequential consistency because:

  • Can’t re-order overlapping write operations
  • Reduces opportunities for data prefetching
  • Makes cache coherence more expensive

Overall it thwarts hardware optimizations

x86 supports multiple consistency/caching models.

Choices include:

  • WB: Write-back caching
  • WT: Write-through caching
  • UC: Uncacheable
  • WC: Write-combining

x86 atomicity

lock - prefix makes a memory instruction atomic

  • Locks bus for duration of instruction (expensive)
  • Can avoid locking if memory already exclusively cached
  • All lock instructions totally ordered
  • Other memory instructions cannot be re-ordered with locked ones

xchg - exchange instruction is always locked

cmpxchg - compare and exchange is also locked

Special fence instructions can prevent re-ordering

Lecture 10

We often reason about concurrent code assuming sequential consistency, but for low-level code it’s important to know the memory model.

Lock acquire and release with xchg

Acquire(bool *lock) {
	while (Xchg(true, lock) == true);
}
Release(bool *lock) {
	*lock = false; // give up lock
}

If xchg returns true, the lock was already set, and we must continue to loop. If it returns false, the lock was free and we have now acquired it. This is known as a spin lock.

This implementation is almost correct without sequential consistency.

Can be fixed by adding a fence or replacing *lock = false with xchg(false, lock).
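One portable way to express the fixed version is with C11 atomics, where acquire/release orderings play the role of the fence; a sketch (assuming a compiler that provides <stdatomic.h>):

#include <stdatomic.h>
#include <stdbool.h>

typedef struct { atomic_bool held; } spinlock_t;

void spin_init(spinlock_t *l) {
	atomic_init(&l->held, false);       /* lock starts out free */
}

void spin_acquire(spinlock_t *l) {
	/* atomic exchange: store true, return the old value */
	while (atomic_exchange_explicit(&l->held, true, memory_order_acquire))
		;                               /* spin until the old value was false */
}

void spin_release(spinlock_t *l) {
	/* release ordering: earlier writes become visible before the lock appears free */
	atomic_store_explicit(&l->held, false, memory_order_release);
}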

In addition to spin locks, most operating systems also provide blocking locks (such a lock is commonly known as a mutex).

Locks are used to enforce mutual exclusion

mutex_lock(lock);
// --- critical section ---
mutex_unlock(lock);

Spinlocks spin, locks block:

  • A thread that calls spinlock_acquire spins until the lock can be acquired
  • A thread that calls mutex_lock blocks until the lock can be acquired.

Sometimes a thread will need to wait for something else

  • Lock to be released by another thread
  • Data from a slow device
  • Input from a keyboard

When a thread blocks, it stops running:

  • The scheduler chooses a new thread to run
  • A context switch from the blocking thread to the new thread occurs

Eventually, a blocked thread is signaled and awakened by another thread

Wait channels are used to implement thread blocking

  • Manages a list of sleeping threads
  • Abstracts details of the thread scheduler

void WaitChannel_Lock(WaitChannel *wc)

  • Lock wait channel operations
  • Prevents a race between sleep and wake

void WaitChannel_Sleep(WaitChannel *wc)

  • Blocks calling thread on wait channel wc

void WaitChannel_WakeAll(WaitChannel *wc)

  • Unblocks all threads sleeping on the wait channel

void WaitChannel_Wake(WaitChannel *wc)

  • Unblocks one thread sleeping on the wait channel

Function names in this lecture all based on pthreads

int pthread_mutex_init(pthread_mutex_t *m, const pthread_mutexattr_t *attr)

  • Initialize a mutex

int pthread_mutex_destroy(pthread_mutex_t *m)

  • Destroy a mutex

int pthread_mutex_lock(pthread_mutex_t *m)

  • Acquire a mutex

int pthread_mutex_unlock(pthread_mutex_t *m)

  • Release a mutex

int pthread_mutex_trylock(pthread_mutex_t *m)

  • Attempt to acquire a mutex

All global data should be protected by a mutex; “global” means accessed by more than one thread, with at least one writer. The exception is initialization, before the data is exposed to other threads. This is the responsibility of the application writer.
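For example, the add/sub race from Lecture 8 disappears if every access to the shared counter happens with a mutex held; a minimal pthreads sketch (adapted for illustration, not the course's code):

#include <stdio.h>
#include <pthread.h>

#define N 1000000

static int total = 0;
static pthread_mutex_t total_mutex = PTHREAD_MUTEX_INITIALIZER;

static void *add(void *ignored) {
	for (int i = 0; i < N; i++) {
		pthread_mutex_lock(&total_mutex);     /* protect the shared counter */
		total++;
		pthread_mutex_unlock(&total_mutex);
	}
	return NULL;
}

static void *sub(void *ignored) {
	for (int i = 0; i < N; i++) {
		pthread_mutex_lock(&total_mutex);
		total--;
		pthread_mutex_unlock(&total_mutex);
	}
	return NULL;
}

int main(void) {
	pthread_t t1, t2;
	pthread_create(&t1, NULL, add, NULL);
	pthread_create(&t2, NULL, sub, NULL);
	pthread_join(t1, NULL);
	pthread_join(t2, NULL);
	printf("total = %d\n", total);   /* always 0 when every access holds the mutex */
	return 0;
}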

Compiler/Runtime Contract (C, Java, Go)

Assuming no data races, the program behaves sequentially consistent.

If using mutexes properly, behaviour should be indistinguishable from Sequential Consistency.

  • Responsibility of the threads package and compiler
  • The mutex implementation is broken if you use it properly and don’t see sequential consistency.

OS kernels also need synchronization. Some mechanisms look like mutexes. Interrupts complicate things.

Lecture 11

We don’t always want to use mutex in the kernel.

Producer/Consumer model. Breaking up large amounts of work.

void producer (void *ignored) {
	for (;;) {
		item *nextProduced = produce_item();
		while (count == BUFFER_SIZE)
			; // do nothing (spin) - wait for consumer
		buffer[in] = nextProduced;
		in = (in + 1) % BUFFER_SIZE;
		count++;
	}
}
 
void consumer (void *ignored) {
	for (;;) {
		while (count == 0)
			; // do nothing (spin) - nothing to consume
		item *nextConsumed = buffer[out];
		out = (out + 1) % BUFFER_SIZE;
		count--;
		consume_item(nextConsumed);
	}
}

This is incorrect. count is a global variable, we need to provide mutual exclusion.

The entire loop body is the critical section. Lock acquire above the while, lock release after the count++/count--. This still doesn’t solve it: if we start with the consumer and count == 0, it holds the lock and spins; the producer can never acquire the lock, so the consumer keeps spinning forever.

Locking after the while loop still does not solve the problem.

There exists a single-instruction add. i386 allows addl $1,_count. We can implement count++/-- with one instruction.

This is not atomic on multiprocessor. We will experience the exact same race condition. Can make atomic with lock prefix, but lock is expensive.

Need a solution to critical section problem.

Improved producer

mutex_t mutex = MUTEX_INITIALIZER; 
void producer (void *ignored) { 
	for (;;) { 
		item *nextProduced = produce_item (); 
		mutex_lock (&mutex); 
		while (count == BUFFER_SIZE) { 
			mutex_unlock (&mutex); /* <--- Why? */
			thread_yield (); 
			mutex_lock (&mutex); 
		} 
		buffer [in] = nextProduced; 
		in = (in + 1) % BUFFER_SIZE; 
		count++; 
		mutex_unlock (&mutex); 
	} 
}

Improved consumer

void consumer (void *ignored) { 
	for (;;) { 
		mutex_lock (&mutex); 
		while (count == 0) { 
			mutex_unlock (&mutex); /* <--- Why? */
			thread_yield (); 
			mutex_lock (&mutex); 
		} 
		item *nextConsumed = buffer[out];
		out = (out + 1) % BUFFER_SIZE;
		count--;
		mutex_unlock(&mutex);
		consume_item(nextConsumed);
	} 
}

This works, but its slow. We are spinning.

Busy-waiting in application is a bad idea. It’s better to inform scheduler of which threads can run.

This is done with condition variables.

int pthread_cond_init(pthread_cond_t *, ...)

int pthread_cond_wait(pthread_cond_t *c, pthread_mutex_t *m);

  • Atomically unlock m and sleep until c is signaled
  • Then re-acquire m and resume executing

int pthread_cond_signal(pthread_cond_t *c);

  • Signal wakes up one thread (usually the oldest)

int pthread_cond_broadcast(pthread_cond_t *c);

  • Broadcast wakes up all of them

Improved producer

mutex_t mutex = MUTEX_INITIALIZER; 
cond_t nonempty = COND_INITIALIZER;
cond_t nonfull = COND_INITIALIZER;
 
void producer (void *ignored) { 
	for (;;) { 
		item *nextProduced = produce_item (); 
		mutex_lock (&mutex); 
		while (count == BUFFER_SIZE) { 
			cond_wait(&nonfull, &mutex);
		} 
		buffer [in] = nextProduced; 
		in = (in + 1) % BUFFER_SIZE; 
		count++; 
		cond_signal(&nonempty); // wake up threads waiting
		mutex_unlock (&mutex); 
	} 
}

Improved consumer

void consumer (void *ignored) { 
	for (;;) { 
		mutex_lock (&mutex); 
		while (count == 0) { 
			cond_wait(&nonempty, &mutex);
		} 
		item *nextConsumed = buffer[out];
		out = (out + 1) % BUFFER_SIZE;
		count--;
		cond_signal(&nonfull);
		mutex_unlock(&mutex);
		consume_item(nextConsumed);
	} 
}

Always re-check condition on wake-up

Otherwise it breaks with spurious wakeup or two consumers.

Question

Why must cond_wait both release mutex and sleep? Why not separate mutexes and condition variables?

With a bad interleaving we can end up stuck waiting: if the waiter unlocks the mutex and then, before it sleeps, there is a context switch to the other thread, that thread may signal while the waiter is not yet asleep. The waiter then sleeps and misses the wakeup, and may be blocked forever.

Semaphores (Dijkstra)

A Semaphore is initialized with an integer

  • int sem_init(sem_t *s, ..., unsigned int n);

Provides two functions:

  • sem_wait(sem_t *s)
  • sem_post(sem_t *s)

sem_wait will return only n more times than sem_post has been called, where n is the value the semaphore was initialized with.

Linux primarily uses semaphores for sleeping locks. Provides elegant solutions for some problems.

Semaphore producer/consumer

Initialize full to 0 (block consumer when buffer empty)

Initialize empty to BUFFER_SIZE (block producer when queue full)

void producer(void *ignored) {
	for (;;) {
		item *nextProduced = produce_item ();
		sem_wait(&empty);
		buffer [in] = nextProduced;
		in = (in + 1) % BUFFER_SIZE;
		sem_post(&full);
	}
}
void consumer(void *ignored) {
	for (;;) {
		sem_wait(&full);
		item *nextConsumed = buffer[out];
		out = (out + 1) % BUFFER_SIZE;
		sem_post(&empty); // create a new empty slot
		consume_item(nextConsumed);
	}
}

We build a semaphore using a combination of spinlocks and wait channels.

Lecture 12

A semaphore can be initialized to 0, which blocks waiters until another thread posts. Binary semaphores (initialized to 1) are similar to mutexes. Initializing to an integer n allows up to n threads to use a resource concurrently.

Semaphore implementation

Semaphore_Wait(Semaphore *sem) {
	Spinlock_Lock(&sem->sem_lock);
	while (sem->sem_count == 0) {
		// lock
		WaitChannel_Lock(sem->sem_wchan);
		// release spinlock
		Spinlock_Unlock(&sem->sem_lock);
		// wait channel protected by own lock
		WaitChannel_Sleep(sem->sem_wchan);
		// no locks held
		Spinlock_Lock(&sem->sem_lock);	
	}
	sem->sem_count--;
	Spinlock_Unlock(&sem->sem_lock);
}
 
Semaphore_Post(Semaphore *sem) {
	Spinlock_Lock(&sem->sem_lock);
	sem->sem_count++;
	WaitChannel_Wake(sem->sem_wchan);
	Spinlock_Unlock(&sem->sem_lock);
}

If we released the spinlock before calling WaitChannel_Lock, we would have a race condition: a post could call wakeup before the waiter is on the wait channel, waking nobody, and the waiter would then sleep and miss the wakeup.

Midterm Hint:

  • If code was changed around, would it still work? What would the schedule look like (order of operations between threads).

Question

Does swapping the last two lines work?

There’s a chance.

Hand-over-hand locking allows for fine-grained locking

Useful for concurrent data structure manipulation

  • Hold at most two locks: the previous and next locks
  • Locks must be ordered in a sequence

Example

We have locks A, B, C

flowchart LR
A --> B
B --> C

Locking the entire thing would be wasteful of other threads. Hence hand-over-hand locking. We have one lock per entry.

If we want to preserve traversal ordering, normal locking and unlocking doesn’t work. This can lead to race conditions.

lock(A)
// Operate on A
lock(B)
unlock(A) 
// operate on B
lock(C)
unlock(B)
// Operate on C
unlock(C)

This is why we interweave the locking. This preserves the correct traversal since we are locking the next entry before unlocking the current entry.
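A sketch of hand-over-hand traversal of a singly linked list with one pthread mutex per node (illustrative, not the course's code):

#include <stddef.h>
#include <pthread.h>

struct node {
	int value;
	struct node *next;
	pthread_mutex_t lock;       /* one lock per entry */
};

/* Visit every node while holding at most two locks at a time,
 * always locking the next node before unlocking the current one. */
void traverse(struct node *head, void (*visit)(struct node *)) {
	if (head == NULL)
		return;
	pthread_mutex_lock(&head->lock);
	struct node *cur = head;
	while (cur != NULL) {
		visit(cur);                              /* operate on the locked node */
		struct node *next = cur->next;
		if (next != NULL)
			pthread_mutex_lock(&next->lock);     /* grab the next lock first */
		pthread_mutex_unlock(&cur->lock);        /* then release the current one */
		cur = next;
	}
}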

The semaphore implementation uses hand-over-hand locking.

Monitors: Another synchronization primitive.

Programming language construct

  • Less error prone than raw mutexes
  • A class where only one procedure executes at a time
public class Statistics {
	private int counter;
	public synchronized int get() { return counter; }
	public synchronized void inc() { counter++; }
}

The synchronized keyword automatically locks the object.

Can implement mutex with monitor or vice versa.

Benign Races

In some cases we are okay with races. Sometimes “cheating” buys efficiency.

If we care more about speed than accuracy then this is okay.

if (!initialized) {
	lock(m);
	if (!initialized) { initialize(); initialized = 1; }
	unlock(m);
}

Usually not recommended.

Lecture 13

Can use static methods to detect data races (though they are not very good).

One dynamic tool is Eraser. It keeps track of a lockset (the set of candidate locks protecting each shared variable). On each access, it removes from the lockset any locks not currently held. If the lockset becomes empty, it reports a potential race condition.

Deadlock Problem

mutex_t m1, m2;
 
void f1(void *ignored) {
	lock(m1);
	lock(m2);
	// cs 
	unlock(m2);
	unlock(m1);
}
 
void f2 (void *ignored) {
	lock(m2);
	lock(m1);
	// cs
	unlock(m1);
	unlock(m2);
}

Thread 1 acquires m1, then is preempted. Context switch to thread 2 (f2): it acquires m2, but blocks when trying to acquire m1. Context switch back to thread 1 (f1): it tries to acquire m2 and blocks. Both threads are stuck forever.

Can only unlock something if we own the lock.

Examples of deadlocks without locks

funcA()
lock(mutex)
while (x == 0) {
	cond_wait(cond_x, mutex) // waits on condx, releases mutex
}
y++
unlock(mutex)
 
funcB()
lock(mutex)
while (y == 0) {
	cond_wait(cond_y, mutex)
}
x++
unlock(mutex)

If the values of x and y are both 0 at the same time, both threads sleep. However, a third thread could fix this.

Deadlock with combined mutex/condition variable deadlock. Need to be careful when using library functions.

mutex_t a, b;
cond_t c;
 
lock(a); lock(b); while (!ready) wait(b, c); // releases b, waiting on cond c
unlock(b); unlock(a);
 
 
lock(a); lock(b); ready = true; signal(c); 
unlock(b); unlock(a);

Danger

Dangerous to hold locks when crossing abstraction barriers.

Deadlock conditions (all 4 must be true in order for a deadlock to occur)

  1. Limited access (mutual exclusion)
    1. No locks means no deadlocks (but means race conditions, just as bad as deadlock). Usually can’t eliminate this condition
  2. No preemption
    1. Once resource granted, can’t be taken away. Eliminates deadlocks, but there’s a chance we get a livelock. Everyone is actively doing work, but none of it is useful.
  3. Multiple independent requests
    1. Don’t ask all at once
  4. Circularity in graph of requests
    1. This is the condition we try to remove

Two approaches to dealing with deadlock

  • Pro-active: prevention
  • Reactive: detection and corrective action

Can view the system as a graph. Processes and resources are nodes. Resource requests and assignments are edges.

If the graph has no cycles, there is no deadlock.

If the graph contains a cycle: there is definitely a deadlock if there is only one instance per resource; otherwise, maybe deadlock, maybe not.
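One common way to remove the circularity condition is to impose a global order on locks and always acquire them in that order; a sketch based on the earlier f1/f2 example (ordering by lock address is just one illustrative convention):

#include <stdint.h>
#include <pthread.h>

static pthread_mutex_t m1 = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t m2 = PTHREAD_MUTEX_INITIALIZER;

/* Acquire two mutexes in a fixed global order (here: by address),
 * so two threads can never wait on each other in a cycle. */
static void lock_pair(pthread_mutex_t *a, pthread_mutex_t *b) {
	if ((uintptr_t)a > (uintptr_t)b) { pthread_mutex_t *t = a; a = b; b = t; }
	pthread_mutex_lock(a);
	pthread_mutex_lock(b);
}

void f1(void *ignored) {
	lock_pair(&m1, &m2);
	/* critical section */
	pthread_mutex_unlock(&m2);
	pthread_mutex_unlock(&m1);
}

void f2(void *ignored) {
	lock_pair(&m2, &m1);    /* same internal order as f1, so no deadlock */
	/* critical section */
	pthread_mutex_unlock(&m1);
	pthread_mutex_unlock(&m2);
}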

Lecture 14

Amdahl’s Law

  • T1: The time one core takes to complete the task
  • s: The fraction of the job that must be serial (not parallel)
  • n: The number of cores

With n cores the time is Tn = T1 * (s + (1 - s)/n), so the speedup is T1/Tn = 1/(s + (1 - s)/n). Even if n were infinite, the speedup is at most 1/s.

Amdahl’s law places an ultimate limit on parallel speedup.
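As a worked example (numbers assumed for illustration): if s = 0.1 and n = 4, the speedup is 1 / (0.1 + 0.9/4) = 1/0.325 ≈ 3.1; even with unlimited cores it cannot exceed 1/0.1 = 10.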

Cache Coherence Protocols

Each core has its own cache. Creates opportunity for the caches to disagree about memory.

Caches leverage temporal locality. Pulling from memory to cache can boost performance. Also spatial locality since we are moving cache lines (usually 64 bytes).

In a cache coherent system, all cached copies of a region of memory must be the same. Without cache coherence, it is possible for a CPU to read stale data.

Snoopy MSI Protocol is a simple cache coherence protocol.

Each cache line is in one of three states. Each core snoops on the memory bus to determine if it needs to update the state of its cache lines.

Modified

  • Cache line has been updated, other caches must be in invalid state
  • Data in cache line is different than what is in memory
  • Must write data back to memory when transitioning out of this state

Shared

  • One or more caches (and memory) have a valid copy

Invalid

  • Doesn’t contain any data

Operations that are performed in response to a processor request.

stateDiagram-v2

S --> M: PrWr/BusUpgr
I --> M: PrWr/BusRdX
I --> S: PrRd/BusRd
M --> M: PrRd/-
M --> M: PrWr/-
S --> S: PrRd/-

Operations that are performed in response to a processor seeing a bus request

stateDiagram-v2

M --> S: BusRd/Flush
M --> I: BusRdX/Flush
S --> I: BusRdX/-
S --> I: BusUpgr/-
S --> S: BusRd/-
I --> I: BusRd/-
I --> I: BusRdX/-
I --> I: BusUpgr/-

These states are per cache line.

CPU cycles to access cache/memory (Intel Xeon):

  • L1: 3 cycles, L2: 11 cycles, LLC (last level cache): 44 cycles, RAM: 355 cycles

If a remote core has data cached in the modified state

  • Load: 109 cycles
  • Store: 115 cycles

Implications for multithreaded design

  • Avoid false sharing
    • Data is shared in cache line chunks
    • Avoid placing data used by the different threads in the same cache line
  • Align structures to cache lines
    • Place related data you need to access together
  • Pad data structures
    • Arrays of structures lead to false sharing
    • Add unused fields to ensure alignment (see the sketch after this list)
  • Avoid contending on cache lines
    • Reduce costly cache coherence traffic
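A sketch of padding and aligning per-thread counters so each sits on its own cache line (64 bytes is an assumed line size):

#include <stdalign.h>

#define CACHE_LINE 64               /* assumed cache line size */

/* Without alignment/padding, adjacent counters share a cache line, and every
 * increment by one thread invalidates that line in the other cores' caches. */
struct padded_counter {
	alignas(CACHE_LINE) long count;         /* start each element on a line boundary */
	char pad[CACHE_LINE - sizeof(long)];    /* explicit padding up to a full line */
};

struct padded_counter per_thread[8];        /* one counter per thread, no false sharing */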

Problem: two threads

Assume

. Add at the end to account for the first thread.

Now assume that

Lecture 15

CPU accesses physical memory over a memory bus. Some devices are connected to the system via a general I/O bus. Slow devices are connected to a peripheral bus.

Device drivers communicate with devices by accessing/modifying their registers.

Question

How can a device driver access device registers?

Option 1: Special I/O instructions

  • Such as in and out instructions on x86
  • Device registers are assigned port numbers
  • Instructions transfer data between a specified port and a CPU register

Option 2: Memory-mapped I/O

  • Each device register has a physical memory address
  • Device drivers can read from or write to device registers using normal load and store instructions, as though accessing memory (a sketch follows below).
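A sketch of what memory-mapped register access can look like from a driver (the addresses and bit layout here are invented for illustration):

#include <stdint.h>

/* Hypothetical output device: a status register and a data register mapped
 * at fixed addresses (values made up for this sketch). */
#define DEV_STATUS ((volatile uint32_t *)0xFFFF0000)
#define DEV_DATA   ((volatile uint32_t *)0xFFFF0004)
#define STATUS_READY 0x1

/* Write one character using ordinary loads and stores.
 * volatile keeps the compiler from caching or reordering the register accesses. */
void dev_putc(char c) {
	while ((*DEV_STATUS & STATUS_READY) == 0)
		;                       /* poll until the device reports ready */
	*DEV_DATA = (uint32_t)c;
}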

A device driver is a part of the kernel that interacts with a device. Example: write character to serial output device

// only one writer at a time
P(output device write semaphore)
// trigger the write operation
write character to device data register
repeat {
	read writeIRQ register
} until status is "completed"
write writeIRQ register to ack completion
V(output device write semaphore)

This example illustrates polling: The kernel driver repeatedly checks device status.

We can use interrupts to avoid polling.

Device driver read handler:

// only one disk request at a time
P(disk semaphore)
write target sector number to disk sector number register
write "read" command to disk status register
// wait for request to complete
P(disk completion semaphore)
copy data from device transfer buffer to memory
V(disk semaphore)

Interrupt handler for disk device:

// make the device ready again
write disk status register to ack completion
V(disk completion semaphore)

Direct Memory Access (DMA)

DMA is used for block data transfers between devices (e.g. a disk) and memory. Under DMA, the CPU initiates the data transfer and is notified when the transfer is finished. However, the device (not the CPU) controls the transfer itself.

  1. CPU issues DMA request to device
  2. Device directs data transfer
  3. Device interrupts CPU on completion

Anatomy of a disk

  • Stack of magnetic platters
  • Disk arm assembly
    • Arms rotate around pivot, all move together
    • Arms contain disk heads-one for each recording surface
    • Heads read and write data to platters

Platters are divided into concentric tracks. A stack of tracks of fixed radius is a cylinder. Heads record and sense data along cylinders. A significant fraction of the encoded stream is used for error correction.

Generally only one head active at a time. Disks usually have one set of read-write circuitry.

Logical view of a disk drive

  • Disk is an array of numbered sectors
  • Each sector is the same size
  • Sectors are the unit of transfer between the disk and memory
  • Storage is non-volatile, i.e., data persists even when the device is without power

Disk positioning system

  • Move head to specific track and keep it there
    • Resists physical shocks
  • A seek consists of up to four phases:
    • Speedup: Accelerate arm to max speed or half way point
    • Coast: At max speed (for long seeks)
    • Slowdown: Stops arm near destination
    • Settle: Adjusts head to actual desired track
  • Very short seeks dominated by settle time ( ms)
  • Short (200-400 cyl.) seeks dominated by speedup
    • Accelerations of
  • Average seek time is roughly one-third of the full seek time

Moving data to/from a disk involves:

  • Seek time: move the read/write heads to the appropriate cylinder
  • Rotational latency: wait until the desired sectors spin to the read/write heads
  • Transfer time: wait while the desired sectors spin past the read/write heads

Request service time is the sum of seek time, rotational latency, and transfer time.

Seek Time:

  • If the next request is for data on the same track as the previous request, seek distance and seek time will be zero
  • In the worst-case, e.g., seek from outermost track to innermost track, seek time may be 10 milliseconds or more.

Rotational latency:

  • Consider a disk that spins at 12000 RPM
  • One complete rotation takes 5 milliseconds
  • Rotational latency ranges from 0 ms to 5 ms

Transfer time:

  • Once positioned, the 12000 RPM disk can read/write all data on a track in one rotation (5 ms)
  • If only p percent of the track’s sectors are being read/written, the transfer time will be p percent of the complete rotation time (5 ms); a worked example follows below.
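As a worked example (request parameters assumed for illustration): with a 4 ms seek, an average rotational latency of 2.5 ms (half a rotation), and a transfer of 10 percent of a track (0.5 ms), the request service time is 4 + 2.5 + 0.5 = 7 ms.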

Lecture 16

The placement on disk is very important. If randomized, every sector might be on a different track. Most of the time will be spent seeking rather than transferring data. Random I/O is terrible (same concept from CS348).

We try to achieve contiguous accesses where possible. Try to order requests to minimize seek times.

I/O Scheduling Policies:

First Come First Served (FCFS): Process disk requests in the order they are received

Advantages

  • Easy to implement
  • Good fairness

Disadvantages

  • Cannot exploit request locality (long seek time)
  • Increases average latency, decreasing throughput

Shortest Positioning Time First (SPTF): Always pick request with shortest seek time. Also referred to as Shortest Seek Time First.

Advantages

  • Exploits locality of disk requests
  • Higher throughput

Disadvantages

  • Starvation (one outlier neglected)

Question

How do we fix the starvation?

Give older requests higher priority: Aged SPTF. Adjust the effective positioning time with a weighting factor: Teff = Tpos - W * Twait.

Can be modelled as an elevator problem.

Elevator scheduling (SCAN): Sweep across the disk, serving all requests passed. Like SPTF, but the next seek must be in the same direction. Switch directions only if there are no further requests.

Advantages

  • Takes advantage of locality
  • Bounded waiting

Disadvantages

  • Cylinders in the middle get better service
  • Might miss locality SPTF could exploit

CSCAN: Only sweep in one direction

VSCAN(r): Continuum between SPTF and SCAN. Similar to SPTF, but slightly changes the “effective” positioning time. If the request is in the same direction as the previous seek: Teff = Tpos. Otherwise: Teff = Tpos + r * Tmax.

When r = 0 we get SPTF; when r = 1 we get SCAN. r = 0.2 works well.

Advantages and disadvantages are those of SPTF and SCAN, depending on how r is set.

Now, people increasingly use flash memory/solid state drives.

Completely solid state (no moving parts).

DRAM: Requires constant power to keep values

  • Transistors with capacitors
  • Capacitor holds microsecond charge; periodically refreshed by primary power

Flash Memory: Traps electrons in quantum cage using floating gate transistors.

Instead of sectors, SSD arranges data into pages and blocks.

  • Pages of 2, 4, or 8 KB, grouped into 32 KB-4 MB blocks

We perform operations (read/writes) at the page level. Pages are initialized to 1s; can transition from 1 to 0 at page level (write new page). A high voltage is required to switch from 0 to 1 (overwrite/delete). Cannot apply high voltage at page level, only to blocks.

Writing and deleting from flash memory. Naïve solution is to read whole block into memory. Re-initialize block. Update block in memory; write back to SSD.

Translation layer solution: Mark page to be deleted/overwritten as invalid, write to an unused page, update translation table, requires garbage collection.

SSDs are not impervious. Blocks have a limited number of write cycles; repeated high voltage wears out the SSD. If a block is no longer writeable, it becomes read-only. When a certain number of blocks are read-only, the disk becomes read-only.

The SSD controller performs wear-leveling, ensuring that write cycles are evenly spread across all blocks.

Lecture 17

Most of lecture is midterm take up which I can’t post here.

File systems: traditionally the hardest part of OS.

Main tasks:

  • Don’t go away
  • Associate bytes with names (files)
  • Associate names with each other (directories)

Files are persistent, named data objects

  • Data consists of a sequence of numbered bytes
  • File may change size over time
  • File has associated meta-data

File systems are the data structures and algorithms used to store, retrieve, and access files. Translating name and offset to disk blocks.

File Interface: Basics

open

  • Returns a file identifier (or handle or descriptor), which is used in subsequent operations to identify the file.
  • Other operations require file descriptor as a parameter.

close

  • Kernel tracks which file descriptors are currently valid for each process.
  • close invalidates the file descriptor.

read,write, seek

  • read copies data from a file into a virtual address space. Returns the number of bytes read.
  • write copies data from a virtual address space into a file.
  • seek enables non-sequential reading/writing.

Each file descriptor (open file) has an associated file position. The position starts at byte 0 when the file is opened.

Read and write operations start from the current file position and update the current file position as bytes are read/written. This makes sequential file I/O easy for an application to request.

Seeks (lseek) are used to achieve non-sequential I/O. It changes the file position associated with a descriptor.

Sequential file reading example

char buf[512];
int i;
int f = open("myfile", O_RDONLY);
for (i = 0; i < 100; i++) {
	read(f, (void *)buf, 512);
}
close(f);

Read the first 51200 bytes of a file, 512 bytes at a time.

Read using seek in reverse order example.

char buf[512];
int i;
int f = open("myfile", O_RDONLY);
for (i = 1; i <= 100; i++) {
	lseek(f, (100-i)*512, SEEK_SET);
	read(f, (void *)buf, 512);
}
close(f);

Read the first 51200 bytes of a file, 512 bytes at a time, in reverse order.

A directory maps file names (strings) to i-numbers. An i-number is a unique (within a file system) identifier for a file or directory. Given an i-number, the file system can find the data and meta-data for the file.

Directories provide a way for applications to group related files. Since directories can be nested, a filesystem’s directories can be viewed as a tree, with a single root directory. In a directory tree, files are leaves.

Files may be identified by pathnames, which describe a path through the directory tree from the root directory to the file.

Directories also have pathnames. Applications refer to files using pathnames, not i-numbers.

Lecture 18

A hard link is an association between a name (string) and an i-number. Each entry in a directory is a hard link.

When a file is created, so is a hard link to that file. Once a file is created, additional hard links can be made to it. For example, link(/docs/a.txt,/foo/myA) creates a new hard link myA in directory /foo.

Linking to an existing file creates a new pathname for that file. A file has a unique i-number but may have multiple pathnames. Not possible to link to a directory.

Hard links can be removed. When the last hard link to a file is removed, the file is also removed. Since there are no links, there is no pathname, and the file cannot be opened.

It is not uncommon for a system to have multiple file systems. Some kind of global file namespace is required. DOS/Windows uses two-part file names. The file system name, and the pathname within the file system.

Unix: Creates a single hierarchical namespace that combines the namespace of two file systems.

Mounting does not make two file systems into one file system, it creates a single hierarchical namespace that combines the namespaces of two file systems.

Question

What needs to be stored persistently?

  • File data
  • File meta-data
  • Directories and links
  • File system meta-data

Non-persistent information

  • Open files per process
  • File position for each open file
  • Cached copies of persistent data

We will use an extremely small disk as an example (256 KB disk).

Most disks have a sector size of 512 bytes. 512 total sectors on this disk.

Group every 8 consecutive sectors into a block. Better spatial locality and reduces the number of block pointers.

VSFS: Very Simple File System

Most of the blocks should be for storing user data (last 56 blocks).

Need some way to map files to data blocks. We create an array of i-nodes, where each i-node contains the meta-data for a file. The index into the array is the file’s index number (i-number).

Assume each i-node is 256 bytes, and we dedicate 5 blocks for i-nodes. This allows for 80 total i-nodes/files.

We also need to know which i-nodes and blocks are unused. There are many ways to do this, but in VSFS we use bitmaps.

A block size of 4 KB means we can track 32K i-nodes and 32K blocks. This is more than we actually need.

We store an i-node bitmap as well as a data bitmap.

We reserve the first block as the superblock. A superblock contains meta-information about the entire file system.

An i-node is a fixed size index structure that holds both file meta-data and a small number of pointers to data blocks.

Assume disk blocks can be referenced based on a 4 byte address. In VSFS, an i-node is 256 bytes. Assume there is enough room for 12 direct pointers to blocks. Each pointer points to a different block for storing user data. Pointers are ordered: the first pointer points to the first block in the file.

Question

What is the maximum file size if we only have direct pointers?

12 × 4 KB = 48 KB

In addition to 12 direct pointers, we can also introduce an indirect pointer. This pointer points to a block full of direct pointers.

A 4 KB block of direct pointers holds 1024 pointers. The maximum file size is: (12 + 1024) × 4 KB = 4144 KB.

What if the disk were larger? We can add a double indirect pointer.
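Continuing the arithmetic (same assumptions of 4 KB blocks and 4-byte pointers): a double indirect pointer references a block of 1024 indirect pointers, each of which references a block of 1024 direct pointers, adding 1024 × 1024 × 4 KB = 4 GB, so the maximum file size becomes roughly 48 KB + 4 MB + 4 GB.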

Lecture 19

File System Design

File system parameters:

  • How many i-nodes should a file system have?
  • How many direct and indirect blocks should an i-node have?
  • What is the “right” block size?

For a general purpose file system, design it to be efficient for the common case.

  • Most files are small
  • Average file size is growing
  • File systems contain lots of files
  • Directories are typically small
  • A few big files use most of the space

In-Memory (Non-persistent) Structures

Per process

  • Which file descriptors does this process have open?
  • To which file does each open descriptor refer?
  • What is the current file position for each descriptor?

System wide

  • Open file table: Which files are currently open
  • i-node cache: In-memory copies of recently-used i-nodes
  • Block cache: In-memory copies of data blocks and indirect blocks

Reading from a File

First read the root i-node

  • At “well known” position (i-node 2)
  • i-node 1 is usually for tracking bad blocks
Reading /foo/bar (the on-disk structures involved are the data bitmap, i-node bitmap, root i-node, foo i-node, bar i-node, root data, foo data, and bar data[0..2]):

open(bar): read root i-node, read root data, read foo i-node, read foo data, read bar i-node
read():    read bar i-node, read bar data[0], write bar i-node
read():    read bar i-node, read bar data[1], write bar i-node
read():    read bar i-node, read bar data[2], write bar i-node

Then read the directory information from root. Find the i-number for foo and read the foo i-node.

We then find the i-number for bar and read the bar i-node.

Permission check (is the user allowed to read this file?). We allocate a file descriptor in the per-process descriptor table. Increment the counter for this entry in the global open file table.

Find the block using a direct/indirect pointer and read the data. Update the i-node with a new access time. Update the file position. Closing a file deallocates the file descriptor and decrements the counter for its entry in the global open file table.

Creating and writing /foo/bar (same structures as above):

create(/foo/bar): read root i-node, read root data, read foo i-node, read foo data, read i-node bitmap, write i-node bitmap, write foo data, read bar i-node, write bar i-node, write foo i-node
write():          read bar i-node, read data bitmap, write data bitmap, write bar data[0], write bar i-node
write():          read bar i-node, read data bitmap, write data bitmap, write bar data[1], write bar i-node
write():          read bar i-node, read data bitmap, write data bitmap, write bar data[2], write bar i-node

Lecture 20

When a file is opened, an entry is created in the per-process file descriptor table and in the global open file table. The reference count, offset, and i-node are stored in the open file table entry.

When forking a parent, the reference count in open file table increases. This is because there are two pointers to the same entry now.

Chaining is similar to a linked list: store a pointer to the next block in the current block. External chaining puts these pointers in an external chain. Advantage of this approach: faster, since the external chain can be cached in memory rather than followed on disk.

A single local file system operation may require several disk I/O operations.

Example

Deleting a file (this is not atomic)

  • Remove entry from directory (remove hard link)
  • Remove file index (i-node) from i-node table
  • Mark file’s data blocks free in free space index.

Question

What if, because of a failure, not all of these changes are reflected on disk?

System failure will destroy in-memory file system structures.

Persistent structures should be crash consistent. They should be consistent when system restarts after a failure.

Another more pressing example, moving files. Moving a file from /foo/a.txt to /bar/a.txt

  1. Update foo’s data block
  2. Update foo’s i-node
  3. Update bar’s data block
  4. Update bar’s i-node

If the power is cut after step 2, before step 3, the only hard link to the file is gone. The data is still on the disk but cannot be located. The move has become a bad delete (the file still has a reference count of 1, but no directory entry references it).

We can flip the ordering

  1. Update bar’s i-node
  2. Update bar’s data block
  3. Update foo’s data block
  4. Update foo’s i-node

Now if it breaks after step 2, our move has become a copy. The user will probably want to delete the file from foo. But this is dangerous, since we never updated the reference count of the i-node of a.txt (it is 1, not 2): removing the link from foo removes the entire file.

Question

Why not just update the reference / hard link count before and after?

There is overhead. It wouldn’t be necessary if we didn’t think a crash could happen.

Fault tolerance:

  • Special-purpose consistency checkers
    • Runs after a crash, before normal operations resume
    • Find and attempt to repair inconsistent file system data structures (i.e. file with no directory entry, free space that is not marked as free)

This can take an extremely long time.

  • Journaling
    • Record file system meta-data changes in a journal, so sequences of changes can be written to disk in a single operation
    • After changes have been journaled, update disk data structures
    • After a failure, redo journaled updates in case they were not done before the failure
    • We use checkpoints to remove old logs
    • We can do this because replaying journaled block updates is idempotent (a sketch of this write ordering follows below)
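Below is a toy sketch of this ordering (journal the changes, write a commit record, then checkpoint) against an in-memory "disk". The structures and names (journal_update, journal_commit, checkpoint) are illustrative, not any real file system's on-disk format.

#include <stdio.h>
#include <string.h>

#define NBLOCKS 8
#define JOURNAL_SLOTS 4

static char disk[NBLOCKS][16];          /* in-place metadata blocks        */
static struct { int blkno; char data[16]; } journal[JOURNAL_SLOTS];
static int journal_len = 0;
static int committed = 0;               /* stands in for the commit record */

/* Step 1: record the intended change in the journal (sequential writes). */
void journal_update(int blkno, const char *data) {
    journal[journal_len].blkno = blkno;
    strncpy(journal[journal_len].data, data, sizeof journal[journal_len].data);
    journal_len++;
}

/* Step 2: write the commit record; only now is the transaction durable. */
void journal_commit(void) { committed = 1; }

/* Step 3 (checkpoint): apply journaled updates in place, then free the log.
 * Replaying this after a crash is safe because each block update is idempotent. */
void checkpoint(void) {
    if (!committed) { journal_len = 0; return; }   /* uncommitted: discard */
    for (int i = 0; i < journal_len; i++)
        memcpy(disk[journal[i].blkno], journal[i].data, 16);
    journal_len = 0;
    committed = 0;
}

int main(void) {
    journal_update(2, "bar dir entry");   /* e.g., update a directory block */
    journal_update(5, "bar i-node");      /* e.g., update an i-node block   */
    journal_commit();
    checkpoint();
    printf("block 2: %s\n", disk[2]);
    return 0;
}

If the crash happens before the commit record is written, the whole transaction is discarded; if it happens after, the journal is simply replayed.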

Every process has its own virtual address space. Great from abstraction perspective.

The maximum amount of physical memory is determined by the physical address size. For example, if the physical address is 18 bits, there can be at most 2^18 bytes = 256 KB of physical memory.

If physical addresses have P bits, the maximum amount of addressable physical memory is 2^P bytes.

The actual amount of physical memory on a machine may be less than the maximum amount that can be addressed.

Size of virtual address space does not have to be the same size as the physical address space.

If virtual addresses are 16 bits, the maximum virtual memory size is 2^16 bytes = 64 KB.

Question

How do we map virtual memory to physical memory?

The kernel provides a separate, private virtual memory for each process. Virtual memory of a process holds the code, data, and stack for the program that is running in that process.

If virtual addresses are V bits, the maximum size of a virtual memory is 2^V bytes.

Running applications see only virtual addresses. The program counter and stack pointer hold virtual addresses. Pointers to variables are virtual addresses. Jumps and branches refer to virtual addresses.

Each process is isolated in its virtual memory, and cannot access other processes’ virtual memories.

Each virtual memory is mapped to a different part of physical memory. When a process tries to access (load or store) a virtual address, the virtual address is translated (mapped) to its corresponding physical address, and the load or store is performed in physical memory.

Address translation is performed in hardware, using information provided by the kernel.

Dynamic relocation keeps track of an offset and a limit. The offset tells us where the virtual memory is mapped in physical memory, and the limit tells us the size of the virtual memory we are mapping.

The CPU includes a memory management unit (MMU), with a relocation register and a limit register.

  • Relocation register holds the physical offset (R) for the running process’ virtual memory
  • Limit register holds the size (L) of the running process’ virtual memory

To translate a virtual address v to a physical address p:

if v >= L then generate exception
else
	p = v + R

Translation is done in hardware by MMU.

Lecture 21

The kernel maintains a separate R and L for each process, and changes the values in the MMU registers when there is a context switch between processes.

Each virtual address space corresponds to a contiguous range of physical addresses.

The kernel is responsible for deciding where each virtual address space should map in physical memory. The OS must track which parts of physical memory are in use, and which parts are free. OS must allocate/deallocate variable-sized chunks of physical memory.

This creates the potential for fragmentation of physical memory.

There are two types of fragmentation. The above is external fragmentation, but there also exists internal fragmentation.

Example

Dynamic Relocation Examples

Process:

Limit Register (L): 0x0000 7000, Relocation Register (R): 0x0002 4000

v = 0x102c: since v is less than L, we add R: p = 0x2502c

v = 0x8800: v is not less than L, so p = Exception

v = 0x0000: p = 0x24000
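A small C sketch of this translation, hard-coding the register values from the example; the function name translate and the use of a return value to stand in for the MMU exception are assumptions for illustration.

#include <stdio.h>
#include <stdint.h>

/* Register values from the example above. */
static const uint32_t L = 0x00007000;   /* limit register      */
static const uint32_t R = 0x00024000;   /* relocation register */

/* Translate virtual address v; returns the physical address, or prints a
 * message and returns UINT32_MAX to stand in for an MMU exception. */
uint32_t translate(uint32_t v) {
    if (v >= L) {
        printf("v=0x%x -> exception\n", v);
        return UINT32_MAX;
    }
    uint32_t p = v + R;
    printf("v=0x%x -> p=0x%x\n", v, p);
    return p;
}

int main(void) {
    translate(0x102c);   /* 0x2502c   */
    translate(0x8800);   /* exception */
    translate(0x0000);   /* 0x24000   */
    return 0;
}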

Paging: Physical Memory

Refer to physical memory as frames (instead of physical pages). Refer to virtual memory as pages.

A page is the same size as a frame. Here, each frame is 4 KB.

Each page maps to a different frame. Any page can map to any frame.

4KB is the smallest amount we can allocate. If we only need 1 byte, we still must assign at least 4KB. This is internal fragmentation. We cannot safely reassign this empty space to something else.

There is a tradeoff between internal and external fragmentation.

Question

Why not smaller page sizes?

Smaller pages introduce a lot of overhead: more pages means more page table entries to store and manage.

An alternative approach to having many registers is to store the relocation values in memory (as a page table).

Separate page table for every single process

Process Page Table

Page   Frame   Valid?
0x0    0x0f    1
0x1    0x26    1
0x2    0x27    1
0x3    0x28    1

Valid tells us if memory has been allocated or not.

The MMU includes a page table base register which points to the page table for the current process.

How the MMU translates a virtual address:

  1. Determines the page number and offset of the virtual address
    1. Page number is virtual address divided by the page size
    2. Offset is the virtual address modulo the page size (the byte we are interested in)
  2. Looks up the page’s entry (PTE) in the current process page table, using the page number
  3. If the PTE is not valid, raise an exception
  4. Otherwise, combine page’s frame number from the PTE with the offset to determine the physical address
    1. Physical address is (frame number × frame size) + offset

Process virtual address (16 bits)

The last 12 bits are the offset, the first 4 bits are the page number.

We only send the first 4 bits to the page table lookup to determine the frame number for the physical address. These 4 bits become 6 bits (frame number). The last 12 bits are the same offset in the virtual address and physical address (this is because the size of the frame is the same as the size of the page).

Example

Address Translation

Page   Frame   Valid?
0x0    0x0f    1
0x1    0x26    1
0x2    0x27    1
0x3    0x28    1
0x4    0x11    1
0x5    0x12    1
0x6    0x13    1
0x7    0x00    0
0x8    0x00    0

The last 3 hex digits are the offset; the first hex digit is the page number.

v = 0x102c: page number is 0x1 → frame 0x26 with offset 0x02c

p = 0x2602c

v = 0x9800: page number is 0x9, which has no valid mapping

p = Exception

v = 0x0024: page number is 0x0 → frame 0x0f with offset 0x024

p = 0x0f024
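A small C sketch of single-level paging translation using the page table from this example, with 16-bit virtual addresses and 4 KB pages; the array-based page table and the exception convention are illustrative assumptions.

#include <stdio.h>
#include <stdint.h>

#define PAGE_SIZE   4096u        /* 4 KB pages => 12-bit offset            */
#define NUM_PAGES   9            /* pages 0x0 .. 0x8 in the example table  */

/* Page table from the example: {frame, valid} per page. */
static const struct { uint32_t frame; int valid; } pt[NUM_PAGES] = {
    {0x0f, 1}, {0x26, 1}, {0x27, 1}, {0x28, 1},
    {0x11, 1}, {0x12, 1}, {0x13, 1}, {0x00, 0}, {0x00, 0},
};

/* Translate a 16-bit virtual address; returns UINT32_MAX for an exception. */
uint32_t translate(uint16_t v) {
    uint32_t page   = v / PAGE_SIZE;   /* top 4 bits  */
    uint32_t offset = v % PAGE_SIZE;   /* low 12 bits */
    if (page >= NUM_PAGES || !pt[page].valid)
        return UINT32_MAX;             /* MMU would raise an exception */
    return pt[page].frame * PAGE_SIZE + offset;
}

int main(void) {
    printf("0x102c -> 0x%x\n", translate(0x102c));  /* 0x2602c */
    printf("0x9800 -> %s\n",
           translate(0x9800) == UINT32_MAX ? "exception" : "?");
    printf("0x0024 -> 0x%x\n", translate(0x0024));  /* 0x0f024 */
    return 0;
}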

PTEs may contain other fields, in addition to the frame number and valid bit.

Example 1: Write protection bit

  • Can be set by the kernel to indicate that a page is read-only
  • If a write operation to memory uses a virtual address on a read-only page, the MMU will raise an exception when it translates the virtual address

Example 2: Bits to track page usage

  • Reference (use) bit: has the process used this page recently?
  • Dirty bit: have contents of this page been changed?
  • These bits are set by the MMU, and read by the kernel

A page table has one PTE for each page in the virtual memory

  • Page table size = (number of pages) × (size of PTE)
  • Number of pages = (virtual memory size) ÷ (page size)

The page table for 64 KB of virtual memory with 4 KB pages is 64 bytes, assuming 4 bytes (32 bits) for each PTE. Page tables for larger virtual memories are larger.
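As a worked check of these formulas for the 64 KB example:

$$
\text{number of pages} = \frac{64\text{ KB}}{4\text{ KB}} = 16, \qquad
\text{page table size} = 16 \times 4\text{ bytes} = 64\text{ bytes}
$$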

Page tables are kernel data structures.

Kernel:

  • Manage MMU registers on address space switches
  • Create and manage page tables
  • Manage physical memory
  • Handle exceptions raised by the MMU

MMU:

  • Translate virtual addresses to physical addresses
  • Check for and raise exceptions when necessary

Execution of each machine instruction may involve one, two, or more memory operations: one to fetch the instruction, and one or more for instruction operands.

Address translation through a page table adds one extra memory operation (for PTE lookup) for each memory operation performed during instruction execution.

This can be slow.

Solution: Include a Translation Lookaside Buffer (TLB) in the MMU

  • TLB is a small, fast, dedicated cache of address translations, in the MMU
  • Each TLB entry stores a (page → frame) mapping

What the MMU does to translate a virtual address on page p:

if there is an entry (p, f) in the TLB then
	return f // TLB hit
else
	find p's frame number (f) from the page table
	add (p,f) to the TLB, evicting another entry if full
	return f // TLB miss
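A toy C sketch of this TLB behaviour, assuming a tiny fully-associative TLB with round-robin eviction and an always-valid toy page table; real MMUs implement this in hardware and use different sizes and replacement policies.

#include <stdio.h>
#include <stdint.h>

#define TLB_SIZE   4
#define PT_SIZE    16

/* Toy page table: frame number per page (all valid for simplicity). */
static uint32_t page_table[PT_SIZE] = {
    0x0f, 0x26, 0x27, 0x28, 0x11, 0x12, 0x13, 0x14,
    0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c,
};

/* Tiny fully-associative TLB with round-robin replacement. */
static struct { uint32_t page, frame; int valid; } tlb[TLB_SIZE];
static int next_victim = 0;

uint32_t tlb_translate(uint32_t p) {
    for (int i = 0; i < TLB_SIZE; i++)
        if (tlb[i].valid && tlb[i].page == p)
            return tlb[i].frame;          /* TLB hit: no page table access */

    uint32_t f = page_table[p];           /* TLB miss: consult the page table */
    tlb[next_victim].page  = p;           /* evict an entry if the TLB is full */
    tlb[next_victim].frame = f;
    tlb[next_victim].valid = 1;
    next_victim = (next_victim + 1) % TLB_SIZE;
    return f;
}

int main(void) {
    printf("page 1 -> frame 0x%x (miss)\n", tlb_translate(1));
    printf("page 1 -> frame 0x%x (hit)\n",  tlb_translate(1));
    return 0;
}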

If the MMU cannot distinguish TLB entries from different address spaces, then the kernel must clear or invalidate the TLB on each context switch from one process to another.

Contiguous memory allocation is not common.

Lecture 22

MIPS: virtual addresses are 32 bits, so the maximum virtual memory size is 2^32 bytes.

x86-64: virtual addresses are 48 bits, so the maximum virtual memory size is 2^48 bytes.

Much of the virtual memory may be unused. Applications can use discontiguous segments of virtual memory.

Example

Sizes

32-bit address space.

Each page is 4 KB.

Question

How many bits is the offset?

The offset is 12 bits, since 2^12 = 4 KB: the size of a page tells us how big the offset is. The page number is the remaining 20 bits, so the number of page table entries is 2^20. Each page table entry is 4 bytes, so the page table is 4 MB. The page table may be larger than the amount of virtual memory the program actually uses, and most of the page table entries are marked as invalid.
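The same calculation for the 32-bit address space with 4 KB pages, written out:

$$
\text{offset} = 12\text{ bits}, \quad
\text{page number} = 32 - 12 = 20\text{ bits}, \quad
\text{page table size} = 2^{20} \times 4\text{ bytes} = 4\text{ MB}
$$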

Segmentation: Instead of mapping the entire virtual memory to physical, we can provide a separate mapping for each segment of the virtual memory that the application actually uses. The kernel maintains an offset and limit for each segment.

With K bits for the segment ID (out of a V-bit virtual address), we can have up to:

  • 2^K segments
  • 2^(V-K) bytes per segment

Kernel decides where each segment is placed in physical memory.

Better than dynamic relocation, but external fragmentation still exists.

Many different approaches for translating segmented virtual addresses.

Approach 1: MMU has a relocation register and a limit register for each segment.

Let R_s be the relocation offset and L_s be the limit for the s-th segment.

To translate a virtual address v to a physical address p:

split v into segment number (s) and
	address within segment (a)
if a >= L_s then generate exception
else
	p = a + R_s

Example

Segmented Address Translation

Limit Register 0: 0x2000

Limit Register 1: 0x5000

Relocation Register 0: 0x38000

Relocation Register 1: 0x10000

0x1240: Segment = 0, Offset = 0x1240 → 0x39240 (Relocation Register 0 + Offset)

The first bit is the segment bit; the rest is the offset within the segment. Breaking the first hex digit (1) into binary gives 0001; its first bit is 0, so segment = 0. We only use 1 bit to determine the segment, since there are only two segments.

0xa0a0: Segment = 1, Offset = 0x20a0 → 0x120a0 (Relocation Register 1 + Offset)

0x66ac: Segment = 0, Offset = 0x66ac ≥ Limit Register 0 → Exception
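A small C sketch of this two-segment translation, hard-coding the limit and relocation values from the example and assuming a 16-bit virtual address whose top bit selects the segment; the exception convention is illustrative.

#include <stdio.h>
#include <stdint.h>

/* One bit of the 16-bit virtual address selects the segment; the remaining
 * 15 bits are the offset within that segment. Values from the example. */
static const uint32_t limit[2] = { 0x2000,  0x5000  };
static const uint32_t reloc[2] = { 0x38000, 0x10000 };

uint32_t translate(uint16_t v) {
    uint32_t s = v >> 15;           /* segment number (top bit)  */
    uint32_t a = v & 0x7fff;        /* offset within the segment */
    if (a >= limit[s]) {
        printf("v=0x%04x -> exception\n", v);
        return UINT32_MAX;
    }
    uint32_t p = reloc[s] + a;
    printf("v=0x%04x -> segment %u, offset 0x%x, p=0x%x\n", v, s, a, p);
    return p;
}

int main(void) {
    translate(0x1240);   /* segment 0 -> 0x39240   */
    translate(0xa0a0);   /* segment 1 -> 0x120a0   */
    translate(0x66ac);   /* segment 0 -> exception */
    return 0;
}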

Another approach to translating segmented virtual addresses is to maintain a segment table.

Virtual address is broken up into segment number and offset.

We determine if the segment number is valid. If it is valid, it is an index into the segment table. The segment table base register tells us where the segment table is located in the kernel. Each entry in the segment table stores the size and the start address (relocation value). To get the physical address, we add the start from the segment table with the offset.

Two level paging: Instead of having a single page table to map an entire virtual memory, we can split the page table into smaller page tables, and add a page table directory.

Each virtual address has three parts:

  1. Level one page number: used to index the directory
  2. Level two page number: used to index a page table
  3. Offset within a page

A 16-bit virtual address:

  • Page number is 4 bits
  • Offset is 12 bits

The number of page table entries is 2^4 = 16.

Alternatively, with two-level paging, the 16 bits are split as:

  • Level-one page number (p1) is 2 bits
  • Level-two page number (p2) is 2 bits
  • Offset is 12 bits

The first four pages map to the first directory entry, which stores a pointer to the second-level page table holding those 4 entries. The first two bits (p1) are used to index into the page table directory; the remaining two bits (p2) are used to index within the second-level page table.

The MMU’s page table base register points to the page table directory for the current process.


How the MMU translates a virtual address:

  1. Index into the page table directory using the level-one page number (p1) to get a pointer to a second-level page table
  2. If the directory entry is not valid, raise an exception
  3. Index into the second-level page table using the level-two page number (p2) to find the PTE for the page
  4. If the PTE is not valid, raise an exception
  5. Otherwise, combine the frame number from the PTE with the offset to determine the physical address being accessed (see the sketch below)
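A small C sketch of this two-level lookup, using the 16-bit layout from the earlier example (2-bit p1, 2-bit p2, 12-bit offset); the in-memory directory and table structs are illustrative assumptions, not a real MMU format.

#include <stdio.h>
#include <stdint.h>

#define PAGE_SIZE 4096u       /* 12-bit offset */

/* Second-level page table: 4 PTEs of {frame, valid}. */
struct pte { uint32_t frame; int valid; };
struct l2_table { struct pte pte[4]; };

/* One second-level table mapping pages 0..3; the other directory entries
 * are not valid. Frame numbers are arbitrary for the illustration. */
static struct l2_table table0 = {
    .pte = { {0x0f, 1}, {0x26, 1}, {0x27, 1}, {0x28, 1} }
};

/* Page table directory: 4 entries of {table pointer, valid}. */
static struct { struct l2_table *table; int valid; } directory[4] = {
    { &table0, 1 }, { 0, 0 }, { 0, 0 }, { 0, 0 }
};

/* Translate a 16-bit virtual address; UINT32_MAX stands in for an exception. */
uint32_t translate(uint16_t v) {
    uint32_t p1     = (v >> 14) & 0x3;   /* top 2 bits: directory index */
    uint32_t p2     = (v >> 12) & 0x3;   /* next 2 bits: table index    */
    uint32_t offset =  v        & 0xfff; /* low 12 bits                 */
    if (!directory[p1].valid) return UINT32_MAX;
    struct pte e = directory[p1].table->pte[p2];
    if (!e.valid) return UINT32_MAX;
    return e.frame * PAGE_SIZE + offset;
}

int main(void) {
    printf("0x102c -> 0x%x\n", translate(0x102c));   /* 0x2602c */
    printf("0x9800 -> %s\n",
           translate(0x9800) == UINT32_MAX ? "exception" : "?");
    return 0;
}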

One goal of two-level paging was to keep individual page tables small.

Suppose we have a 40 bit virtual address and that

  • PTE is 4 bytes
  • page size is 4KB (2^12 bytes)
  • We’d like to limit each page table’s size to 4KB

A 40-bit virtual address:

  • Offset is 12 bits
  • Each page table holds 4 KB ÷ 4 bytes (per PTE) = 2^10 entries, so the second-level page number gets 10 bits
  • The first-level page number gets the remaining 18 bits (40 - 12 - 10)

Each second-level page table is then exactly 4 KB in size, but our page table directory would need 2^18 entries (which is large). How can we shrink this if we must (i.e., the maximum size of each table must be 4 KB)?

  • We can break the first-level page number up further into two separate indices (for example, 10 bits and 8 bits)

Thus we need three levels of paging to ensure that each page table’s size is no greater than 4KB.
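One way to write out the resulting bit budget; the particular 10/8 split of the top 18 bits is one valid choice, not the only one:

$$
40 = \underbrace{8}_{\text{level-1 index}} + \underbrace{10}_{\text{level-2 index}} + \underbrace{10}_{\text{level-3 index}} + \underbrace{12}_{\text{offset}}
$$

With this split, each level-2 and level-3 table is 2^10 × 4 bytes = 4 KB, and the top-level directory is 2^8 × 4 bytes = 1 KB.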

This is multi-level paging.

Properties of multi-level paging

  • Can map large virtual memories by adding more levels
  • Individual page tables/directories can remain small
  • Can avoid allocating page tables and directories that are not needed for programs that use a small amount of virtual memory
  • TLB misses become more expensive as the number of levels goes up, since more directories must be accessed to find the correct PTE

We want the kernel to live in virtual memory, but there are challenges.

  1. Bootstrapping: Since the kernel helps to implement virtual memory, how can the kernel run in virtual memory when it’s just starting?
  2. Sharing: Sometimes data needs to be copied between the kernel and application programs. How can this happen when they are in different virtual address spaces?

We can make the kernel’s virtual memory overlap with process’ virtual memories.

Exploiting secondary storage

Goals

  • Allow virtual address spaces that are larger than the physical address space
  • Allow greater multiprogramming levels by using less of the available (primary) memory for each process

Method:

  • Allow pages from virtual memories to be stored in secondary storage
  • Swap pages between secondary storage and primary memory so that they are in primary memory when they are needed