Professor: Bernard Wong | Term: Winter 2025
Lecture 1
Question
What is an operating system?
- Layer between applications and hardware (gcc <> OS <> Hardware)
- Makes hardware useful to the programmer
- Usually provides abstractions for applications
- Manages and hides details of hardware
- Accesses hardware through low-level interfaces unavailable to applications
- Often: Provides protection
- Prevents one process/user from clobbering another
Primitive operating systems are just a library of standard services (no protection).
Simplifying assumptions:
- This system only runs one program at a time (hence no need for protection).
- There are no bad users or programs (this is a bad assumption)
Problem: Poor utilization
- of hardware
- of human user
Multitasking: we can run more than one process at once. When one process blocks, run another process.
Question
What can ill-behaved process do?
- Go into an infinite loop and never relinquish the CPU
- Scribble over other processes’ memory to make them fail
OS provides mechanisms to address these problems
- Preemption (take CPU away from looping process)
- Memory protection (protect processes’ memory from one another)
Multi-user OSes
Many OSes use protection to serve distrustful users/apps
Idea: with n users, the system is not n times slower
- User demand for CPU is bursty
Question
What can go wrong?
- Users are gluttons, use too much CPU (need policies)
- Total memory usage greater than in machine (must virtualize)
- Super-linear slowdown with increasing demand (thrashing)
Protection
- Mechanisms that isolate bad programs and people
- Pre-emption:
- Give application a resource, take it away if needed elsewhere
- Interposition/mediation:
- Place OS between application and “stuff”
- Track all pieces that application allowed to use
- On every access, look in table to check that access is legal
- Privileged & Unprivileged modes in CPUs
- Applications unprivileged
- OS privileged
- Protection operations can only be done in privileged mode
Most software runs as user-level processes
OS kernel runs in privileged mode
- Creates/deletes processes
- Provides access to hardware
Question
So how does user “talk” to kernel?
OS provides a system call interface to safely communicate with kernel.
Applications can invoke kernel through system calls
- Special instruction transfers control to kernel
- Which dispatches to one of few hundred syscall handlers
Lecture 2
System calls: do things the application can’t do in unprivileged mode.
Kernel supplies well-defined system call interface.
- Applications set up syscall arguments and trap to kernel.
- Kernel performs operation and returns result
Higher-level functions are built on the syscall interface (e.g. printf, scanf).
Processes, threads, locks, file i/o are some examples of abstractions from the operating system.
A process is an instance of a program running.
Modern operating systems can run multiple processes simultaneously.
Multiple processes can increase CPU utilization and reduce latency.
Suppose process A takes 80 seconds and process B takes 20 seconds.
Running A then B requires 100 seconds for B to complete.
(80 + 100) / 2 = 90 seconds on average.
If we can run A and B concurrently, we can make B finish faster.
A runs, then B runs for 10 seconds, then A runs, then B runs for 10 seconds, then B completes.
A is slower than if it had the whole machine to itself.
(100 + 40) / 2 = 70 seconds on average.
Parallelism is a fact of life. Most computers, laptops and phones are multi-core. Computer with 4 cores can run 4 processes in parallel. Ideally we get 4 times the throughput.
Each process has own view of machine. They have their own address space, own open files, own virtual CPU.
*(char *)0xc000 is different in each process
Physical Memory
stack | grows down |
---|---|
heap | grows up |
data | global variables |
text | contains actual code |
Every process has something that looks like this. This simplifies the programming model. gcc
does not care that firefox is running.
int fork(void);
Create a new process that is an exact copy of the current one. Returns the process ID
of the new process in the “parent”. Returns 0 in the child. (Ugly fork questions usually appear on the midterm).
int waitpid(int pid, int *stat, int opt);
- Wait for a child process to terminate
- pid - process to wait for
- stat - will contain the exit value, or signal, from the child
- opt - usually 0 or WNOHANG
- Returns the process ID of the child
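A minimal sketch (not from the lecture) of fork and waitpid used together; the messages and the exit status 7 are made up for illustration:

```c
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
    pid_t pid = fork();                  /* create a copy of this process */
    if (pid < 0) {
        perror("fork");
        exit(1);
    }
    if (pid == 0) {                      /* fork returned 0: this is the child */
        printf("child: pid %d\n", (int)getpid());
        exit(7);                         /* status collected by the parent */
    }
    int stat;
    waitpid(pid, &stat, 0);              /* parent blocks until the child exits */
    if (WIFEXITED(stat))
        printf("parent: child exited with %d\n", WEXITSTATUS(stat));
    return 0;
}
```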
Lecture 3
We run processes in unprivileged mode. Within our processes we have stacks.
We call close() in the userspace system call library. We then load the call parameters, set the syscall number, and execute the syscall instruction. We then switch to the kernel stack.
We then save the complete processor state into a trap frame.
We check whether this is an exception, interrupt, or system call (handled by mips_trap
)
Once syscall
returns, it modifies register values stored in the trap_frame
.
if (err) {
tf->tf_v0 = err;
tf->tf_a3 = 1; // error
} else {
// Success
tf->tf_v0 = retval;
tf->tf_a3 = 0; // no error
}
// Advance the PC, to avoid restarting the syscall
tf->tf_epc += 4;
void exit(int status)
Question
What output will be generated by the parent and child processes for the program shown below?
int x;
int main() {
int rc;
x = 0;
rc = fork();
if (rc == 0) {
x = 10;
printf("A: %d\n", x);
} else {
printf("B: %d\n", x);
x = 100;
}
printf("C: %d\n");
}
Parent
x=0
rc = pid of child
"B: 0"
x = 100
"C: 100"
Child
rc = 0
x = 10
"A: 10"
"C: 10"
Question
Consider a system with a process, ProcA, that executes the program below. Draw the tree rooted at ProcA representing the processes created by this program. Every node in the tree should represent a process, and there should be an edge from node X to node Y if process X creates process Y. Label each node of the tree with the label of the fork() system call that created it.
int main() {
fork(); // Label B
fork(); // Label C
fork(); // Label D
return 0;
}
flowchart TD
id1((A)) --> id2((B))
id2((B)) --> id5((C))
id5((C)) --> id7((D))
id2((B)) --> id6((D))
id1((A)) --> id3((C))
id3((C)) --> id8((D))
id1((A)) --> id4((D))
int main(int argc, char *argv[]) {
int status;
int x = 1;
printf("%d\n", x);
status = fork();
if (status == 0) {
x += 1;
} else {
x *= 2;
waitpid(status, NULL, 0);
}
printf("%d\n", x);
status = fork();
if (status == 0) {
x *= 2;
} else {
x *= 3;
}
printf("%d\n", x);
}
Parent
x = 1
"1"
spawn child
x = 2
waitpid - wait for child process to complete (__ symbol)
"2"
Spawn another child
Child
x = 2
"2"
spawn child of child
x=6
"6"
__ dotted line to waitpid symbol of parent. Can continue from that point on
Child of Child
x = 4
"4"
Second child
x = 6
"6"
Second child’s child
x=4
"4"
Lecture 4
Sketch of previous example.
One possible output: 1, 2, 4, 6, 2, 4, 6
One not possible output: 1, 6, 2, 4
We don’t expect the parent to print before the child finishes, because of waitpid.
Another possible output: 1, 2, 6, 2, 6, 4, 4 (last 4 is grandchild)
Another possible output: 1, 2, 6, 4, 2, 6, 4
{c} void exit(int status)
- Current process ceases to exist
status
shows up inwaitpid
- By convention, status of 0 is success, non-zero is an error
{c} int kill(int pid, int sig)
- Sends signal
sig
to processpid
- SIGTERM: most common value, kills the process by default
- SIGKILL: stronger, always kills the process
{c} int execve(char *prog, char **argv, char **envp);
- Execute a new program
- prog - full pathname of the program to run
- argv - argument vector that gets passed to main
- envp - environment variables
It is generally called through wrapper functions.
{c} int execvp(char *prog, char **argv);
- Search PATH for prog, use the current environment
{c} int execlp(char *prog, char *arg, ...);
- List arguments one at a time, finish with NULL.
pid_t pid; char **av;
void doexec() {
execvp(av[0], av);
perror(av[0]);
exit(1);
}
/* ... main loop: */
for (;;) {
parse_input(&av, stdin);
switch (pid = fork()) {
case -1:
perror("fork"); break;
case 0:
doexec();
default:
waitpid(pid, NULL, 0); break;
}
}
File descriptors:
Each process has its own file descriptor table. What is stored inside is a pointer to the global open file table (with an offset) which has information necessary for the operating system to operate on the specific file.
Special file descriptors
- 0: STDIN
- 1: STDOUT
- 2: STDERR
{bash} cat file.txt | wc -l
- STDOUT of {bash}cat is sent to the STDIN of {bash}wc
{c} int dup2(int oldfd, int newfd);
- Closes newfd, if it was a valid descriptor
- Makes newfd an exact copy of oldfd
- The two file descriptors will share the same offset
Example: redirsh.c
- Loop that reads a command and executes it
void doexec (void) {
int fd;
if (infile) { /* non-NULL for "command < infile" */
if ((fd = open(infile, O_RDONLY)) < 0) {
perror(infile);
exit(1);
}
if (fd != 0) {
dup2(fd, 0);
close(fd);
}
}
/* ... do same for outfile→fd 1, errfile→fd 2 ... */
execvp (av[0], av);
perror (av[0]);
exit (1);
}
Pipes
{c} int pipe(int fds[2]);
- Returns two file descriptors in {c} fds[0] and {c} fds[1]
- Writes to {c} fds[1] will be read on {c} fds[0]
- When the last copy of {c} fds[1] is closed, reads on {c} fds[0] return EOF
- Returns 0 on success, -1 on error
Operations on pipes:
- read/write/close - as with files
- When {c} fds[1] is closed, {c} read(fds[0]) returns 0 bytes
- When {c} fds[0] is closed, {c} write(fds[1]):
  - Kills the process with SIGPIPE
  - Or, if the signal is ignored, fails with EPIPE
This is useful when forking: the parent can write into the write end and the child can later read from the read end. This is only possible because they share the same entries in the global open file table.
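A small sketch (assumed, not from the notes) of a parent writing into a pipe and a forked child reading from it; the message text is arbitrary:

```c
#include <stdio.h>
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
    int fds[2];
    pipe(fds);                           /* fds[0] = read end, fds[1] = write end */
    if (fork() == 0) {                   /* child inherits both descriptors */
        char buf[64];
        close(fds[1]);                   /* child only reads */
        ssize_t n = read(fds[0], buf, sizeof(buf) - 1);
        if (n > 0) {
            buf[n] = '\0';
            printf("child read: %s\n", buf);
        }
        return 0;
    }
    close(fds[0]);                       /* parent only writes */
    write(fds[1], "hello", strlen("hello"));
    close(fds[1]);                       /* child sees EOF after this */
    wait(NULL);
    return 0;
}
```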
Example
pipesh.c
: Sets up pipelinecommand1 | command2 | command 3 ...
Lecture 5
Pipes aren’t very useful in just one process. Become useful once we fork and copy the file descriptors to the children.
void doexec(void) {
while(outcmd) {
int pipefds[2]; pipe(pipefds);
switch (fork()) {
case -1:
perror("fork"); exit(1);
case 0: // fork runs here
dup2(pipefds[1], 1); // ingress (write) end of the pipe replaces stdout
// replacing stdout with fds[1]
// cmd out goes to right side of pipe instead of stdout
// reference count is 3
close(pipefds[0]); close(pipefds[1]);
// reference count is 1
outcmd = NULL;
break;
default:
dup2(pipefds[0], 0); // replace stdin with read side of pipe; fd 0 now refers to where pipefds[0] used to point
close(pipefds[0]); close(pipefds[1]);
parse_input(&av, &outcmd, outcmd);
break;
}
}
stdout of the child is sent to stdin of the parent.
Question
Why fork?
Most calls to fork
followed by execve
Occasionally it is useful to fork without calling execve
- E.g., a server creates one process per core to serve clients
Simplicity of interface
fork
requires no arguments at all
Kernel view of process
- OS keeps a data structure for each process
- Process Control Block (PCB)
PCB |
---|
Process State |
Process ID |
User id |
Program counter |
Registers |
Address space |
Open files |
- Tracks state of process
- Running, ready, blocked
- Includes information necessary to run
- Registers, vm mappings
- Open files
- Other data about the process
- Credentials, controlling terminal, etc.
flowchart LR
new --> ready
ready --> running
waiting --> ready
running --> terminated
running --> waiting
running --> ready
Process can be in one of several states
- New and terminated at the beginning and end of a process’s life
- Running - Currently executing
- Ready - Can run, but kernel has chosen a different process to run
- Waiting - Needs async event to proceed
- Preemption (e.g., on a timer interrupt) causes the transition from running to ready
Question
Which process should kernel run?
This is a scheduling problem.
Question
Scan process table for first runnable?
This is expensive with weird priorities.
Question
FIFO/Round-Robin?
When a process is ready to run, it goes at the end of a ready queue. Take the top off once the CPU has space. Runs until it no longer can.
FIFO has no preemption. Round-robin has preemption: a process can only run for a certain amount of time (the scheduling quantum) and is added to the back of the queue once its quantum expires.
Can preempt a process when kernel gets control.
A running process can vector control to the kernel, which may put the current process to sleep or make other processes runnable.
Periodic timer interrupt. Schedule another process if the running process used up quantum.
Changing running process is called a context switch.
Lecture 6
Once a timer interrupt happens, we need a trap frame to save a snapshot of the current state on the kernel stack (to make sure we continue from the right spot).
We call mips_trap()
(tells us an interrupt happens), then go to the interrupt handler, and timer interrupt handler.
We then call a switch frame after running more stuff after mips_trap()
(stores state of process at the current time). Similar to trap frame but not the same. Can only have one switch frame per kernel stack, but can have multiple trap frames.
The Kernel stack for process 2 ended with a switch frame (we use this to continue from where we left off). Most common use of trap frame is from transition from user space to kernel (but not always). The second kernel stack “pops off” (because there is nothing left to do) to allow us to go back to user space so the thread in process 2 resumes. Process 2 keeps running (until it gets another interrupt).
If we kept getting interrupts we’d get too many trap frames. We want the interrupts to be off as long as we are handling them. The moment the interrupt handler is done running, we can turn them back on.
Context switching is very machine dependent. But usually includes:
- Save program counter and integer registers
- Save floating point or other special registers
- Save condition codes
- Change virtual address translations
Non-negligible cost
- Save/Restore floating point registers
- Optimization: only save if process used floating point
- May require flushing TLB
- Hardware Optimization 1: Don’t flush kernel’s own data from TLB
- Hardware Optimization 2: Use tag to avoid flushing any data
- Usually causes more cache misses
Lecture 7
argstart = DMPA2VA(PMap_Translate(...))
argstart
is a kernel address
Parameter ofPMap_Translate
= user space address, and PMap_Translate
translates it to a physical address.
DMPA2VA: Direct-Mapped Physical Address to Virtual Address
Threads
For each process we have code, data, files, registers, stack. A process has a single sequential flow (one thread).
A thread is a schedulable execution context.
Multithreaded processes share the address space and code, data, files. But they do not share registers and stack (execution context).
Question
Why threads?
Most popular abstraction for concurrency. Lighter-weight abstraction than processes. All threads in one process share memory, file descriptors etc.
Allows one process to use multiple CPUs or cores.
Allows program to overlap I/O and computation
- Same benefit as OS running
emacs
andgcc
simultaneously
Example
Threaded web server services clients simultaneously
Thread creation is cheaper than process creation.
Most kernels have threads.
Example
In Google Chrome, each tab is a separate process (not thread), allowing for isolated environments.
{cpp} int pthread_create(pthread_t *thr, pthread_attr_t *attr, void *(*fn) (void *), void *arg);
- Creates a new thread identified by
thr
with optional attributes, runfn
witharg
{cpp} void pthread_exit(void *return_value);
- Destroy current thread and return a pointer
{cpp} int pthread_join(pthread_t thread, void **return_value);
- Wait for
thread
to exit and receive the return value. Thread version ofwaitpid
{cpp} void pthread_yield();
- Tell the OS scheduler to run another thread or process
Plus lots of support for synchronization.
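A minimal sketch of these calls in use (the worker function and its argument are made up for the example):

```c
#include <pthread.h>
#include <stdio.h>

/* Thread function: receives an integer argument and returns it doubled. */
static void *worker(void *arg) {
    long x = (long)arg;
    return (void *)(x * 2);
}

int main(void) {
    pthread_t thr;
    void *ret;
    pthread_create(&thr, NULL, worker, (void *)21);  /* start the thread */
    pthread_join(thr, &ret);                         /* wait for it and collect the result */
    printf("worker returned %ld\n", (long)ret);
    return 0;
}
```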
One-to-one mode: kernel threads are one-to-one with user thread. Every thread has a user space component and a kernel component. The same user thread is mapped to the same kernel thread.
Can implement pthread_create
as a system call.
To add pthread_create
to an OS:
- Start with process abstraction in kernel
pthread_create
like process creation with features stripped out- Keep same address space, file table, etc., in new process
rfork/clone
syscalls actually allow individual control (creates kernel thread)
This is faster than a process, but still heavy. We still need syscalls, these are expensive.
Limitations of kernel-level threads
Every thread operation must go through kernel
- Syscall takes 100 cycles, function call 2 cycles
- Result: Threads slower when implemented in kernel
One-size fits all thread implementation
- Kernel threads must please all ‘people’
General heavy-weight memory requirements
- Other data structures designed for heavier-weight processes
Many-to-one model: one kernel thread per process, meaning multiple user threads per kernel thread.
Implementing user-level threads
- Allocate a new stack for each
pthread_create
- Keep a queue of runnable threads
- Replace blocking system calls (read/write)
- If operation would block, switch and run different thread
- Schedule a periodic timer signal (setitimer)
  - Switch to another thread on timer signals
- Example: multi-threaded web server
  - A thread calls read to get data from a remote web browser; if the call would block, the library switches to another thread
Lecture 8
Limitations of user-level threads
- Can’t take advantage of multiple CPUs or cores
- Blocking system call blocks all threads
- A page fault blocks all threads
- Possible deadlock if one thread blocks another
The many-to-one model isn’t entirely pointless. It is useful when a program is naturally multi-threaded but doesn’t need many cores; with one-to-one threads this would result in many kernel context switches.
User threads implemented on kernel threads
- Multiple kernel-level threads per process
thread_create
,thread_exit
still library functions as before
Sometimes called n:m threading
- Have n user threads per m kernel threads
Many of the same problems as n:1 threads
- Blocked threads, deadlock
Hard to keep same number of kernel threads as available CPUs
- Kernel knows how many CPUs available
- Kernel knows which kernel-level threads are blocked
- Tries to hide these things from applications for transparency
Kernel doesn’t know relative importance of threads
- Might preempt kernel thread in which library holds important lock
Lesson
- Threads are best implemented as a library
- Shouldn’t be dissuaded from using threads
The part of a concurrent program in which a shared object is accessed is called a critical section
Question
What happens if several threads try to access the same global variable or heap object at the same time?
int total = 0;
void add() {
for (int i = 0; i < N; i++) {
total++;
}
}
void sub() {
for (int i = 0; i < N; i++) {
total--;
}
}
int total = 0;
void add() {
for (int i = 0; i < N; i++) {
// total++
lw r9, 0(r8)
add r9, 1
sw r9, 0(r8)
}
}
void sub() {
for (int i = 0; i < N; i++) {
//total--;
lw r9, 0(r8)
sub r9, 1
sw r9, 0(r8)
}
}
Schedule 1:
Thread 1 runs all of add() first, then thread 2 runs all of sub(). All increments complete, then all decrements; total = 0.
Schedule 2:
Commands from add()
and sub()
execute back and forth.
Thread 1 | Thread 2 |
---|---|
lw r9, 0(r8) | |
| lw r9, 0(r8) |
add r9, 1 | |
| sub r9, 1 |
sw r9, 0(r8) | |
| sw r9, 0(r8) |
Each thread has its own context, so these are different r9’s. Thread 1 stores 1, then thread 2 stores -1. The final value of total is -1.
Schedule 3: (executing concurrently more than one core)
Thread 1 | Thread 2 |
---|---|
lw r9, 0(r8) | lw r9, 0(r8) |
add r9, 1 | sub r9, 1 |
| sw r9, 0(r8) |
sw r9, 0(r8) | |
Thread 1’s store overwrites thread 2’s result, so total = 1.
These are race conditions.
To prevent race conditions, we can enforce mutual exclusion on critical sections in the code. The critical section in the code above is total++
and total--
.
First solution attempt
int total = 0;
bool lock = false;
void add() {
for (int i = 0; i < N; i++) {
while (lock == true);
lock = true;
total++;
lock = false;
}
}
void sub() {
for (int i = 0; i < N; i++) {
while (lock == true);
lock = true;
total--;
lock = false;
}
}
This does not work: a context switch can occur between testing whether the lock is available and setting the lock, so two threads can both enter the critical section. The test and the set must be a single atomic step.
Lecture 9
Desired properties of solution
- Mutual exclusion
- Progress (if nothing is in the critical section, process trying to enter will get in)
- Bounded waiting
- Note progress vs. bounded waiting
Peterson’s solution
// Only two threads, t is current thread id (0 or 1)
int n_turn = 0;
bool w[2] = {false,false};
int total = 0;
void add() {
for (int i = 0; i < N; i++) {
w[t] = true;
n_turn = t;
while(w[1-t] && n_turn == t);
total++;
w[t] = false;
}
}
// sub is the same except total--
Question
Does Peterson’s solution work?
- Mutual exclusion - can’t both be in critical section
- Progress - If not in critical section, can’t block
- Bounded waiting - similar argument to progress
Peterson’s solution is expensive and only works for 2 processes. It can be generalized to n processes, but only for some fixed n.
The implementation requires Sequential Consistency.
Program A
int flag1 = 0, flag2 = 0
void p1(void *ignored) {
flag1 = 1;
if (!flag2) {
critical_section_1();
}
}
void p2(void *ignored) {
flag2 = 1;
if (!flag1) {
critical_section_2();
}
}
int main() {
pthread_t tid;
pthread_create(&tid, NULL, p1, NULL);
p2();
pthread_join(tid, NULL);
}
Question
Can both critical sections run?
Program B
int data = 0, ready = 0;
void p1(void *ignored) {
data = 2000;
ready = 1;
}
void p2(void *ignored) {
while (!ready)
;
use(data);
}
Question
Can
use
be called with value 0?
Program C
int a = 0, b = 0;
void p1(void *ignored) {
a = 1;
}
void p2(void *ignored) {
if (a == 1) {
b = 1;
}
}
void p3(void *ignored) {
if(b == 1) {
use(a);
}
}
Question
Can
use()
be called with value 0?
The answer for all above is “We don’t know”.
It depends on the machine that we use.
Definition
Sequential Consistency: The result of execution is as if all operations were executed in some sequential order, and the operations of each processor occurred in the order specified by the program.
Boils down to:
- Maintaining program order on individual processors
- Ensuring write atomicity
Without Sequential Consistency, multiple CPUs can be worse than preemptive threads.
Not all hardware supports sequential consistency, because SC:
- prevents re-ordering of overlapping write operations
- reduces opportunities for data prefetching
- makes cache coherence more expensive
Overall, it thwarts hardware optimizations.
x86 supports multiple consistency/caching models.
Choices include:
- WB: Write-back caching
- WT: Write-through caching
- UC: Uncacheable
- WC: Write-combining
x86 atomicity
lock
- prefix makes a memory instruction atomic
- Locks bus for duration of instruction (expensive)
- Can avoid locking if memory already exclusively cached
- All lock instructions totally ordered
- Other memory instructions cannot be re-ordered with locked ones
xchg
- exchange instruction is always locked
cmpxchg
- compare and exchange is also locked
Special fence instructions can prevent re-ordering
Lecture 10
We often reason about concurrent code assuming sequential consistency, but for low-level code it’s important to know the memory model.
Lock acquire and release with xchg
Acquire(bool *lock) {
while (Xchg(true, lock) == true);
}
Release(bool *lock) {
*lock = false; // give up lock
}
If xchg
returns true, the lock was already set, and we must continue to loop. If it returns false, the lock was free and we have now acquired it. This is known as a spin lock.
This implementation is almost correct without sequential consistency.
Can be fixed by adding a fence or by replacing *lock = false with xchg(false, lock).
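A sketch of the same spin lock written with C11 atomics (an assumption about one portable way to express it, not the course’s implementation): atomic_exchange plays the role of xchg, and its default sequentially consistent ordering supplies the fence mentioned above.

```c
#include <stdatomic.h>
#include <stdbool.h>

typedef struct {
    atomic_bool locked;              /* false = free, true = held */
} spinlock_t;

void spin_acquire(spinlock_t *l) {
    /* Atomically swap in true; if the old value was true, someone
       else holds the lock and we keep spinning. */
    while (atomic_exchange(&l->locked, true))
        ;  /* spin */
}

void spin_release(spinlock_t *l) {
    /* Atomic store (not a plain write) so the release is ordered correctly. */
    atomic_store(&l->locked, false);
}
```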
In addition to spin locks, most operating systems also provide blocking locks (a blocking lock is also known as a mutex).
Locks are used to enforce mutual exclusion
mutex_lock(lock);
// --- critical section ---
mutex_unlock(lock);
Spinlocks spin, locks block:
- A thread that calls
spinlock_acquire
spins until the lock can be acquired - A thread that calls
mutex_lock
blocks until the lock can be acquired.
Sometimes a thread will need to wait for something else
- Lock to be released by another thread
- Data from a slow device
- Input from a keyboard
When a thread blocks, it stops running:
- The scheduler chooses a new thread to run
- A context switch from the blocking thread to the new thread occurs
Eventually, a blocked thread is signaled and awakened by another thread
Wait channels are used to implement thread blocking
- Manages a list of sleeping threads
- Abstracts details of the thread scheduler
void WaitChannel_Lock(WaitChannel *wc)
- Lock wait channel operations
- Prevents a race between sleep and wake
void WaitChannel_Sleep(WaitChannel *wc)
- Blocks calling thread on wait channel
wc
void WaitChannel_WakeAll(WaitChannel *wc)
- Unblocks all threads sleeping on the wait channel
void WaitChannel_Wake(WaitChannel *wc)
- Unblocks one thread sleeping on the wait channel
Function names in this lecture all based on pthreads
int pthread_mutex_init(pthread_mutex_t *m, pthread_mutexattr_t *attr)
- Initialize a mutex
int pthread_mutex_destroy(pthread_mutex_t *m)
- Destroy a mutex
int pthread_mutex_lock(pthread_mutex_t *m)
- Acquire a mutex
int pthread_mutex_unlock(pthread_mutex_t *m)
- Release a mutex
int pthread_mutex_trylock(pthread_mutex_t *m)
- Attempt to acquire a mutex
All global data should be protected by a mutex (“global” means accessed by more than one thread, with at least one writer). The exception is initialization, before the data is exposed to other threads. This is the responsibility of the application writer.
Compiler/Runtime Contract (C, Java, Go)
Assuming no data races, the program behaves as if it were sequentially consistent.
If using mutexes properly, behaviour should be indistinguishable from Sequential Consistency.
- Responsibility of the threads package and compiler
- The mutex implementation is broken if you use it properly and still don’t see sequentially consistent behaviour.
OS kernels also need synchronization. Some mechanisms look like mutexes. Interrupts complicate things.
Lecture 11
We don’t always want to use mutex in the kernel.
Producer/Consumer model. Breaking up large amounts of work.
void producer (void *ignored) {
for (;;) {
item *nextProduced = produce_item();
while (count == BUFFER_SIZE)
    ; // do nothing (spin) - wait for consumer
buffer[in] = nextProduced;
in = (in + 1) % BUFFER_SIZE;
count++;
}
}
void consumer (void *ignored) {
for (;;) {
while (count == 0)
    ; // do nothing (spin) - nothing to consume
item *nextConsumed = buffer[out];
out = (out + 1) % BUFFER_SIZE;
count--;
consume_item(nextConsumed);
}
}
This is incorrect. count
is a global variable, we need to provide mutual exclusion.
The entire while loop is the critical section: acquire the lock above the while, release it after the count++/count--. This still doesn’t solve it: if we start with the consumer and count == 0, the consumer holds the lock and spins. The producer can’t acquire the lock, so the consumer keeps spinning forever.
Locking after the while loop still does not solve the problem.
There exists a single-instruction add. i386 allows addl $1,_count
. We can implement count++/--
with one instruction.
This is not atomic on multiprocessor. We will experience the exact same race condition. Can make atomic with lock prefix, but lock is expensive.
Need a solution to critical section problem.
Improved producer
mutex_t mutex = MUTEX_INITIALIZER;
void producer (void *ignored) {
for (;;) {
item *nextProduced = produce_item ();
mutex_lock (&mutex);
while (count == BUFFER_SIZE) {
mutex_unlock (&mutex); /* <--- Why? */
thread_yield ();
mutex_lock (&mutex);
}
buffer [in] = nextProduced;
in = (in + 1) % BUFFER_SIZE;
count++;
mutex_unlock (&mutex);
}
}
Improved consumer
void consumer (void *ignored) {
for (;;) {
mutex_lock (&mutex);
while (count == 0) {
mutex_unlock (&mutex); /* <--- Why? */
thread_yield ();
mutex_lock (&mutex);
}
item *nextConsumed = buffer[out];
out = (out + 1) % BUFFER_SIZE;
count--;
mutex_unlock(&mutex);
consume_item(nextConsumed);
}
}
This works, but it’s slow. We are spinning.
Busy-waiting in application is a bad idea. It’s better to inform scheduler of which threads can run.
This is done with condition variables.
int pthread_cond_init(pthread_cond_t *, ...)
int pthread_cond_wait(pthread_cond_t *c, pthread_mutex_t *m);
- Atomically unlock m and sleep until c is signaled
- Then re-acquire m and resume executing
int pthread_cond_signal(pthread_cond_t *c);
- Signal wakes up one thread (usually the oldest)
int pthread_cond_broadcast(pthread_cond_t *c);
- Broadcast wakes up all of them
Improved producer
mutex_t mutex = MUTEX_INITIALIZER;
cond_t nonempty = COND_INITIALIZER;
cond_t nonfull = COND_INITIALIZER;
void producer (void *ignored) {
for (;;) {
item *nextProduced = produce_item ();
mutex_lock (&mutex);
while (count == BUFFER_SIZE) {
cond_wait(&nonfull, &mutex);
}
buffer [in] = nextProduced;
in = (in + 1) % BUFFER_SIZE;
count++;
cond_signal(&nonempty); // wake up threads waiting
mutex_unlock (&mutex);
}
}
Improved consumer
void consumer (void *ignored) {
for (;;) {
mutex_lock (&mutex);
while (count == 0) {
cond_wait(&nonempty, &mutex);
}
item *nextConsumed = buffer[out];
out = (out + 1) % BUFFER_SIZE;
count--;
cond_signal(&nonfull);
mutex_unlock(&mutex);
consume_item(nextConsumed);
}
}
Always re-check condition on wake-up
Otherwise it breaks with spurious wakeup or two consumers.
Question
Why must
cond_wait
both release mutex and sleep? Why not separate mutexes and condition variables?
We can end up stuck waiting due to a bad interleaving: if the waiter unlocks the mutex and then (separately) calls cond_wait, a context switch can occur in between, the other thread’s signal is missed, and the waiting thread may be blocked forever.
Semaphores (Dijkstra)
A Semaphore is initialized with an integer
int sem_init(sem_t *s, ..., unsigned int n);
Provides two functions:
sem_wait(sem_t *s)
sem_post(sem_t *s)
sem_wait will return only n more times than sem_post has been called, where n is the value the semaphore was initialized with.
Linux primarily uses semaphores for sleeping locks. Provides elegant solutions for some problems.
Semaphore producer/consumer
Initialize full
to 0 (block consumer when buffer empty)
Initialize empty to BUFFER_SIZE (block producer when queue full)
void producer(void *ignored) {
for (;;) {
item *nextProduced = produce_item ();
sem_wait(&empty);
buffer [in] = nextProduced;
in = (in + 1) % BUFFER_SIZE;
sem_post(&full);
}
}
void consumer(void *ignored) {
for (;;) {
sem_wait(&full);
item *nextConsumed = buffer[out];
out = (out + 1) % BUFFER_SIZE;
sem_post(&empty); // reate new empty block
consume_item(nextConsumed);
}
}
We build a semaphore using a combination of spinlocks and wait channels.
Lecture 12
A semaphore can be initialized to 0, which blocks callers until a post occurs. Binary semaphores (initialized to 1) are similar to mutexes. A semaphore initialized to an integer n allows up to n threads to use a resource concurrently.
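A small illustration using POSIX semaphores rather than the course’s Semaphore_Wait/Semaphore_Post API; the thread and slot counts are arbitrary. Initializing the semaphore to 3 lets at most 3 threads use the “resource” at once.

```c
#include <pthread.h>
#include <semaphore.h>
#include <stdio.h>
#include <unistd.h>

#define NTHREADS 8
#define NSLOTS   3                     /* at most 3 threads in the resource at once */

static sem_t slots;

static void *worker(void *arg) {
    sem_wait(&slots);                  /* blocks once NSLOTS threads are inside */
    printf("thread %ld using resource\n", (long)arg);
    sleep(1);                          /* pretend to use the resource */
    sem_post(&slots);                  /* free a slot for another waiter */
    return NULL;
}

int main(void) {
    pthread_t t[NTHREADS];
    sem_init(&slots, 0, NSLOTS);       /* counting semaphore, initial value 3 */
    for (long i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, worker, (void *)i);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    sem_destroy(&slots);
    return 0;
}
```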
Semaphore implementation
Semaphore_Wait(Semaphore *sem) {
Spinlock_Lock(&sem->sem_lock);
while (sem->sem_count == 0) {
// lock
WaitChannel_Lock(sem->sem_wchan);
// release spinlock
Spinlock_Unlock(&sem->sem_lock);
// wait channel protected by own lock
WaitChannel_Sleep(sem->sem_wchan);
// no locks held
Spinlock_Lock(&sem->sem_lock);
}
sem->sem_count--;
Spinlock_Unlock(&sem->sem_lock);
}
Semaphore_Post(Semaphore *sem) {
Spinlock_Lock(&sem->sem_lock);
sem->sem_count++;
WaitChannel_Wake(sem->sem_wchan);
Spinlock_Unlock(&sem->sem_lock);
}
If we released the spinlock before calling WaitChannel_Lock, there would be a race condition: a post could call wake before the waiter goes to sleep, so the wakeup would find nobody and the waiter would sleep forever.
Midterm Hint:
- If code was changed around, would it still work? What would the schedule look like (order of operations between threads).
Question
Does swapping the last two lines work?
There’s a chance.
Hand-over-hand locking allows for fine-grained locking
Useful for concurrent data structure manipulation
- Hold at most two locks: the previous and next locks
- Locks must be ordered in a sequence
Example
We have locks A, B, C
flowchart LR
A --> B
B --> C
Locking the entire list would make other threads wait unnecessarily. Hence hand-over-hand locking: we have one lock per entry.
If we want to preserve traversal ordering, normal locking and unlocking doesn’t work. This can lead to race conditions.
lock(A)
// Operate on A
lock(B)
unlock(A)
// operate on B
lock(C)
unlock(B)
// Operate on C
unlock(C)
This is why we interweave the locking. This preserves the correct traversal since we are locking the next entry before unlocking the current entry.
The semaphore implementation uses hand-over-hand locking.
Monitors: Another synchronization primitive.
Programming language construct
- Less error prone than raw mutexes
- A class where only one procedure executes at a time
public class Statistics {
private int counter;
public synchronized int get() { return counter; }
public synchronized void inc() { counter++; }
}
synchronized
key word automatically locks object.
Can implement mutex with monitor or vice versa.
Benign Races
In some cases we are okay with races. Sometimes “cheating” buys efficiency.
If we care more about speed than accuracy then this is okay.
if (!initialized) {
lock(m);
if (!initialized) { initialize(); initialized = 1; }
unlock(m);
}
Usually not recommended.
Lecture 13
Tools exist to detect data races automatically (though they are not very good).
One such tool is Eraser. For each shared variable it keeps track of the set of locks held on every access (the lockset). On each access, it removes from the lockset any locks not currently held. If the lockset becomes empty, it reports a potential race condition.
Deadlock Problem
mutex_t m1, m2;
void f1(void *ignored) {
lock(m1);
lock(m2);
// cs
unlock(m2);
unlock(m1);
}
void f2 (void *ignored) {
lock(m2);
lock(m1);
// cs
unlock(m1);
unlock(m2);
}
Thread 1 acquires m1, then is preempted. Context switch to thread 2 (f2): it acquires m2, but blocks when trying to acquire m1. Context switch back to f1: it tries to acquire m2 and blocks. Both threads are stuck forever.
Can only unlock something if we own the lock.
Examples of deadlocks without locks
funcA()
lock(mutex)
while (x == 0) {
cond_wait(cond_x, mutex) // waits on condx, releases mutex
}
y++
unlock(mutex)
funcB()
lock(mutex)
while (y == 0) {
cond_wait(cond_y, mutex)
}
x++
unlock(mutex)
If x and y are both 0 at the same time, both threads sleep forever. However, a third thread could fix this by setting x or y and signalling.
Deadlock with combined mutex/condition variable deadlock. Need to be careful when using library functions.
mutex_t a, b;
cond_t c;
lock(a); lock(b); while (!ready) wait(b, c); // releases b, waiting on cond c
unlock(b); unlock(a);
lock(a); lock(b); ready = true; signal(c);
unlock(b); unlock(a);
Danger
Dangerous to hold locks when crossing abstraction barriers.
Deadlock conditions (all 4 must be true in order for a deadlock to occur)
- Limited access (mutual exclusion)
- No locks means no deadlocks (but means race conditions, just as bad as deadlock). Usually can’t eliminate this condition
- No preemption
- Once resource granted, can’t be taken away. Eliminates deadlocks, but there’s a chance we get a livelock. Everyone is actively doing work, but none of it is useful.
- Multiple independent requests
- Don’t ask all at once
- Circularity in graph of requests
- This is the condition we try to remove
Two approaches to dealing with deadlock
- Pro-active: prevention
- Reactive: detection corrective action
Can view the system as a graph. Processes and resources are nodes. Resource requests and assignments are edges.
If the graph has no cycles, there is no deadlock.
If the graph contains a cycle: definitely a deadlock if there is only one instance per resource; otherwise, maybe a deadlock, maybe not.
Lecture 14
Amdahl’s Law
- T_1: The time one core takes to complete the task
- s: The fraction of the job that must be serial (not parallelizable)
- n: The number of cores (suppose this is unbounded)
Time on n cores: T_n = s * T_1 + (1 - s) * T_1 / n, so speedup = T_1 / T_n = 1 / (s + (1 - s)/n). As n grows without bound, the speedup approaches 1/s.
Amdahl’s law places an ultimate limit on parallel speedup.
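As a quick worked check (the 10% figure is an assumed value, not from the notes): if s = 0.1, then with n = 4 cores the speedup is 1 / (0.1 + 0.9/4) ≈ 3.1x, and even as n → ∞ the speedup can never exceed 1 / 0.1 = 10x.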
Cache Coherence Protocols
Each core has its own cache. Creates opportunity for the caches to disagree about memory.
Caches leverage temporal locality. Pulling from memory to cache can boost performance. Also spatial locality since we are moving cache lines (usually 64 bytes).
In a cache coherent system, all cached copies of a region of memory must be the same. Without cache coherence, it is possible for a CPU to read stale data.
Snoopy MSI Protocol is a simple cache coherence protocol.
Each cache line is in one of three states. Each core snoops on the memory bus to determine if it needs to update the state of its cache lines.
Modified
- Cache line has been updated, other caches must be in invalid state
- Data in cache line is different than what is in memory
- Must write data back to memory when transitioning out of this state
Shared
- One or more caches (and memory) have a valid copy
Invalid
- Doesn’t contain any data
Operations that are performed in response to a processor request.
stateDiagram-v2
S --> M: PrWr/BusUpgr
I --> M: PrWr/BusRdX
I --> S: PrRd/BusRd
M --> M: PrRd/-, PrWr/-
S --> S: PrRd/-
Operations that are performed in response to a processor seeing a bus request
stateDiagram-v2
M --> S: BusRd/Flush
M --> I: BusRdX/Flush
S --> I: BusRdX/-, BusUpgr/-
S --> S: BusRd/-
I --> I: BusRd/-, BusRdX/-, BusUpgr/-
These states are per cache line.
CPU cycles to access cache/memory (Intel Xeon):
- L1: 3 cycles, L2: 11 cycles, LLC (last level cache): 44 cycles, RAM: 355 cycles
If a remote core has data cached in the modified state
- Load: 109 cycles
- Store: 115 cycles
Implications for multithreaded design
- Avoid false sharing
- Data is shared in cache line chunks
- Avoid placing data used by the different threads in the same cache line
- Align structures to cache lines
- Place related data you need to access together
- Pad data structures
- Arrays of structures lead to false sharing
- Add unused fields to ensure alignment
- Avoid contending on cache lines
- Reduce costly cache coherence traffic
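A sketch of the padding/alignment advice above, assuming 64-byte cache lines and hypothetical per-thread counters:

```c
#include <stdalign.h>

/* Without padding, several counters would share one 64-byte cache line and
   updates from different cores would bounce that line back and forth
   (false sharing). Giving each counter its own cache line avoids this. */
struct padded_counter {
    alignas(64) long count;            /* starts on its own cache line */
    char pad[64 - sizeof(long)];       /* keeps the next counter off this line */
};

static struct padded_counter counters[8];  /* e.g., one counter per thread */
```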
Problem: twothreads
Assume
. Add at the end to account for the first thread.
Now assume that
Lecture 15
CPU accesses physical memory over a memory bus. Some devices are connected to the system via a general I/O bus. Slow devices are connected to a peripheral bus.
Device drivers communicate with devices by accessing/modifying their registers.
Question
How can a device driver access device registers?
Option 1: Special I/O instructions
- Such as in and out instructions on x86
- Device registers are assigned port numbers
- Instructions transfer data between a specified port and a CPU register
Option 2: Memory-mapped I/O
- Each device register has a physical memory address
- Device drivers can read from or write to device registers using normal load and store instructions, as though accessing memory.
A device driver is a part of the kernel that interacts with a device. Example: write character to serial output device
// only one writer at a time
P(output device write semaphore)
// trigger the write operation
write character to device data register
repeat {
read writeIRQ register
} until status is "completed"
write writeIRQ register to ack completion
V(output device write semaphore)
This example illustrates polling: The kernel driver repeatedly checks device status.
We can use interrupts to avoid polling.
Device driver read handler:
// only one disk request at a time
P(disk semaphore)
write target sector number to disk sector number register
write "read" command to disk status register
// wait for request to complete
P(disk completion semaphore)
copy data from device transfer buffer to memory
V(disk semaphore)
Interrupt handler for disk device:
// make the device ready again
write disk status register to ack completion
V(disk completion semaphore)
Direct Memory Access (DMA)
DMA is used for block data transfers between devices (e.g. a disk) and memory. Under DMA, the CPU initiates the data transfer and is notified when the transfer is finished. However, the device (not the CPU) controls the transfer itself.
- CPU issues DMA request to device
- Device directs data transfer
- Device interrupts CPU on completion
Anatomy of a disk
- Stack of magnetic platters
- Disk arm assembly
- Arms rotate around pivot, all move together
- Arms contain disk heads-one for each recording surface
- Heads read and write data to platters
Platters are divided into concentric tracks. A stack of tracks of fixed radius is a cylinder. Heads record and sense data along cylinders. A significant fraction of the encoded stream is used for error correction.
Generally only one head active at a time. Disks usually have one set of read-write circuitry.
Logical view of a disk drive
- Disk is an array of numbered sectors
- Each sector is the same size
- Sectors are the unit of transfer between the disk and memory
- Storage is non-volatile, i.e., data persists even when the device is without power
Disk positioning system
- Move head to specific track and keep it there
- Resists physical shocks
- A seek consists of up to four phases:
- Speedup: Accelerate arm to max speed or half way point
- Coast: At max speed (for long seeks)
- Slowdown: Stops arm near destination
- Settle: Adjusts head to actual desired track
- Very short seeks dominated by settle time ( ms)
- Short (200-400 cyl.) seeks dominated by speedup
- Accelerations of
- Average seek time is roughly one-third of the full seek time
Moving data to/from a disk involves:
- Seek time: move the read/write heads to the appropriate cylinder
- Rotational latency: wait until the desired sectors spin to the read/write heads
- Transfer time: wait while the desired sectors spin past the read/write heads
Request service time is the sum of seek time, rotational latency, and transfer time.
Seek Time:
- If the next request is for data on the same track as the previous request, seek distance and seek time will be zero
- In the worst-case, e.g., seek from outermost track to innermost track, seek time may be 10 milliseconds or more.
Rotational latency:
- Consider a disk that spins at 12,000 RPM
- One complete rotation takes 5 milliseconds
- Rotational latency ranges from 0 ms to 5 ms
Transfer time:
- Once positioned, the 12,000 RPM disk can read/write all data on a track in one rotation (5 ms)
- If only p percent of the track’s sectors are being read/written, transfer time will be p percent of the complete rotation time (5 ms).
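A rough worked example combining the three components (the request itself is hypothetical): a request needing a 5 ms seek, an average rotational latency of 2.5 ms (half of a 5 ms rotation), and a transfer of 10 percent of a track (0.5 ms) has a service time of about 5 + 2.5 + 0.5 = 8 ms.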
Lecture 16
The placement on disk is very important. If randomized, every sector might be on a different track. Most of the time will be spent seeking rather than transferring data. Random I/O is terrible (same concept from CS348).
We try to achieve contiguous accesses where possible. Try to order requests to minimize seek times.
I/O Scheduling Policies:
First Come First Served (FCFS): Process disk requests in the order they are received
Advantages
- Easy to implement
- Good fairness
Disadvantages
- Cannot exploit request locality (long seek time)
- Increases average latency, decreasing throughput
Shortest Positioning Time First (SPTF): Always pick request with shortest seek time. Also referred to as Shortest Seek Time First.
Advantages
- Exploits locality of disk requests
- Higher throughput
Disadvantages
- Starvation (one outlier neglected)
Question
How do we fix the starvation?
Give older requests higher priority: Aged SPTF. Adjust the effective positioning time with a weighting factor: T_eff = T_pos - W * T_wait.
Can be modelled as an elevator problem.
Elevator scheduling (SCAN): Sweep across the disk, serving all requests passed. Like SPTF, but the next seek must be in the same direction. Switch directions only if there are no further requests in the current direction.
Advantages
- Takes advantage of locality
- Bounded waiting
Disadvantages
- Cylinders in the middle get better service
- Might miss locality SPTF could exploit
CSCAN: Only sweep in one direction
VSCAN(r): Continuum between SPTF and SCAN. Similar to SPTF, but slightly changes the “effective” positioning time. If the request is in the same direction as the previous seek: T_eff = T_pos. Otherwise: T_eff = T_pos + r * T_max.
When r = 0 we get SPTF; when r = 1 we get SCAN. r = 0.2 works well.
Advantages and disadvantages are those of SPTF and SCAN, depending on how r is set.
Now, people increasingly use flash memory/solid state drives.
Completely solid state (no moving parts).
DRAM: Requires constant power to keep values
- Transistors with capacitors
- Capacitor holds microsecond charge; periodically refreshed by primary power
Flash Memory: Traps electrons in quantum cage using floating gate transistors.
Instead of sectors, SSD arranges data into pages and blocks.
- Pages are typically 2, 4, or 8 KB, and are grouped into 32 KB - 4 MB blocks
We perform operations (read/writes) at the page level. Pages are initialized to 1s; can transition from 1 to 0 at page level (write new page). A high voltage is required to switch from 0 to 1 (overwrite/delete). Cannot apply high voltage at page level, only to blocks.
Writing and deleting from flash memory. Naïve solution is to read whole block into memory. Re-initialize block. Update block in memory; write back to SSD.
Translation layer solution: Mark page to be deleted/overwritten as invalid, write to an unused page, update translation table, requires garbage collection.
SSDs are not impervious. Blocks have a limited number of write cycles. Repeated high voltage will result in wear on the SSD. If a block is no longer writeable; it becomes read-only. When a certain number of blocks are read-only; the disk becomes read-only.
SSD controller wear-levels; ensuring that write cycles are evenly spread across all blocks.
Lecture 17
Most of lecture is midterm take up which I can’t post here.
File systems: traditionally the hardest part of OS.
Main tasks:
- Don’t go away
- Associate bytes with names
- Associate names with each other (directories)
Files are persistent, named data objects
- Data consists of a sequence of numbered bytes
- File may change size over time
- File has associated meta-data
File systems are the data structures and algorithms used to store, retrieve, and access files. Translating name and offset to disk blocks.
File Interface: Basics
open
- Returns a file identifier (or handle or descriptor), which is used in subsequent operations to identify the file.
- Other operations require file descriptor as a parameter.
close
- Kernel tracks which file descriptors are currently valid for each process.
close
invalidates the file descriptor.
read
,write
, seek
read
copies data from a file into a virtual address space. Returns the number of bytes read.write
copies data from a virtual address space into a file.seek
enables non-sequential reading/writing.
Each file descriptor (open file) has an associated file position. The position starts at byte 0 when the file is opened.
Read and write operations start from the current file position and update the current file position as bytes are read/written. This makes sequential file I/O easy for an application to request.
Seeks (lseek
) are used to achieve non-sequential I/O. It changes the file position associated with a descriptor.
Sequential file reading example
char buf[512];
int i;
int f = open("myfile", O_RDONLY);
for (i = 0; i < 100; i++) {
read(f, (void *)buf, 512);
}
close(f);
Read the first 51,200 bytes of a file, 512 bytes at a time.
Read using seek in reverse order example.
char buf[512];
int i;
int f = open("myfile", O_RDONLY);
for (i = 1; i <= 100; i++) {
lseek(f, (100-i)*512, SEEK_SET);
read(f, (void *)buf, 512);
}
close(f);
Read the first 51,200 bytes of a file, 512 bytes at a time, in reverse order.
A directory maps file names (strings) to i-numbers. An i-number is a unique (within a file system) identifier for a file or directory. Given an i-number, the file system can find the data and meta-data for the file.
Directories provide a way for applications to group related files. Since directories can be nested, a filesystem’s directories can be viewed as a tree, with a single root directory. In a directory tree, files are leaves.
Files may be identified by pathnames, which describe a path through the directory tree from the root directory to the file.
Directories also have pathnames. Applications refer to files using pathnames, not i-numbers.
Lecture 18
A hard link is an association between a name (string) and an i-number. Each entry in a directory is a hard link.
When a file is created, so is a hard link to that file. Once a file is created, additional hard links can be made to it. For example, link(/docs/a.txt,/foo/myA)
creates a new hard link myA
in directory /foo
.
Linking to an existing file creates a new pathname for that file. A file has a unique i-number but may have multiple pathnames. Not possible to link to a directory.
Hard links can be removed. When the last hard link to a file is removed, the file is also removed. Since there are no links, there is no pathname, and the file cannot be opened.
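A minimal sketch of the calls involved, reusing the pathnames from the example above (error handling limited to perror):

```c
#include <stdio.h>
#include <unistd.h>

int main(void) {
    /* Create a second hard link (a second pathname) for an existing file. */
    if (link("/docs/a.txt", "/foo/myA") != 0)
        perror("link");

    /* Remove one link; the file stays reachable through /foo/myA.
       Its data is freed only when the last hard link is removed. */
    if (unlink("/docs/a.txt") != 0)
        perror("unlink");
    return 0;
}
```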
It is not uncommon for a system to have multiple file systems. Some kind of global file namespace is required. DOS/Windows uses two-part file names. The file system name, and the pathname within the file system.
Unix: Creates a single hierarchical namespace that combines the namespace of two file systems.
Mounting does not make two file systems into one file system, it creates a single hierarchical namespace that combines the namespaces of two file systems.
Question
What needs to be stored persistently?
- File data
- File meta-data
- Directories and links
- File system meta-data
Non-persistent information
- Open files per process
- File position for each open file
- Cached copies of persistent data
We will use an extremely small disk as an example (256 KB disk).
Most disks have a sector size of 512 bytes. 512 total sectors on this disk.
Group every 8 consecutive sectors into a block. Better spatial locality and reduces the number of block pointers.
VSFS: Very Simple File System
Most of the blocks should be for storing user data (last 56 blocks).
Need some way to map files to data blocks. We create an array of i-nodes, where each i-node contains the meta-data for a file. The index into the array is the file’s index number (i-number).
Assume each i-node is 256 bytes, and we dedicate 5 blocks for i-nodes. This allows for 80 total i-nodes/files.
We also need to know which i-nodes and blocks are unused. There are many ways to do this, but in VSFS we use bitmaps.
A bitmap in a 4 KB block can track 32K i-nodes or 32K blocks. This is more than we actually need.
We store an i-node bitmap as well as a data-block bitmap.
We reserve the first block as the superblock. A superblock contains meta-information about the entire file system.
An i-node is a fixed-size index structure that holds both file meta-data and a small number of pointers to data blocks.
Assume disk blocks can be referenced with a 4-byte address. In VSFS, an i-node is 256 bytes. Assume there is enough room for 12 direct pointers to blocks. Each pointer points to a different block for storing user data. Pointers are ordered: the first pointer points to the first block in the file.
Question
What is the maximum file size if we only have direct pointers?
12 x 4 KB = 48 KB
In addition to 12 direct pointers, we can also introduce an indirect pointer. This pointer points to a block full of direct pointers.
A 4 KB block of direct pointers holds 1024 pointers. Maximum file size is: (12 + 1024) x 4 KB = 4144 KB
What if the disk were larger? We can add a double indirect pointer.
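As a worked continuation under the same assumptions (4 KB blocks, 4-byte block addresses): the double indirect pointer reaches a block of 1024 indirect pointers, each of which points to a block of 1024 direct pointers, so the maximum file size becomes (12 + 1024 + 1024 x 1024) x 4 KB ≈ 4 GB.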
Lecture 19
File System Design
File system parameters:
- How many i-nodes should a file system have?
- How many direct and indirect blocks should an i-node have?
- What is the “right” block size?
For a general purpose file system, design it to be efficient for the common case.
- Most files are small
- Average file size is growing
- File systems contain lots of files
- Directories are typically small
- A few big files use most of the space
In-Memory (Non-persistent) Structures
Per process
- Which file descriptors does this process have open?
- To which file does each open descriptor refer?
- What is the current file position for each descriptor?
System wide
- Open file table: Which files are currently open
- i-node cache: In-memory copies of recently-used i-nodes
- Block cache: In-memory copies of data blocks and indirect blocks
Reading from a File
First read the root i-node
- At a “well known” position (i-node 2)
- i-node 1 is usually reserved for tracking bad blocks
Reading /foo/bar touches the on-disk structures (data bitmap, i-node bitmap, root/foo/bar i-nodes, root/foo data, bar data[0..2]) in this order:

Operation | Disk accesses (in order) |
---|---|
open(bar) | read root inode; read root data; read foo inode; read foo data; read bar inode |
read() | read bar inode; read bar data[0]; write bar inode |
read() | read bar inode; read bar data[1]; write bar inode |
read() | read bar inode; read bar data[2]; write bar inode |
Then read the directory information from root’s data block. Find the i-number for foo and read the foo i-node. We then find the i-number for bar and read the bar i-node.
Permission check (is the user allowed to read this file?). We allocate a file descriptor in the per-process descriptor table. Increment the counter for this entry in the global open file table.
Find the block using a direct/indirect pointer and read the data. Update the i-node with a new access time. Update the file position. Closing a file deallocates the file descriptor and decrements the counter for its entry in the global open file table.
Creating and then writing /foo/bar touches the same structures as follows:

Operation | Disk accesses (in order) |
---|---|
create(/foo/bar) | read root inode; read root data; read foo inode; read foo data; read inode bitmap; write inode bitmap; write foo data; read bar inode; write bar inode; write foo inode |
write() | read bar inode; read data bitmap; write data bitmap; write bar data[0]; write bar inode |
write() | read bar inode; read data bitmap; write data bitmap; write bar data[1]; write bar inode |
write() | read bar inode; read data bitmap; write data bitmap; write bar data[2]; write bar inode |
Lecture 20
When a file is opened, an entry is created in the per-process file descriptor table and in the global open file table. The reference count, offset, and i-node are stored in the open file table.
When a parent forks, the reference count in the open file table increases, because there are now two file descriptors pointing to the same entry.
Chaining is similar to a linked list: store a pointer to the next block in the current block. External chaining puts these pointers in an external chain. Advantage of this approach: faster, since the chain can be cached in memory rather than read from disk.
A single local file system operation may require several disk I/O operations.
Example
Deleting a file (this is not atomic)
- Remove entry from directory (remove hard link)
- Remove the file index (i-node) from the i-node table
- Mark file’s data blocks free in free space index.
Question
What if, because of a failure, not all of these changes are reflected on disk?
System failure will destroy in-memory file system structures.
Persistent structures should be crash consistent. They should be consistent when system restarts after a failure.
Another more pressing example, moving files. Moving a file from /foo/a.txt
to /bar/a.txt
- Update foo’s data block
- Update foo’s i-node
- Update bar’s data block
- Update bar’s i-node
If power is cut off after step 2 and before step 3, the only hard link to the file is gone. The data is still on the disk but cannot be located. The move has become a bad delete (the file still has a link count of 1, but no directory entry references it).
We can flip the ordering
- Update bar’s i-node
- Update bar’s data block
- Update foo’s data block
- Update foo’s i-node
Now if it breaks after step 2, our move has become a copy. The user will probably want to delete the file from foo, but this is dangerous, since we never updated the link count of a.txt’s i-node (it is 1, not 2). Deleting the foo entry therefore removes the entire file.
Question
Why not just update the reference / hard link count before and after?
There is overhead; the extra updates aren’t necessary if the system never crashes.
Fault tolerance:
- Special-purpose consistency checkers
- Runs after a crash, before normal operations resume
- Find and attempt to repair inconsistent file system data structures (i.e. file with no directory entry, free space that is not marked as free)
This can take an extremely long time.
- Journaling
- Record file system meta-data changes in a journal, so sequences of changes can be written to disk in a single operation
- After changes have been journaled, update disk data structures
- After a failure, redo journaled updates in case they were not done before the failure
- We use checkpoints to remove old logs
- We can do this because updating blocks is idempotent
Every process has its own virtual address space. Great from abstraction perspective.
The amount of physical memory is determined by the physical address size. If the physical address is 18 bits, there are 2^18 bytes = 256 KB of total physical memory.
If physical addresses have P bits, the maximum amount of addressable physical memory is 2^P bytes.
The actual amount of physical memory on a machine may be less than the maximum amount that can be addressed.
Size of virtual address space does not have to be the same size as the physical address space.
If virtual addresses are 16 bits, the maximum virtual memory size is 2^16 bytes = 64 KB.
Question
How do we map virtual memory to physical memory?
The kernel provides a separate, private virtual memory for each process. Virtual memory of a process holds the code, data, and stack for the program that is running in that process.
If virtual addresses are V bits, the maximum size of a virtual memory is 2^V bytes.
Running applications see only virtual addresses. The program counter and stack pointer hold virtual addresses. Pointers to variables are virtual addresses. Jumps and branches refer to virtual addresses.
Each process is isolated in its virtual memory, and cannot access other processes’ virtual memories.
Each virtual memory is mapped to a different part of physical memory. When a process tries to access (load or store) a virtual address, the virtual address is translated (mapped) to its corresponding physical address, and the load or store is performed in physical memory.
Address translation is performed in hardware, using information provided by the kernel.
Dynamic relocation keeps track of an offset and a limit. The offset tells us where the virtual memory is mapped in physical memory, and the limit tells us the size of the virtual memory we are mapping.
The CPU includes a memory management unit (MMU), with a relocation register and a limit register.
- Relocation register holds the physical offset (R) for the running process's virtual memory
- Limit register holds the size (L) of the running process's virtual memory
To translate a virtual address v to a physical address p:

if v >= L then generate exception
else
    p = v + R
Translation is done in hardware by MMU.
Lecture 21
The kernel maintains a separate R and L for each process, and changes the values in the MMU registers when there is a context switch between processes.
Each virtual address space corresponds to a contiguous range of physical addresses.
The kernel is responsible for deciding where each virtual address space should map in physical memory. The OS must track which parts of physical memory are in use, and which parts are free. OS must allocate/deallocate variable-sized chunks of physical memory.
This creates the potential for fragmentation of physical memory.
There are two types of fragmentation. The above is external fragmentation, but there also exists internal fragmentation.
Example
Dynamic Relocation Examples
Process :
Limit Register (L): 0x0000 7000
Relocation Register (R): 0x0002 4000
v = 0x102c
Since v
is less than L, we add R: p = 0x2502c
v = 0x8800
p = Exception (since 0x8800 >= L)
v = 0x0000
p=0x24000
Paging: Physical Memory
We refer to units of physical memory as frames (rather than physical pages), and units of virtual memory as pages.
A page is the same size as a frame, e.g., each frame is 4KB.
Each page maps to a different frame. Any page can map to any frame.
4KB is the smallest amount we can allocate. If we only need 1 byte, we still must assign at least 4KB. This is internal fragmentation. We cannot safely reassign this empty space to something else.
There is a tradeoff between internal and external fragmentation.
Question
Why not smaller page sizes?
Smaller pages mean more pages, and therefore larger page tables and more translation overhead.
An alternative approach to having many registers is to store the relocation values in memory (as a page table).
Separate page table for every single process
Process Page Table
Page | Frame | Valid? |
---|---|---|
0x0 | 0x0f | 1 |
0x1 | 0x26 | 1 |
0x2 | 0x27 | 1 |
0x3 | 0x28 | 1 |
Valid tells us if memory has been allocated or not.
The MMU includes a page table base register which points to the page table for the current process.
How the MMU translates a virtual address:
- Determines the page number and offset of the virtual address
- Page number is virtual address divided by the page size
- Offset is the virtual address modulo the page size (the byte we are interested in)
- Looks up the page's page table entry (PTE) in the current process's page table, using the page number
- If the PTE is not valid, raise an exception
- Otherwise, combine page’s frame number from the PTE with the offset to determine the physical address
- Physical address is (frame number × frame size) + offset
Suppose a process's virtual address is 16 bits.
The last 12 bits are the offset; the first 4 bits are the page number.
Only the first 4 bits (the page number) are sent to the page table lookup to determine the frame number for the physical address. In this example the 4-bit page number becomes a 6-bit frame number. The last 12 bits (the offset) are the same in the virtual address and the physical address, because a frame is the same size as a page. A sketch of this translation is shown below.
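Here is a minimal C sketch of this single-level translation, assuming a 16-entry page table, 4KB pages, and a hypothetical `struct pte`:

```c
#include <stdint.h>

#define PAGE_BITS 12                     /* 4KB pages: 12-bit offset */
#define PAGE_SIZE (1u << PAGE_BITS)

/* One page table entry: frame number plus a valid bit (a sketch; real
 * PTEs also carry protection, reference and dirty bits). */
struct pte { uint32_t frame; uint8_t valid; };

/* Translate a 16-bit virtual address using a 16-entry page table
 * (4-bit page number, 12-bit offset). Returns 0 on success. */
int translate(const struct pte table[16], uint16_t v, uint32_t *p)
{
    uint32_t page   = v >> PAGE_BITS;        /* v / page size   */
    uint32_t offset = v & (PAGE_SIZE - 1);   /* v mod page size */

    if (!table[page].valid)
        return -1;                           /* raise exception */

    *p = (table[page].frame << PAGE_BITS) | offset; /* frame*size + offset */
    return 0;
}
```

For example, with the table above, v = 0x102c gives page 0x1, frame 0x26, and p = 0x2602c.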
Example
Address Translation
Page | Frame | Valid? |
---|---|---|
0x0 | 0x0f | 1 |
0x1 | 0x26 | 1 |
0x2 | 0x27 | 1 |
0x3 | 0x28 | 1 |
0x4 | 0x11 | 1 |
0x5 | 0x12 | 1 |
0x6 | 0x13 | 1 |
0x7 | 0x00 | 0 |
0x8 | 0x00 | 0 |
The last 3 hex digits are the offset; the first hex digit is the page number.
v = 0x102c
Page number is 0x1
Frame 0x26
with offset of 02c
p = 0x2602c
v = 0x9800
Page number is 0x9
PTE is not valid
p = Exception
v = 0x0024
Page number is 0x0
Frame 0x0f
with offset 024
p = 0x0f024
PTEs may contain other fields, in addition to the frame number and valid bit.
Example 1: Write protection bit
- Can be set by the kernel to indicate that a page is read-only
- If a write operation to memory uses a virtual address on a read-only page, the MMU will raise an exception when it translates the virtual address
Example 2: Bits to track page usage
- Reference (use) bit: has the process used this page recently?
- Dirty bit: have contents of this page been changed?
- These bits are set by the MMU, and read by the kernel
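For illustration, a PTE with these fields might look like the struct below; the exact layout and bit positions are invented for this sketch, not taken from a real MMU:

```c
#include <stdint.h>

/* A sketch of a 32-bit PTE carrying the fields discussed above.
 * The bit widths and positions are made up for illustration. */
struct pte {
    uint32_t frame      : 20;  /* frame number                        */
    uint32_t valid      : 1;   /* is this page mapped?                */
    uint32_t writeable  : 1;   /* clear => read-only page             */
    uint32_t referenced : 1;   /* set by MMU when the page is used    */
    uint32_t dirty      : 1;   /* set by MMU when the page is written */
    uint32_t unused     : 8;
};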
A page table has one PTE for each page in the virtual memory
- Page table size = (number of pages) × (size of PTE)
- Number of pages = (virtual memory size) / (page size)
For example, the page table for 64KB of virtual memory with 4KB pages has 16 PTEs and is 64 bytes, assuming 32 bits (4 bytes) for each PTE. Page tables for larger virtual memories are larger.
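A quick way to check this arithmetic (the constants mirror the example above):

```c
#include <stdio.h>

int main(void)
{
    unsigned long vm_size   = 64UL * 1024;  /* 64KB virtual memory */
    unsigned long page_size = 4UL  * 1024;  /* 4KB pages           */
    unsigned long pte_size  = 4;            /* 32-bit PTEs         */

    unsigned long pages      = vm_size / page_size;  /* 16 pages  */
    unsigned long table_size = pages * pte_size;     /* 64 bytes  */

    printf("%lu pages, page table is %lu bytes\n", pages, table_size);
    return 0;
}
```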
Page tables are kernel data structures.
Kernel:
- Manage MMU registers on address space switches
- Create and manage page tables
- Manage physical memory
- Handle exceptions raised by the MMU
MMU:
- Translate virtual addresses to physical addresses
- Check for and raise exceptions when necessary
Execution of each machine instruction may involve one, two, or more memory operations: one to fetch the instruction, and one or more for its operands.
Address translation through a page table adds one extra memory operation (for PTE lookup) for each memory operation performed during instruction execution.
This can be slow.
Solution: Include a Translation Lookaside Buffer (TLB) in the MMU
- TLB is a small, fast, dedicated cache of address translations, in the MMU
- Each TLB entry stores a (page, frame) mapping
What the MMU does to translate a virtual address on page p:
if there is an entry (p, f) in the TLB then
return f // TLB hit
else
find p's frame number (f) from the page table
add (p,f) to the TLB, evicting another entry if full
return f // TLB miss
If the MMU cannot distinguish TLB entries from different address spaces, then the kernel must clear or invalidate the TLB on each context switch from one process to another.
Contiguous memory allocation is not common.
Lecture 22
MIPS: V = 32, so the maximum virtual memory size is 2^32 bytes (4GB).
x86-64: V = 48, so the maximum virtual memory size is 2^48 bytes (256TB).
Much of the virtual memory may be unused. Application can use discontiguous segments of virtual memory
Example
Sizes
32-bit address space.
Each page is 4KB.
Question
How many bits is the offset?
12-bit offset, since 2^12 = 4KB; the page size tells us how big the offset is. The page number is 32 - 12 = 20 bits, so there are 2^20 page table entries. Each page table entry is 4 bytes, so the page table is 4MB. The page table may be far larger than the virtual memory the program actually uses, and most of the page table entries are marked as invalid.
Segmentation: Instead of mapping the entire virtual memory to physical, we can provide a separate mapping for each segment of the virtual memory that the application actually uses. The kernel maintains an offset and limit for each segment.
With K bits for the segment ID (and V-bit virtual addresses), we can have up to:
- 2^K segments
- 2^(V-K) bytes per segment
Kernel decides where each segment is placed in physical memory.
Better than dynamic allocation, but there still exists external fragmentation.
Many different approaches for translating segmented virtual addresses.
Approach 1: MMU has a relocation register and a limit register for each segment.
Let R_s be the relocation offset and L_s be the limit for the s-th segment.
To translate a virtual address v to a physical address p:
split v into segment number (s) and
address within segment (a)
if a >= L_s then generate exception
else
p = a + R_s
Example
Segmented Address Translation
Limit Register 0: 0x2000
Limit Register 1: 0x5000
Relocation Register 0: 0x38000
Relocation Register 1: 0x10000
0x1240
Segment = 0
Offset = 0x1240
0x39240
(Relocation register 0 + Offset)
The first bit is the segment bit; the rest is the address within the segment. Breaking the leading hex digit 1 into binary 0001: the first bit is 0, so segment = 0. We only use 1 bit (K = 1) to determine the segment, since there are only two segments.
0xa0a0
Segment = 1
Offset = 0x20a0
0x120a0
(Relocation Register 1 + Offset)
0x66ac
Segment = 0
Offset = 0x66ac
Exception
Another approach to translating segmented virtual addresses is to maintain a segment table instead of per-segment registers.
The virtual address is broken up into a segment number and an offset.
We check whether the segment number is valid. If it is, it is used as an index into the segment table. The MMU's segment table base register tells us where the current process's segment table (a kernel data structure) is located. Each entry in the segment table stores the segment's size and its start address (relocation value). To get the physical address, we add the start address from the segment table entry to the offset.
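A minimal sketch of this segment-table translation, assuming 16-bit virtual addresses with a 1-bit segment number as in the example above (the `struct segment` layout is hypothetical):

```c
#include <stdint.h>

#define SEG_BITS    1                    /* 1-bit segment number (two segments) */
#define ADDR_BITS   16
#define OFFSET_BITS (ADDR_BITS - SEG_BITS)

/* One segment table entry: where the segment starts in physical memory,
 * how big it is, and whether it is valid. */
struct segment { uint32_t start; uint32_t size; uint8_t valid; };

int seg_translate(const struct segment table[1 << SEG_BITS],
                  uint16_t v, uint32_t *p)
{
    uint32_t s = v >> OFFSET_BITS;               /* segment number         */
    uint32_t a = v & ((1u << OFFSET_BITS) - 1);  /* address within segment */

    if (!table[s].valid || a >= table[s].size)
        return -1;                               /* raise exception        */

    *p = table[s].start + a;                     /* start (relocation) + offset */
    return 0;
}
```

With the values from the example (segment 1 starting at 0x10000), v = 0xa0a0 yields s = 1, a = 0x20a0, and p = 0x120a0.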
Two level paging: Instead of having a single page table to map an entire virtual memory, we can split the page table into smaller page tables, and add a page table directory.
Each virtual address has three parts:
- Level one page number: used to index the directory
- Level two page number: used to index a page table
- Offset within a page
With a single-level page table and a 16-bit virtual address:
- Page number is 4 bits
- Offset is 12 bits
The number of page table entries is 2^4 = 16.
Alternatively, with two levels (16 bits):
- p1 is 2 bits
- p2 is 2 bits
- Offset is 12 bits
The first four pages are reached through the first directory entry, which stores a pointer to the second-level page table holding those 4 PTEs. The first two bits (p1) are used to index into the page table directory; the remaining two page-number bits (p2) are used to index within the second-level page table.
The MMU’s page table base register points to the page table directory for the current process.
How the MMU translates a virtual address:
- Index into the page table directory using p1 to get a pointer to a 2nd-level page table
- If the directory entry is not valid, raise an exception
- Index into the 2nd-level page table using p2 to find the PTE for the page
- If the PTE is not valid, raise an exception
- Otherwise, combine the frame number from the PTE with the offset to determine the physical address being accessed.
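A minimal C sketch of this two-level lookup, assuming 16-bit virtual addresses split into a 2-bit p1, a 2-bit p2, and a 12-bit offset as in the earlier example (the structs are hypothetical):

```c
#include <stdint.h>

#define OFFSET_BITS 12                    /* 4KB pages                            */
#define P2_BITS     2                     /* bits used to index a 2nd-level table */
#define P1_BITS     2                     /* bits used to index the directory     */

struct pte { uint32_t frame; uint8_t valid; };

/* A directory entry either points to a 2nd-level page table or is invalid. */
struct dir_entry { struct pte *table; uint8_t valid; };

int translate2(const struct dir_entry dir[1 << P1_BITS], uint16_t v, uint32_t *p)
{
    uint32_t offset = v & ((1u << OFFSET_BITS) - 1);
    uint32_t p2     = (v >> OFFSET_BITS) & ((1u << P2_BITS) - 1);
    uint32_t p1     =  v >> (OFFSET_BITS + P2_BITS);

    if (!dir[p1].valid)
        return -1;                        /* exception: no 2nd-level table */

    const struct pte *pte = &dir[p1].table[p2];
    if (!pte->valid)
        return -1;                        /* exception: page not mapped    */

    *p = (pte->frame << OFFSET_BITS) | offset;
    return 0;
}
```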
One goal of two-level paging was to keep individual page tables small.
Suppose we have a 40 bit virtual address and that
- PTE is 4 bytes
- page size is 4KB (2^12 bytes)
- We’d like to limit each page table’s size to 4KB
40 bits:
- Offset is 12 bits
- Each 4KB page table holds 4KB / 4 bytes per PTE = 1024 = 2^10 entries, so the number of bits for p2 is 10.
- p1 gets the remaining 18 bits (40 - 12 - 10).
Each second-level page table is then exactly 4KB in size, but the page table directory would need 2^18 entries (which is large). How can we shrink it if we must (i.e., its size must also be at most 4KB)?
- We can break the 18-bit p1 up further into two smaller indices, adding another level of page tables.
Thus we need three levels of paging to ensure that each page table’s size is no greater than 4KB.
This is multi-level paging.
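The bit-split reasoning above can be checked with a few lines of C; the constants are the ones from this 40-bit example:

```c
#include <stdio.h>

int main(void)
{
    int vaddr_bits  = 40;
    int offset_bits = 12;              /* 4KB pages                  */
    int pte_bytes   = 4;
    int table_bytes = 4096;            /* cap each page table at 4KB */

    /* Each 4KB table holds 4096/4 = 1024 PTEs, so one level can
     * translate log2(1024) = 10 bits of the page number. */
    int entries_per_table = table_bytes / pte_bytes;
    int bits_per_level = 0;
    while ((1 << bits_per_level) < entries_per_table)
        bits_per_level++;                                   /* 10 */

    int page_number_bits = vaddr_bits - offset_bits;        /* 28 */
    int levels = (page_number_bits + bits_per_level - 1) / bits_per_level;

    printf("%d bits per level, %d levels needed\n", bits_per_level, levels);
    return 0;   /* prints: 10 bits per level, 3 levels needed */
}
```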
Properties of multi-level paging
- Can map large virtual memories by adding more levels
- Individual page tables/directories can remain small
- Can avoid allocating page tables and directories that are not needed for programs that use a small amount of virtual memory
- TLB misses become more expensive as the number of levels goes up, since more directories must be accessed to find the correct PTE
We want the kernel to live in virtual memory, but there are challenges.
- Bootstrapping: Since the kernel helps to implement virtual memory, how can the kernel run in virtual memory when it’s just starting?
- Sharing: Sometimes data needs to be copied between the kernel and application programs. How can this happen if they are in different virtual address spaces?
We can make the kernel’s virtual memory overlap with process’ virtual memories.
Exploiting secondary storage
Goals
- Allow virtual address spaces that are larger than the physical address space
- Allow greater multiprogramming levels by using less of the available (primary) memory for each process
Method:
- Allow pages from virtual memories to be stored in secondary storage
- Swap pages between secondary storage and primary memory so that they are in primary memory when they are needed
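As a rough sketch of this method, the kernel's page-fault path might look like the following; the helper functions are hypothetical placeholders, not a real kernel API:

```c
#include <stdint.h>

/* Hypothetical helpers illustrating the demand-paging method above. */
uint32_t find_free_frame(void);                     /* may evict a victim page */
void     read_page_from_swap(uint32_t page, uint32_t frame);
void     map_page(uint32_t page, uint32_t frame);   /* set PTE valid + frame   */

/* Called when the MMU raises an exception because a page's PTE is not
 * valid: bring the page into primary memory so the access can be retried. */
void handle_page_fault(uint32_t page)
{
    uint32_t frame = find_free_frame();  /* evicting a page if memory is full */
    read_page_from_swap(page, frame);    /* copy page in from secondary storage */
    map_page(page, frame);               /* update the PTE; retry the access  */
}
```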