Professor: Chengnian Sun | Term: Fall 2024
Lecture 1
Question
What is a sequential program?
Question
What really happens when I compile and run a program?
Question
How does a computer take code and turn it into something it can utilize?
By the end of the course, there should be very little mystery left about computers or computer programs.
High level overview of compilers
Compiler: Scans (breaks the source into tokens) → Parses (parse tree) {front end} → Optimizes (tree representation of the program) {middle end} → Codegen (generate MIPS/ARM/x86/RISC-V) {back end}
Basic Definitions
Definition
A bit is a binary digit (0 or 1)
Definition
A nibble is 4 bits (1001)
Definition
A byte is 8 bits (10011101)
Definition
A word is a machine-specific grouping of bytes. For us, a word will be 4 bytes (32-bit architecture) though 8-byte (64-bit architecture) words are more common now.
Definition
The base-16 representation is called the hexadecimal system. It consists of the digits 0-9 and the letters a-f (representing 10 through 15 in decimal).
To convert a binary number to hexadecimal, break it into nibbles and convert each nibble separately.
The prefix 0x denotes a hexadecimal representation (a subscript 16 also works). A conversion table will be provided on the exam.
Bytes as binary numbers
- Unsigned (non-negative integers)
- Signed integers
Unsigned
The value of a number $b_{n-1} b_{n-2} \dots b_1 b_0$ stored in this system is the binary sum $\sum_{i=0}^{n-1} b_i 2^i$, where each $b_i$ is either $0$ or $1$.
For example, $10011101 = 128 + 16 + 8 + 4 + 1 = 157$.
A byte (unsigned) can represent the numbers $0$ to $255$.
Converting from decimal to binary:
Approach 1: Take the largest power of $2$ less than or equal to the number, subtract it, and repeat.
Approach 2: Repeatedly divide by $2$, recording the remainder each time, until the quotient is $0$.
The remainders (read from bottom to top) are the binary representation.
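A minimal C++ sketch of Approach 2 (repeated division by 2); the function name is just illustrative, and the reverse at the end gives the remainders in bottom-to-top order.

```cpp
#include <algorithm>
#include <iostream>
#include <string>

// Convert a non-negative integer to binary by repeatedly dividing by 2
// and collecting the remainders.
std::string toBinary(unsigned n) {
    if (n == 0) return "0";
    std::string bits;
    while (n > 0) {
        bits += char('0' + n % 2);  // remainder is the next (low-order) bit
        n /= 2;
    }
    std::reverse(bits.begin(), bits.end());  // remainders read bottom to top
    return bits;
}

int main() {
    std::cout << toBinary(157) << "\n";  // 10011101
}
```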
Signed Integers
Question
How do we represent negative integers?
Attempt 1: Make the first bit a signed bit. This is called the “sign-magnitude” representation.
A leading 0 represents positive and a leading 1 represents negative; use the rest of the bits for the magnitude.
We get positive and negative zero.
Attempt 2: Two’s complement form
Similar to sign-magnitude in spirit. The first bit is 0 if non-negative and 1 if negative (the MSB is either 0 or 1).
Negate value by just subtracting from zero and letting it overflow.
A trick to get the same thing:
- Take the complement of all bits and then add 1.
- Or, locate the rightmost 1 bit and flip all the bits to the left of it.
Decimal to Two’s Complement
Compute the two's complement representation of a negative number using one byte of space: first write its absolute value in binary.
Negate this number.
The result is the number in 2's complement.
Two’s Complement to Decimal
To convert a negative two's complement number to decimal, one method is to flip the bits and add 1 (or use the shortcut). Then compute the value of the result and negate it.
Another way is to treat the byte as unsigned and subtract $2^8 = 256$.
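A small C++ sketch of both directions on one byte; the value 57 is only an illustration.

```cpp
#include <cstdint>
#include <iostream>

int main() {
    // Negate by taking the complement of all bits and adding 1.
    uint8_t pos = 57;                              // 0011 1001
    uint8_t neg = static_cast<uint8_t>(~pos + 1);  // two's complement of 57

    // To read an 8-bit two's complement value: treat it as unsigned and,
    // if the sign bit is set, subtract 2^8 = 256.
    int value = (neg & 0x80) ? static_cast<int>(neg) - 256 : neg;

    std::cout << static_cast<int>(neg) << " as unsigned, "
              << value << " as signed\n";  // 199 as unsigned, -57 as signed
}
```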
Lecture 2
Warmup
to twos complement
(absolute value)
Convert to twos complement (shortcut)
to twos complement
Convert to twos complement (shortcut)
to twos complement
Convert to absolute value using shortcut
This is wrong. We cannot represent 128 this way in one byte (but we can represent -128).
if (a >= 0) return a;
else return -a;
Bytes as Characters
ASCII uses 7 bits to represent characters.
Note that 'a' is different from 0xa. The former is the character whose ASCII code is 97; the latter is the number 10 in decimal.
Prints $0$.
Bit-Wise Operators
Suppose we have
unsigned char a = 5, b = 3
Bitwise not ~: flip every bit (unary operator). ~a = 250.
Bitwise and &: (binary operator). a & b = 1.
Bitwise or |: (binary operator). a | b = 7.
Bitwise exclusive or ^: (binary operator). a ^ b = 6.
Returns 1 wherever the bits are different.
Bitwise shift right >> and shift left << (still dealing with unsigned values).
Right shift a >> n: discard the rightmost n bits and fill with 0s on the left, where n is the number of bits shifted; this divides by 2^n.
Left shift a << n: discard the leftmost n bits and fill with 0s on the right, where n is the number of bits shifted; this multiplies by 2^n.
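A quick C++ check of these operators on a = 5 and b = 3 (the values from above):

```cpp
#include <iostream>

int main() {
    unsigned char a = 5, b = 3;      // a = 0000 0101, b = 0000 0011

    unsigned char and_ = a & b;      // 0000 0001 = 1
    unsigned char or_  = a | b;      // 0000 0111 = 7
    unsigned char xor_ = a ^ b;      // 0000 0110 = 6 (1 where the bits differ)
    unsigned char not_ = ~a;         // 1111 1010 = 250 (every bit flipped)

    unsigned char left  = a << 2;    // 0001 0100 = 20 (multiply by 2^2)
    unsigned char right = a >> 1;    // 0000 0010 = 2  (divide by 2^1)

    std::cout << int(and_) << " " << int(or_) << " " << int(xor_) << " "
              << int(not_) << " " << int(left) << " " << int(right) << "\n";
}
```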
Computer Programs
Question
What is a Computer Program?
Programs operate on data.
Programs are data. This is a von Neumann architecture: Programs live in the same memory space as the data they operate on.
Programs can manipulate other programs.
We will use 32-bit MIPS in this course
CPU
- 32 general purpose registers
- Control Unit
- Memory
- Registers
- L1 cache
- L2 cache
- RAM
- Disk
- Network memory
- ALU
Registers are very fast memory.
Some general-purpose registers are special:
- $0 is always 0
- $31 is for the return address
- $30 is our stack pointer
- $29 is our frame pointer
Code is just data: it is stored in RAM.
MIPS takes instructions from RAM and attempts to execute them.
Recall Problem: We only know from context what bits have what meaning, and in particular, which are instructions.
Solution: Convention is that the word at memory address 0 in RAM is an instruction.
Question
Problem: How does MIPS know what to do next?
Solution: Have a special register called the Program Counter (PC) to tell us what instruction to do next.
Question
Problem: How do we put our program into RAM?
Solution: A program called a loader puts our program into memory and sets the PC to be the first address.
Fetch-Execute Cycle
PC = 0
while true do
IR = MEM[PC] // loading the first word/instruction into IR
PC += 4 // update PC to the next instruction
Decode and execute instruction in IR // note we increment PC first
end while
This is the only program a machine really runs. This is the essence of the CPU cycle.
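A toy C++ sketch of the same loop, with MEM as an array of words and decoding left as a stub; this is just the shape of the cycle, not real hardware.

```cpp
#include <cstdint>
#include <vector>

int main() {
    std::vector<uint32_t> MEM(1024 / 4, 0);  // RAM, one uint32_t per word
    uint32_t PC = 0;                          // program counter (byte address)
    uint32_t IR = 0;                          // instruction register

    while (true) {
        IR = MEM[PC / 4];    // fetch: load the word at address PC
        PC += 4;             // update PC to the next instruction first
        // decode and execute IR here; e.g. (IR >> 26) gives the opcode
        if (IR == 0) break;  // stand-in for "jr $31 back to the loader"
    }
}
```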
First Example: Addition
Write a program in MIPS that adds the values in registers $8 and $9 and stores the result in register $3.
add:
Add
000000ssssstttttddddd00000100000
$s = source, $t = second source, $d = destination
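Using the field layout above, the instruction word can be assembled with shifts and ors (the same bit-shifting trick used in the C++ assembler section later). Here for add $3, $8, $9:

```cpp
#include <cstdint>
#include <iostream>

int main() {
    uint32_t s = 8, t = 9, d = 3;   // add $3, $8, $9

    // 000000 sssss ttttt ddddd 00000 100000
    uint32_t instr = (s << 21) | (t << 16) | (d << 11) | 0x20;

    std::cout << std::hex << instr << "\n";  // 1091820
}
```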
Putting values in Registers
lis: Load Immediate & Skip
0000000000000000ddddd00000010100
Load the value 1 into register $3.
lis: (000 ... 00011 ...)
load and skip to the next instruction
value: 000 ... 1 (which is 1)
$d = MEM [PC]; PC += 4
Lecture 3
Recall
Fetch-Execute Cycle
PC = 0 // mem address of next instruction
while true {
IR = MEM[PC]
PC = PC+4
Decode IR then execute
}
add: Add
$s = source, $t = second source, $d = destination
lis: Load Immediate & Skip, for putting values directly into registers.
$3 = 1
0x0: 000... 00 00011 0....1..0 (d = 3)
0x4: 00000000000000000000000000000001 (value 1)
$d = MEM[PC]; PC += 4
$d = MEM[PC] // load
$3 = MEM[0x4]
PC becomes 0x8
Add the values 11 and 13 and store the result in register $3.
0x0: lis 01000 //
0x4: 11 as a 32 bit word
0x8: lis 01001
0xc: 13 as a 32 bit word
0x10: add 01000 and 01001 and store in 00011
Question
How do we stop?
Operating system provides a return address in register $31
.
We get to that return address via jr
(Jump Register).
Multiplying two words together might give a word that requires twice as much space.
To deal with this, we use two registers: hi and lo.
hi has the leftmost 32 bits, and lo has the rightmost 32 bits.
Division performs integer division and stores the quotient in lo and the remainder in hi.
hi and lo are special purpose registers: they don't have register numbers.
mfhi and mflo are instructions that copy hi and lo into general-purpose registers.
For example, mfhi $3 sets $3 = hi.
Larger amount of memory is stored off the CPU.
RAM access is slower than register access (but is larger).
Data travels between RAM and CPU via the bus.
Words occur every 4 bytes, starting with byte 0. They are indexed by multiples of 4 (0x0, 0x4, 0x8, ...).
Cannot directly use the data in the RAM. Must transfer first to the registers.
Operations on RAM.
Load word (lw) takes a word from RAM and places it into a register. Specifically, load the word at MEM[$s + i] and store it in $t.
lw $t, i($s)
where $t: result register; $s: register holding a memory address; i: 16-bit signed offset.
0x4444
lis $3 // $3 <- mem address
binary word for 4
lw $3, 0($3)
// register 3 now has the value in memory address 4
Load = read; store = write.
Store word (sw) takes a word from a register and stores it into RAM. Specifically, take the word in $t and store it at MEM[$s + i].
sw $0, 0($3)
MEM[$3 + 0] = 0
Machine code is difficult to memorize and edit.
Assembly language is a text language in which there is a 1-to-1 correspondence between assembly instructions and machine code instructions. It is more human-readable. Higher-level languages have a more complex mapping to machine code.
Lecture 4
A string like add $3, $2, $1
is stored as a sequence of characters.
If we can break this down to meaningful chunks of information, it would be easier to translate.
[add][$3][$2][$1]
A scanner breaks strings down into tokens. This is not as simple as looking for spaces; the words need to make sense.
Scanner extracts the Kind and Lexeme
Identifier: "sub"
int a = 241;
INT: int
Identifier: "a"
EQ: =
NUM: 241
SEMI: ";"
The string “241” has four ways to be split up.
In the C statement int x = 241;
we want to interpret “241” as a single number.
Definition
Maximal munch: Always choose the longest meaningful token while breaking up the string
Question
Then how do we know when to stop? There’s no limit to how long a number can be.
We need a way to wrangle infinitely many possible strings.
Definition
An alphabet is a non-empty, finite set of symbols, often denoted by $\Sigma$.
Definition
A string (or word) is a finite sequence of symbols chosen from $\Sigma$. The set of all strings over an alphabet $\Sigma$ is denoted by $\Sigma^*$.
Definition
A language is a set of strings
Definition
The length of a string $w$ is denoted by $|w|$.
Alphabets:
- $\{a, b, \dots, z\}$: the Latin alphabet
- $\{0, 1\}$: the alphabet of binary digits
- $\{0, 1, \dots, 9\}$: the alphabet of base 10 digits
- $\{0, \dots, 9, a, \dots, f\}$: hexadecimal
Strings:
$\varepsilon$ is the empty string. It is in $\Sigma^*$ for any $\Sigma$.
For , strings include or . Note and .
Languages:
$\emptyset$ (or $\{\}$), the empty language
$\{\varepsilon\}$, the language consisting of (only) the empty string
, the set of strings over the alphabet consisting of an followed by or more characters followed by an .
The objective of our scanner is to break a string into words in a given language.
Simpler objective: Given a language, determine if a string belongs to the language.
How hard is this? Depends on the language.
any dictionary: Trivial
: Very easy
Valid MIPS assembly programs : Easy
Valid Java/C/C++ programs : Harder
Set of programs that halt : Impossible
Memberships in Languages
In order of relative difficulty:
- Finite
- Regular
- Context-free
- Context-sensitive
- Recursive
- Impossible languages
Consider the language {bag, bat, bit}.
We can just check whether a string is in the list.
Question
But can we do this in a more efficient way?
Hash Map/Set
for each w in L {
if w' = w
Accept
}
Still not the most efficient
Most efficient is to check each letter at a time and reject if it can’t be possible.
In our example, all our words start with b. If our first symbol for input is not b, we can instantly reject. If the first symbol is b and our second is a, we can reject “bit” and keep going. Similar to prefix tree.
Important Features of Diagram
- An arrow into the initial start state
- Accepting states are two circles
- Arrows from state to state are labelled
- Error state(s) are implicit
Definition
A regular language over an alphabet $\Sigma$ is one of the following:
- The empty language and the language consisting of the empty word are regular
- All languages $\{a\}$ for $a \in \Sigma$ are regular.
- The union or concatenation of any two regular languages is regular, as is the Kleene star of any regular language.
- Nothing else
Let $L_1$ and $L_2$ be regular languages. Then the following are regular languages:
Union: $L_1 \cup L_2$
Concatenation: $L_1 L_2 = \{xy : x \in L_1, y \in L_2\}$
Kleene star: $L_1^*$, the set of all strings formed by concatenating zero or more strings from $L_1$
Suppose that $L_1 = $ {up, down} and $L_2 = $ {hill, load} over appropriate alphabets. Then
- $L_1 \cup L_2 = $ {up, down, hill, load}
- $L_1 L_2 = $ {uphill, upload, downhill, download}
Let . Explain why the language is regular.
Solution: and are finite, and so regular. is also regular, regular languages are closed under Kleene star. Then, the concatenation must also be regular.
In tools like egrep
, regular expressions are often used to help find patterns of text. Regular expressions are just a way of expressing regular languages.
The notation is very similar, except we drop the set notation. As examples
- becomes
- becomes
- Concatenation is never written with an explicit operator
- Order of operations:
- ? means optional
Extending the Finite Languages Diagram
We can allow our picture to have loops.
Definition
Deterministic Finite Automaton (DFA). A DFA is a 5-tuple $(\Sigma, Q, q_0, A, \delta)$ where
- $\Sigma$ is a finite non-empty set (alphabet)
- $Q$ is a finite non-empty set of states
- $q_0 \in Q$ is a start state
- $A \subseteq Q$ is a set of accepting states
- $\delta : Q \times \Sigma \to Q$ is our total transition function (given a state and a symbol of our alphabet, what state should we go to?).
Lecture 5
Creating Binary in C++
How do we write the binary output
0001 0100 0100 0000 1111 1111 1111 1101
bne $2, $0, -1
Convert the registers to binary bits
We can use bit shifting to put the information into the correct position.
bne
opcode is 5
int instr = (5 << 26) | (2 << 21) | (0 << 16) | offset
Shift the opcode left 26 bits, shift the binary representation of register $s left 21 bits, and shift the binary representation of register $t left 16 bits.
We need to be careful with the offset.
Recall in C++, ints are 4 bytes. We only want the last two bytes. First we need to apply a “mask” to only get the last 16 bits.
int offset = -1
As a 32-bit int, -1 is 111....1 (all 32 bits set).
If we bitwise-or this directly into the instruction, the result is 111....1.
This does not work, since we change all 32 bits when we only want to change the last 16 bits.
We discard the leftmost 16 bits by masking.
int offset = -1 & 0xffff
Can declare the offset to be int16_t
(still need to do the bit masking).
Just need to cout << instr
right? Wrong.
This output would be 9 bytes, corresponding to the ASCII code for each digit of the instruction as interpreted in decimal. We want to put the four bytes that correspond to this number.
Printing Bytes in C++
You can also mask here to get the “last byte” by doing & 0xff if worried about which byte will get copied over.
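One way to emit the four bytes of instr in big-endian order, masking each byte as described; this is a sketch of the idea, not the only possible way to write it.

```cpp
#include <iostream>

// Write the four bytes of a 32-bit instruction, most significant byte first.
void outputWord(int instr) {
    std::cout << char((instr >> 24) & 0xff)
              << char((instr >> 16) & 0xff)
              << char((instr >> 8) & 0xff)
              << char(instr & 0xff);
}

int main() {
    int instr = (5 << 26) | (2 << 21) | (0 << 16) | (-1 & 0xffff);  // bne $2, $0, -1
    outputWord(instr);  // emits the bytes 0x14 0x40 0xff 0xff, not decimal digits
}
```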
Rules for DFA
States can have labels inside the bubble; this is how we refer to the states in $Q$.
For each character, follow the transition. If there is none, go to the implicit error state.
Once the input is exhausted, check if the final state is accepting. If so, accept. Otherwise, reject.
Warm-up Problem
Write a DFA over that
-
Accepts only words with an even number of s
-
Accepts only words with an odd number of s and an even number of s
-
Accepts only words where the parity of the number of s is equal to the parity of the number of s
-
Write a DFA over that accepts all words ending with bba.
-
Accepts only words with an even number of s
is our start state
even s
odd s
is defined by
- Accepts only words with an odd number of s and an even number of s
is our start state
even s, odd s
odd s, even s
even s, odd s
odd s, odd s
is defined by:
- Accepts only words where the parity of the number of s is equal to the parity of the number of s
is our start state
even s, odd s
odd s, even s
even s, odd s
odd s, odd s
is defined by:
This is not the optimal solution.
is our start state
same parity
different parity
is defined by
- Write a DFA over that accepts all words ending with bba.
is our start state
is defined by
- (self loop)
- (need to start over)
- (self loop)
- (start over)
- (start over but not at beginning)
Definition
The language of a DFA $M$, written $L(M)$, is the set of all strings accepted by $M$, that is:
w = a_1a_2...a_n
s = q_0
for i in 1 to n do
s = \delta(s,ai)
end for
if s \in A then
Accept
else
Reject
end if
You could also use a lookup table.
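A hedged C++ sketch of that idea for the {bag, bat, bit} example: the transition function is written as a small switch (a 2D lookup table indexed by state and character would work the same way), and -1 plays the role of the implicit error state. The state numbering here is made up for the example.

```cpp
#include <iostream>
#include <string>

// DFA for the finite language {bag, bat, bit}.
// States: 0 = start, 1 = "b", 2 = "ba", 3 = "bi", 4 = accepting, -1 = error.
int delta(int state, char c) {
    switch (state) {
        case 0: return c == 'b' ? 1 : -1;
        case 1: return c == 'a' ? 2 : (c == 'i' ? 3 : -1);
        case 2: return (c == 'g' || c == 't') ? 4 : -1;
        case 3: return c == 't' ? 4 : -1;
        default: return -1;   // error state is a sink
    }
}

bool accepts(const std::string& w) {
    int state = 0;
    for (char c : w) state = delta(state, c);
    return state == 4;        // accept only if we end in an accepting state
}

int main() {
    std::cout << accepts("bat") << " " << accepts("bad") << "\n";  // 1 0
}
```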
Theorem (Kleene)
A language $L$ is regular if and only if $L = L(M)$ for some DFA $M$. That is, regular languages are precisely the languages accepted by DFAs.
Question
Is C a regular language?
The following are regular
- C keywords
- C identifiers
- C literals
- C operators
- C comments
Sequences of these are also regular (Kleene star). Finite automata can do our tokenization.
Question
What about punctuation? Even simpler: set $\Sigma = \{(, )\}$ and $L = $ {strings with balanced parentheses}. Is $L$ regular?
w = ()
w = (())
w = (((()))((())))
This language is not regular. It is a context-free language (more on this later).
Question
How does our scanner work?
Our goal is: Given some text, break up the text into tokens.
Some tokens can be recognized in multiple different ways.
w = 0x12cc
. This could be a single hex literal, or it could be an int (0) followed by an id (x) followed by an int (1) followed by another int (2) followed by an id (cc).
Formalization of this problem.
Given a regular language $L$ (say, $L$ is the set of all valid MIPS or C tokens), determine if a given word $w$ is in $L^*$ (or, in other words, in $L^+$, since we don't consider the empty program to be valid).
Consider the language of just ID tokens in MIPS:
is our start state
is defined by:
w = abcde
Initially in .
Then there are two options. Either stay in , or return the empty token.
In this case, we would get [ID, "a"]
Go back to . We can do the same thing again after reaching with b
.
Now, we also get [ID, "b"]
We can keep doing this
This time, we want to stay in . We will stay in for d
and e
. We have consumed all the symbols.
We would then output [ID, "cde"]
. We now have a longer token.
Lecture 6
Given a regular language , determine if a word is in .
Two algorithms:
- Maximal munch
- Simplified maximal munch
Idea: Consume the largest possible token that makes sense. Produce the token and then proceed.
Difference:
- Maximal Munch: Consume characters until you no longer have a valid transition. If you have characters left to consume, backtrack to the last valid accepting state and resume.
- Simplified Maximal Munch: Consume characters until you no longer have a valid transition. If you are currently in an accepting state, produce the token and proceed. Otherwise go to an error state.
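A sketch of simplified maximal munch in C++, built on a tiny DFA for just identifiers and numbers: keep consuming while a transition exists, then emit a token if the current state is accepting, otherwise fail. The DFA and names here are illustrative, not the course's official scanner.

```cpp
#include <cctype>
#include <iostream>
#include <stdexcept>
#include <string>
#include <vector>

// A tiny DFA: state 0 is the start, state 1 accepts identifiers (letters),
// state 2 accepts numbers (digits). -1 means "no valid transition".
int delta(int state, char c) {
    if ((state == 0 || state == 1) && std::isalpha(c)) return 1;
    if ((state == 0 || state == 2) && std::isdigit(c)) return 2;
    return -1;
}
bool accepting(int state) { return state == 1 || state == 2; }

// Simplified maximal munch: consume while a transition exists; when stuck,
// emit a token if the current state is accepting, otherwise report an error.
std::vector<std::string> tokenize(const std::string& input) {
    std::vector<std::string> tokens;
    size_t i = 0;
    while (i < input.size()) {
        int state = 0;
        std::string lexeme;
        while (i < input.size() && delta(state, input[i]) != -1) {
            state = delta(state, input[i]);
            lexeme += input[i++];
        }
        if (!accepting(state))
            throw std::runtime_error("scan error at position " + std::to_string(i));
        tokens.push_back(lexeme);
    }
    return tokens;
}

int main() {
    for (const auto& t : tokenize("abc241xy")) std::cout << t << "\n";  // abc, 241, xy
}
```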
DFA for now
is our start state
is defined by
Note there is an $\varepsilon$/output-token transition from the accepting states back to the start state.
Maximal Munch:
- Algorithm consumes and flags this state as its accepting, then tries to consume but ends up in an error state.
- Algorithm then backtracks to first since that was the last accepting state. Token is output.
- Algorithm then resumes consuming and flags this state as it is accepting, then tries to consume but ends up in an error state.
- Algorithm then backtracks to first since that was the last accepting state. Token is output.
- Algorithm then consumes the second , the second , the first , the third and runs out of input. This last state is accepting so output our last token and accept.
Simplified Maximal Munch:
- Algorithm consumes , then tries to consume but ends up in an error state. Note there is no keeping track of the first accepting state.
- Algorithm then checks to see if is accepting. It is not (as ).
- Algorithm rejects .
- Note: This gave the wrong answer, but this algorithm is usually good enough, and is used in practice.
Simplified demo
: (accepting)
ERROR (give error)
Consider the following C++ line:
Notice that at the end, there is a token which, on its own, is a valid token. With either algorithm we would reject this declaration. To write this declaration, you needed a space:
Question
What was the point of scanning?
Machine language is hard to write: We want to use assembly language.
We need to scan assembly lines in order to compile assembly language.
All the machine language operations we’ve seen so far, as assembly.
The order of $s, $t, and $d is different in assembly than in machine code.
are different in assembly than machine code.
Suppose that $1 contains the address of an array and $2 contains the number of elements in this array (assume it is small enough that we don't have to worry about overflow). Place the number in the last possible spot in the array.
Question
Write an assembly language MIPS program that takes a value in register $1 and stores the value of its last base-10 digit in register $2.
MIPS also comes equipped with control statements.
beq $s, $t, i: branch if equal. If $s = $t, skip ahead by i instructions.
If i = 0, we don't skip the next instruction. If i = -1, we get an infinite loop.
Question
Write an assembly language MIPS program that places the value 1 in register $2 if the signed number in register $1 is odd, and places the value 0 in register $2 if the number is even.
Inequality Command
slt $d, $s, $t
Set Less Than. Sets the value of register $d to be 1 provided the value in register $s is less than the value in register $t, and sets it to be 0 otherwise.
Note: There is also an unsigned version of this command.
Question
Write an assembly language MIPS program that negates the value in register $1 provided it's positive.
Question
Write an assembly language MIPS program that places the absolute value of register $1 in register $2.
With branching we can even do looping.
Question
Write an assembly language MIPS program that adds together all even numbers from to inclusive. Store the answer in register
$3
.
Hard coding the above isn’t good for the long run. We fix this by using a label.
label: operation commands
Explicit Example
We can thus do
The assembler computes the difference between the program counter and the address of top. The PC here is the address of the line after the current line.
Lecture 7
What if a procedure wants to use registers that already hold data?
We could preserve registers: save and restore them.
We have lots of memory in RAM we can use. We don’t want our procedures to use the same RAM.
Used RAM goes after Free RAM.
Calling procedures pushes more registers onto the stack and returning pops them off.
We call $30
our stack pointer.
Template for Procedures
The procedure modifies $1 and $2.
Entry: preserve $1 and $2.
Exit: restore $1 and $2.
There is a problem with returning:
We get a new command jalr $s.
Jump and Link Register. Sets $31 to be the PC and then sets the PC to be $s. Accomplished by temp = $s, then $31 = PC, then PC = temp.
jalr will overwrite register $31. How do we return to the loader from main after using jalr? What if procedures call each other?
We need to save this register first.
Question
How do we pass arguments?
Typically, we’ll just use registers. If we have too many, we could push parameters to the stack and then pop them.
Sum Evens to (assume )
Question
How do we print to the screen or read input?
We do this one byte at a time.
Output: Use sw to store words to location 0xffff000c. The least significant byte will be printed.
Input: Use lw to load words from location 0xffff0004. The least significant byte will be the next character from stdin.
Print cs
to the screen followed by a newline character.
Let’s finish up the assembler. Language translation involves two phases: Analysis and Synthesis.
Analysis: Understand what is meant by the input source. Use DFAs and maximal munch to break into tokens. But there’s more; valid ranges, labels.
Synthesis: Output the equivalent target code in the new format
The Biggest Analysis Problem
How do we assemble this code:
The problem is that myLabel is used before it's defined: we don't know the address when it's used.
The best fix to this is to perform two passes.
Pass 1: Group tokens into instructions and record addresses of labels. (Note: multiple labels are possible for the same line).
Pass 2: Translate each instruction into machine code. If it refers to a label, look up the associated address and compute the value.
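A minimal C++ sketch of the two-pass idea, assuming the lines have already been scanned; the line handling is simplified (one label or one instruction per line) and just illustrates recording label addresses in pass 1 and computing a branch offset in pass 2.

```cpp
#include <iostream>
#include <map>
#include <string>
#include <vector>

int main() {
    // Pass 1: walk the lines, assign each instruction an address, and record
    // every label definition in a symbol table.
    std::vector<std::string> lines = {"top:", "add $3, $3, $2", "bne $3, $0, top"};
    std::map<std::string, int> labels;
    int addr = 0;
    for (const auto& line : lines) {
        if (line.back() == ':') labels[line.substr(0, line.size() - 1)] = addr;
        else addr += 4;                 // only real instructions take up a word
    }

    // Pass 2: when an instruction refers to a label, look up its address.
    // For a branch, the encoded offset is (labelAddr - PC) / 4, where PC is
    // the address of the *next* instruction.
    int pcAfterBranch = 8;              // the bne above sits at address 4
    int offset = (labels["top"] - pcAfterBranch) / 4;
    std::cout << offset << "\n";        // -2
}
```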
Lecture 8
Recall the definition of DFA
We can extend the definition of $\delta$ to a function $\delta^* : Q \times \Sigma^* \to Q$ defined over strings via
$\delta^*(q, \varepsilon) = q$ and $\delta^*(q, aw) = \delta^*(\delta(q, a), w)$,
where $a \in \Sigma$ and $w \in \Sigma^*$. If processing a string, process a letter first then process the rest of the string.
Info
A DFA given by $(\Sigma, Q, q_0, A, \delta)$ accepts a string $w$ if and only if $\delta^*(q_0, w) \in A$.
What if we allowed more than one transition from a state with the same symbol?
To make the right choice, we would need an oracle that can predict the future.
This is called non-determinism. We then say that a machine accepts a word if and only if there exists some path that leads to an accepting state.
We can simplify the “ends with bba” example from previous lecture to an NFA.
is our start state
Transitions
Definition
Let $M$ be an NFA. We say that $M$ accepts $w$ if and only if there exists some path through $M$ that leads to an accepting state.
The language of an NFA $M$, written $L(M)$, is the set of all strings accepted by $M$, that is:
Definition
An NFA is a 5-tuple $(\Sigma, Q, q_0, A, \delta)$ where
- $\Sigma$ is a finite non-empty set (alphabet)
- $Q$ is a finite non-empty set of states
- $q_0 \in Q$ is a start state
- $A \subseteq Q$ is a set of accepting states
- $\delta : Q \times \Sigma \to 2^Q$ is our total transition function. Note that $2^Q$ denotes the power set of $Q$, that is, the set of all subsets of $Q$. This allows us to go to multiple states at once.
We can extend the definition of $\delta$ to $\delta^* : 2^Q \times \Sigma^* \to 2^Q$ via
$\delta^*(S, \varepsilon) = S$ and $\delta^*(S, aw) = \delta^*\left(\bigcup_{q \in S} \delta(q, a), w\right)$,
where $a \in \Sigma$ and $w \in \Sigma^*$.
Definition
An NFA given by $(\Sigma, Q, q_0, A, \delta)$ accepts a string $w$ if and only if $\delta^*(\{q_0\}, w) \cap A \neq \emptyset$.
Using the NFA defined earlier (bba) process abbba
abbba
Process a
(self-loop is the only option)
Process b
(two options)
Process b
Process b
Process a
Thus the input string is accepted by the NFA.
Question
Why are NFAs not more powerful than DFAs?
Even the power-set of a set of states is still finite. We can represent the set of states in the NFA as single states in the DFA.
Algorithm to convert from NFA to DFA.
- Start with the state $\{q_0\}$.
- From this state, go to the NFA and determine what happens on each symbol $a \in \Sigma$. The set of resulting states should become its own state in your DFA.
- Repeat the previous step for each new state created until you have exhausted every possibility.
- Accepting states are any states that included an accepting state of the original NFA.
Previous NFA as a DFA.
Transitions (for DFA)
States in our DFA
Example
Let Write an NFA such that
not ending with a
is our start state
ending with a
Transitions:
Example
Let Write an NFA such that
First element of union
is our start state
Transitions
Second element of union
is our start state
Transitions
Question
How do we combine these?
Transitions
The DFA is complicated. Note: Combining two languages is non-obvious.
Summary: From Kleene’s theorem, the set of languages accepted by a DFA are the regular languages. The set of languages accepted by DFAs are the same as those accepted by NFAs. Therefore, the set of languages accepted by an NFA are precisely the regular languages.
Question
What if we permitted state changes without reading a character?
These are known as $\varepsilon$-transitions.
Definition
An $\varepsilon$-NFA is a 5-tuple $(\Sigma, Q, q_0, A, \delta)$ where
- $\Sigma$ is a finite non-empty set that does not contain the symbol $\varepsilon$
- $Q$ is a finite non-empty set of states
- $q_0 \in Q$ is a start state
- $A \subseteq Q$ is a set of accepting states
- $\delta : Q \times (\Sigma \cup \{\varepsilon\}) \to 2^Q$ is our total transition function.
These $\varepsilon$-transitions make it trivial to take the union of two NFAs.
Transitions:
Extending $\delta^*$ for an $\varepsilon$-NFA
Define $E(S)$ to be the epsilon closure of the set of states $S$, that is, the set of all states reachable from $S$ in 0 or more $\varepsilon$-transitions.
Note, this implies that $S \subseteq E(S)$.
Again we can extend the definition of $\delta$ to a function $\delta^* : 2^Q \times \Sigma^* \to 2^Q$ via
$\delta^*(S, \varepsilon) = E(S)$ and $\delta^*(S, aw) = \delta^*\left(\bigcup_{q \in E(S)} \delta(q, a), w\right)$,
where $a \in \Sigma$ and $w \in \Sigma^*$.
Definition
An $\varepsilon$-NFA given by $(\Sigma, Q, q_0, A, \delta)$ accepts a string $w$ if and only if $\delta^*(\{q_0\}, w) \cap A \neq \emptyset$.
Lecture 9
Recall from last lecture:
Extending $\delta^*$ for an $\varepsilon$-NFA
$S$: a set of states
-NFA:
Simulating an -NFA
Let $E(S)$ be the epsilon closure of a set of states $S$. Recall $S \subseteq E(S)$.
Compute $\delta^*(\{q_0\}, w)$; if it contains an accepting state, then Accept.
Otherwise Reject.
Example:
is the start state
Transitions:
Take the string
Example:
is the start state
Transitions:
This machine represents the regular language
Take the string
“a”
Next symbol
“b”
since is available form via transitions.
Next symbol
“c”
Last symbol
“a”
Since , accept.
DFA $\equiv$ NFA $\equiv$ $\varepsilon$-NFA $\equiv$ Regular languages $\equiv$ Regular expressions
If we can show that an $\varepsilon$-NFA exists for every regular expression, then we have proved one direction of Kleene's theorem. We can do this by structural induction.
Regular Language
NFA that recognizes
is our start state
NFA that recognizes
is our start state
NFA that recognizes
is our start state
Transitions:
NFA that recognizes union
Connect start state with first state of and via
NFA that recognizes concatenation
Connect start state to start state of with transition. Connect the final state in with the first state in via transition.
NFA that recognizes
From the accepting state of , draw a transition back to start state and (accepting state) . has a transition to the beginning of .
We have completed Scanning. Now onto syntax.
Motivating Example:
Consider and
Question
Is this language regular? Can we build a DFA for L?
No.
Consider the regular attempt:
The is the problem. What goes inside the parentheses is the entire language of matched parentheses. What if we could recurse in regular expressions?
(Note: this just covers , not e.g., )
In terms of power, context-free languages are exactly regular languages plus recursion. In terms of expression, rather than extend regular expressions, we have a different form called grammars.
Definition
Grammar is the language of languages.
Grammars help us describe what we are allowed and not allowed to say. Context-free grammars are a set of rewrite rules that we can use to describe a language.
CFG for C++ has a lot of recursion.
Definition
A Context-Free Grammar (CFG) is a 4-tuple $(N, \Sigma, P, S)$ where
- $N$ is a finite non-empty set of non-terminal symbols
- $\Sigma$ is an alphabet: a non-empty set of terminal symbols
- $P$ is a finite set of productions, each of the form $A \to \beta$ where $A \in N$ and $\beta \in (N \cup \Sigma)^*$
- $S \in N$ is a starting symbol
Note: We set $V = N \cup \Sigma$ to denote the vocabulary, that is, the set of all symbols in our language.
Conventions:
Lower case letters from the start of the alphabet, i.e., a, b, c, ..., are elements of $\Sigma$ (terminals).
Lower case letters from the end of the alphabet, i.e., w, x, y, z, are elements of $\Sigma^*$ (words).
Upper-case letters, i.e., A, B, C, ..., are elements of $N$ (non-terminals).
$S$ is always our start symbol.
Greek letters, i.e., $\alpha, \beta, \gamma$, are elements of $V^*$ (recall that $V = N \cup \Sigma$).
In most programming languages, the terminals of the context-free languages are the tokens, which are the words in the regular language.
This is why scanners categorize tokens (e.g. all of the infinitely many IDs are "ID"): so that the CFL's alphabet is finite.
It is possible to define CFGs directly over the input characters: this is called scannerless.
Let’s revisit and
We can also write this using a shorthand:
Find a derivation of . Recall our CFG above.
Definition
Over a CFG $G = (N, \Sigma, P, S)$, we say that
- $A$ derives $\gamma$, and we write $A \Rightarrow \gamma$, if and only if there is a rule $A \to \gamma$ in $P$.
- $\alpha A \beta \Rightarrow \alpha \gamma \beta$ if and only if there is a rule $A \to \gamma$ in $P$.
- $\alpha \Rightarrow^* \beta$ if and only if a derivation exists, that is, there exist $\delta_0, \dots, \delta_k$ for some $k \ge 0$ such that $\alpha = \delta_0 \Rightarrow \delta_1 \Rightarrow \dots \Rightarrow \delta_k = \beta$. Note that $k$ can be $0$.
Solution:
Hence,
Question
Why Context-Free?
Context-free languages actually add a sort of context to regular languages, so why are they “context free”?
They’re free of a different sort of context. For instance, a context-free language can’t catch this:
We need the context that a is an int to know that this isn't allowed.
Definition
Define the language of a CFG $G = (N, \Sigma, P, S)$ to be $L(G) = \{w \in \Sigma^* : S \Rightarrow^* w\}$.
Definition
A language $L$ is context-free if and only if there exists a CFG $G$ such that $L = L(G)$.
Every regular language is context-free.
- Union:
- Concatenation:
- Kleene Star:
Lecture 10
Practice
Let . Find a CFG for each of the following
CFG:
Derivation:
- Palindromes over
A fundamental example
Let’s consider arithmetic operations over Find
- A CFG for : arithmetic expressions from without parentheses, and a derivation for
- A CFG for : Well-formed arithmetic expressions from with balanced parentheses, and a derivation for
: Expression without parentheses
: Expression with parentheses
This does not work. Instead, we need
Notice, in these two derivations that we had a choice at each step which element of to replace.
Leftmost derivation: In the first derivation, we chose to do a left derivation, that is, one that always expands from the left first.
Rightmost derivation: In the second derivation, we chose to do a right derivation, that is, one that always expands from the right first.
Parse Trees
flowchart TD
S --- a
S --- S1[S]
S --- b
S1[S] --- a1[a]
S1[S] --- S2[S]
S1[S] --- b1[b]
S2[S] --- a2[a]
S2[S] --- S3[S]
S2[S] --- b2[b]
S3[S] --- ε
Question
Is it possible for multiple leftmost derivations (or multiple rightmost derivations) to describe the same string?
Consider two leftmost derivations for .
flowchart TD
S --- S1[S]
S1 --- a
S --- R1[R]
R1 --- -
S --- S2[S]
S2 --- S3[S]
S3 --- b
S2 --- R2[R]
R2 --- *
S2 --- S4[S]
S4 --- c
flowchart TD
S[S] --- S1[S]
S1 --- S2[S]
S2 --- a
S1 --- R1[R]
R1 --- -
S1 --- S3[S]
S3 --- b
S --- R2[R]
R2 --- *
S --- S4[S]
S4 --- c
Definition
A grammar for which some word has more than one distinct leftmost derivation/rightmost derivation/parse tree is called ambiguous.
Question
Why do we care about this?
As compiler writers, we care about where the derivation came from: parse trees give meaning to the string with respect to the grammar.
Post-order traversal of trees pseudocode.
We use some sort of precedence heuristic to guide the derivation process.
Or, make the grammar unambiguous. This is what we did without first (incomplete) (by adding parentheses).
Lecture 11
There is a better way to eliminate ambiguity.
In a parse tree, we evaluate the expression in a depth-first, post-order traversal.
We can make a grammar left/right associative by crafting the recursion in the grammar.
Forcing Right Associative
This forces a right-associative grammar (for , will evaluate first)
Forcing Left Associative
This forces a left-associative grammar (recurses to the left).
We can use this to create a grammar that follows BEDMAS (by making higher-precedence operators appear further down the tree).
Question
If $L$ is a context-free language, is there always an unambiguous grammar $G$ such that $L(G) = L$?
No. This was proven by Rohit Parikh in 1961.
Question
Can we write a computer program to recognize whether a grammar is ambiguous?
No.
Question
Given two CFGs $G_1$ and $G_2$, can we determine whether $L(G_1) = L(G_2)$? What about determining whether $L(G_1) \subseteq L(G_2)$?
This is still undecidable.
We use parsers to handle CFLs and CFGs.
Formally: Given a CFG $G$ and a terminal string $w$, find a derivation, that is, the steps such that $S \Rightarrow \dots \Rightarrow w$, or prove that $w \notin L(G)$.
We can find this with two broad ideas (top-down and bottom-up).
Top-down parsing: Start with $S$ and then try to get to $w$.
Bottom-up parsing: Start with $w$ and work our way backwards to $S$.
Top-Down Parsing
Start with $S$ and look for a derivation that gets us closer to $w$. Then, repeat with remaining non-terminals until we're done.
The main trick is “look” for derivation. Thus, the core problem is to predict which derivation is right.
We present the algorithm (in practice, real compilers do not use this).
Here is the pseudocode for a top-down parsing algorithm (before having the knowledge of a predictor table)
This has some problems.
- The oracle is not real
- When we reach the end of input, we have no way of realizing we weren’t done with the stack
- The oracle should be able to tell us that no production matches at all
Lecture 12
We augment the grammar to begin and end with ⊢ and ⊣ to solve problem 2.
We treat these symbols as BOF and EOF characters.
Now let’s solve problem 1 and 3.
This oracle could try all possible productions (way too expensive).
Solution: Use a single symbol lookahead to determine where to go. (This is a heuristic, doesn’t always work).
We construct a predictor table to tell us where to go: given a non-terminal on the stack and a next input symbol, what rule should we use?
Always looking at a terminal input symbol, so figuring out how they match is non-trivial.
But this is hard. Consider the following
How could we decide between rules 1 and 2 based on the next symbol? They both start with the same non-terminal, so either can start with a, b, or c.
This is a limitation of the top-down parsing algorithm.
We will look at a simpler grammar.
Let’s look at
flowchart TD
S' --- BOF
S' --- S
S' --- EOF
S --- L
L --- a
S --- R1[R]
R1 --- +
S --- S1[S]
S1[S] --- L1[L]
L1[L] --- b
S1[S] --- R2[R]
R2[R] --- +1[+]
S1[S] --- L2[L]
L2[L] --- c
flowchart TD
S' --- BOF
S' --- S
S --- L
L --- I
I --- a
S --- M
M --- O
O --- +1[+]
M --- S1[S]
S1[S] --- L1[L]
L1[L] --- I1[I]
I1[I] --- b
S1[S] --- M1[M]
M1[M] --- O1[O]
O1[O] --- +
M1[M] --- S2[S]
S2[S] --- L2[L]
L2[L] --- I2[I]
I2[I] --- c
S2[S] --- M2[M]
M2 --- empty
S' --- EOF
Predictor Table of the simple grammar above.
If we are at the non-terminal on the left and encounter the terminal symbol on the top, pick the rule with that number (see grammar above). Pop the elements that match the input string and repeat until we have finished (or encounter an error).
Definition
A grammar is called LL(1) if and only if each cell of the predictor table contains at most one entry.
For an LL(1) grammar, we don't need sets in our predict table.
Question
Why is it called LL(1)?
First L: Scan left to right
Second L: Leftmost derivations
Number of symbols of lookahead: 1
Not all grammars work with top-down parsing. Our simple grammar from above is not LL(1), since more than one value appears in a cell of the predictor table.
Constructing the Lookahead Table
Our goal is the following function, which is our predictor table.
Predict$(A, a)$: the production rule(s) that apply when $A$ is on the stack and $a$ is the next input character.
To do this, we also introduce the following function: First$(\beta)$ is the set of characters that can be the first symbol of a derivation starting from $\beta$.
. Thus, First(
More formally:
Predict$(A, a) = \{A \to \beta \in P : a \in \text{First}(\beta)\}$
First$(\beta) = \{a \in \Sigma : \beta \Rightarrow^* a\gamma \text{ for some } \gamma\}$
Example of First
Recall the grammar from above with 8 steps.
First
First First( First(. So to compute First() we first need First().
First (
First ( First (
First (
First (
First (
First(
First (
The problem with Predict$(A, a)$ for some rule $A \to \beta$ is that it is possible that $\beta \Rightarrow^* \varepsilon$. This would mean that the $a$ didn't come from $\beta$ but rather from some symbol after $A$.
Example
Notice that
In the top-down parsing algorithm, we reach the stack is and remaining input is .
We look at Predict( which is empty since First(. Thus we reach an error.
We augment our predict table to include the elements that can follow a non-terminal symbol, if that symbol can derive $\varepsilon$.
In this case, we need to include that Predict(
To correct this, we introduce two new functions.
Nullable$(\beta)$: boolean function; for $\beta \in V^*$, it is true if and only if $\beta \Rightarrow^* \varepsilon$.
Follow$(A)$: for any $A \in N$, this is the set of elements of $\Sigma$ that can come immediately after $A$ in a derivation starting from the start symbol.
Definition
We say that $\beta$ is nullable if and only if Nullable$(\beta) = $ true.
Example of Follow
Follow (Always)
Follow(
Follow(
Follow(
Question
What happens with Predict$(A, a)$ if Nullable$(\beta) = $ false?
Follow is still some set of terminals but it won’t be relevant since we would need to consider what happens to First first.
Thus, the Follow function only matters if Nullable is true.
This motivates the following correct definition of our predictor table:
Definition
Predict$(A, a) = \{A \to \beta : a \in \text{First}(\beta)\} \cup \{A \to \beta : \text{Nullable}(\beta) \text{ and } a \in \text{Follow}(A)\}$
This is the full, correct definition. Notice that this still requires that the table only have one member of the set per entry to be useful as a deterministic program.
Note that Nullable$(\beta)$ is false whenever $\beta$ contains a terminal symbol.
Further, Nullable$(\beta_1 \beta_2 \cdots \beta_k) = $ Nullable$(\beta_1) \wedge \cdots \wedge$ Nullable$(\beta_k)$.
Thus, it suffices to compute Nullable$(A)$ for all $A \in N$.
Lecture 13
Algorithm of Nullable()
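A sketch of the usual fixed-point computation in C++: start with every non-terminal marked not nullable, and repeatedly mark $A$ nullable if some rule $A \to B_1 \cdots B_k$ has every $B_i$ already nullable (an empty right-hand side counts vacuously); stop when a pass makes no change. The small grammar here is a made-up example, not the one from the lecture.

```cpp
#include <iostream>
#include <map>
#include <string>
#include <vector>

int main() {
    // Each production is LHS -> sequence of symbols (empty rhs = epsilon).
    // A symbol is a non-terminal iff it appears in the nullable map.
    std::vector<std::pair<std::string, std::vector<std::string>>> rules = {
        {"S", {"A", "B"}}, {"A", {}}, {"A", {"a"}}, {"B", {"b"}}
    };
    std::map<std::string, bool> nullable = {{"S", false}, {"A", false}, {"B", false}};

    bool changed = true;
    while (changed) {                    // iterate until a fixed point
        changed = false;
        for (const auto& [lhs, rhs] : rules) {
            if (nullable[lhs]) continue;
            bool allNullable = true;     // empty rhs (epsilon) is vacuously true
            for (const auto& sym : rhs)
                if (!nullable.count(sym) || !nullable[sym]) { allNullable = false; break; }
            if (allNullable) { nullable[lhs] = true; changed = true; }
        }
    }
    for (const auto& [nt, v] : nullable) std::cout << nt << ": " << v << "\n";
    // A: 1, B: 0, S: 0
}
```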
Example of Nullable
Nullability Table
Iter | 0 | 1 | 2 | 3 |
---|---|---|---|---|
Thus, Nullable( = Nullable( and Nullable( Nullable( |
Notes about First
Main idea: Keep processing from a production rule until you encounter a terminal or a symbol that is not nullable. Then go to the next rule. Repeat until no changes are made during the processing.
Remember, $\varepsilon$ isn't a real symbol, and can't be in a First set.
For First, we will ignore trivial productions of the form based on the above observation.
Further, First( always.
We first compute First$(A)$ for all $A \in N$ and then we compute First$(\beta)$ for all relevant $\beta$.
Computing First
Example of First
First Table
Iter | 0 | 1 | 2 | 3 |
---|---|---|---|---|
Recall, Nullable( = Nullable( and Nullable( Nullable(
Thus, First(, First(, First(, First(
Computing First (2)
Computing Follow
Example of Follow
Follow Table
Iter | 0 | 1 | 2 |
---|---|---|---|
Lecture 14
Cheat sheet
Nullable:
- $A \to \varepsilon$ implies that Nullable$(A) = $ true. Further, Nullable$(\varepsilon) = $ true.
- If $A \to B_1 B_2 \cdots B_k$ and each Nullable$(B_i)$ is true, then Nullable$(A)$ is true.
First:
- If $A \to a\beta$ with $a$ a terminal, then $a \in$ First$(A)$.
- If $A \to B_1 B_2 \cdots B_k$, then First$(A) \supseteq$ First$(B_1)$, and First$(A) \supseteq$ First$(B_{i+1})$ for each $i$ until Nullable$(B_i)$ is false.
Follow:
- If $B \to \alpha A \gamma$, then Follow$(A) \supseteq$ First$(\gamma)$.
- If $B \to \alpha A \gamma$ and Nullable$(\gamma)$ is true, then Follow$(A) \supseteq$ Follow$(B)$.
Primary Issue
With this grammar:
The primary issue is that left recursion is at odds with LL(1). In fact, left-recursive grammars are never LL(1). Examine the derivations for two strings that begin with the same symbol.
Notice that they have the same first character but require different starting rules from $S$. That is, some Predict cell contains more than one rule. Our first step is to at least make this right recursive.
To make a left-recursive grammar right recursive: say we have $A \to A\alpha \mid \beta$, where $\beta$ does not begin with the non-terminal $A$. We remove these rules from our grammar and replace them with:
$A \to \beta A'$ and $A' \to \alpha A' \mid \varepsilon$
The above solves our issue
We get a right-recursive grammar. This is LL(1).
However: recall that we didn’t want these grammars, because they were right associative. This is an issue we need to resolve with new techniques.
Not all right-recursive grammars are LL(1). Consider:
We get a right-recursive grammar, but it is not LL(1).
Again, some Predict cell contains more than one rule. There is still hope: we can apply a process known as factoring.
Left factoring
Idea: If $A \to \alpha\beta_1 \mid \alpha\beta_2 \mid \gamma$, where $\gamma$ is representative of other productions that do not begin with $\alpha$, then we can change this to the following equivalent grammar by left factoring: $A \to \alpha A' \mid \gamma$ and $A' \to \beta_1 \mid \beta_2$.
Recursive-Descent Parsing
Fixing the parse trees that come out of right-recursive and left-factored grammars is the main thing that ad hoc recursive-descent solutions address.
The actual sequence of steps follows the transformed grammar, but they then generate a different parse tree by changing it on the fly.
Bottom-up Parsing
Recall: Determining the in
Idea: Instead of going from $S$ to $w$, let's try to go from $w$ to $S$. Overall idea: look for the RHS of a production, replace it with the LHS. When you don't have enough for a RHS, read more input. Keep grouping until you reach the start state.
Our stack this time will store the in reverse order (Contrast to top-down which stores the in order)
Our invariant here will be Stack Unread Input (Contrast to top-down where invariant was consumed input reversed Stack contents
Example
Recall our grammar:
We wish to process
Stack | Read | Processing | Action |
---|---|---|---|
Shift | |||
Shift | |||
Shift | |||
Reduce (); pop , , push | |||
Shift | |||
Shift | |||
Shift | |||
Shift | |||
Reduce (); pop push | |||
Reduce (); pop push | |||
Shift | |||
Reduce (); pop push | |||
Accept |
Theorem
For any grammar $G$, the set of viable prefixes (stack configurations that can arise during a bottom-up parse) is a regular language, and the NFA accepting it corresponds to items of $G$. Converting this NFA to a DFA gives a machine whose states are sets of valid items for a viable prefix.
We will show how to use this theorem to create automata that help us accept the words generated by a grammar.
Consider the following context-free grammar:
Definition
An item is a production with a dot somewhere on the right hand side of a rule
Items indicate a partially completed rule. We will begin in a state labelled by the rule
That dot is called the bookmark
Construction
From a state, for each rule in the state, move the dot forward by one character. The transition function is given by the symbol you jumped over.
For example, with , we move the over . Thus, the transition function will consume the symbol .
The state we end up in will contain the item
In the new state, if the set of items contains an item whose bookmark sits immediately before some non-terminal $A$, we then add all rules with $A$ on the left-hand side, with a dot preceding the right-hand side.
In this case, this state will include the rules and .
Notice now we also have and so we also need to include the rules where is the left-hand side, adding the rule .
If we find ourselves at a familiar state, reuse it instead of remaking it.
We continue with these steps until there are no bookmarks left to move. Then we have the final DFA.
We skipped the NFA step by putting all these items in the same rule. You may see versions of this algorithm that involve building an NFA and then converting, but the result will be the same.
Back to the example
This automaton is our oracle. Run the stack through the automaton, and:
- If you end up in a state with the bookmark at the right-hand side of an item, perform that reduction
- If you end up in a state with the bookmark elsewhere, shift
- Else (error state), reject
Lecture 15
Stack (TOS: right) | Read | Unread | Action |
---|---|---|---|
Shift state 0 | |||
State 1 Shift | |||
State 5 reduce(3), pop d, push T | |||
State 4 reduce(2), pop T, push S | |||
State 2, shift | |||
State 6, shift | |||
State 5, reduce (3), pop d, push T | |||
State 7, reduce(1) pop S and T, push S | |||
The stack is a stack, so the bottom of the stack (the beginning of our input) doesn't usually change. We're rerunning the whole DFA even when the prefix of our stack is the same. Because of this, our algorithm is quadratic.
Instead, remember how we moved through the DFA in a state stack, and push and pop the state stack at the same time as the symbol stack. That way, we don't repeat the work of getting to a state when the prefix hasn't changed.
This brings us to .
Stack column above becomes Symbol Stack
State Stack | Symbol Stack | Read | Unread | Action |
---|---|---|---|---|
Shift state 0 | ||||
State 1 Shift | ||||
State 5 reduce(3), pop d, push T | ||||
State 4 reduce(2), pop T, push S | ||||
State 2, shift | ||||
State 6, shift | ||||
State 5, reduce (3), pop d, push T | ||||
State 7, reduce(1) pop S and T, push S | ||||
Possible Issues
Issue one (Shift-Reduce): What if a state has two items of the form:
Question
Should we shift or reduce?
Note, having two items that shift, e.g.:
is not an issue.
Definition
A grammar is LR(0) if and only if, after creating the automaton, no state has a shift-reduce or reduce-reduce conflict.
Recall that LL(1) grammars were at odds with left-recursive grammars.
Question
Are bottom-up grammars in conflict with a type of recursion?
Not usually. Bottom-up parsing can support left- and right-recursive grammars. However, not all grammars are LR grammars.
Consider the grammar
New Automaton
Conflict
State 4 has a shift-reduce conflict.
Suppose the input began with . This gives a stack of and then we reduce in state 5, so our stack changes to and we move to state 4 via state 1.
Should we reduce? It depends.
We add a lookahead to the automaton to fix the conflict. For every completed item $A \to \beta\,.$, attach Follow$(A)$. Recall:
Note that Follow() and Follow() . So state 4 becomes
and
Apply if the next token is , and apply if the next token is .
With lookahead from Follow sets on reduce items, we call these parsers SLR(1) parsers.
is not a simplified version of , it’s just different.
Building the Parse Tree
With top-down parsing, when we pop a non-terminal from the stack and push the symbols of its right-hand side, the non-terminal is a node and the new symbols become its children.
With bottom-up parsing, when you reduce, you pop the right-hand side symbols from the stack and push the left-hand side. You then keep the old symbols as children of the new node.
A Last Parser Problem
Most famous problem in parsing: the dangling else.
if (a) if (b) S1; else S2;
Lecture 16
Where are we now?
We have finished syntactic analysis and are now on to type checking (semantic analysis).
Semantics: Does what is written make sense?
Not everything can be enforced by a CFG. Examples
- Type checking
- Declaration before use
- Scoping
- Well-typed expressions
To solve these, we can move to context-sensitive languages.
As it turns out, CSL’s aren’t a very useful formalism.
We already needed to give up many CFGs to make a parser handle CFLs; with CSLs, it would be even worse.
As such, we treat context-sensitive analysis as analysis (looking over the parse tree generated by CFL parsing) instead of parsing (making its own data structure).
- pre-order tree traversal
- post-order tree traversal
- hybrid of the two
We will traverse our parse tree to do our analysis.
We still need to check for:
- Variables declared more than once
- Variables used but not declared
- Type errors
- Scoping as it applies to the above
Declaration errors
Question
How do we determine multiple/missing declaration errors?
We’ve done this before. We construct a symbol table.
- Traverse the parse tree for any rules of the form dcl type ID.
- Add the ID to the symbol table
- If the name is already in the table, give an error.
Checking
To verify that variables have been declared, check for rules of the form factor → ID and lvalue → ID. If the ID is not in the symbol table, produce an error. The previous two passes can be merged.
def: dcl → type ID (make sure it is not already in the symbol table, and add it to the symbol table)
Use of variables:
factor → ID (check to see if it is in the symbol table; return an error if not)
lvalue → ID (check to see if it is in the symbol table; return an error if not)
Question
With labels in MIPS in the assembler, we needed two passes. Why do we only need one in the compiler?
We need to declare variables before using them. This is not true for labels.
Note that in the symbol table, we should also keep track of the type of each variable. Why is this important? Just by looking at the bits, we cannot figure out what they represent. Types allow us to interpret the contents.
Good systems prevent us from interpreting bits as something we should not.
For example
should be a type mismatch.
This is just a matter of interpretation.
In WLP4, there are two types: int and int*, for integers and pointers to integers.
For type checking, we need to evaluate the types of expressions and then ensure that the operations we use between types corresponds correctly.
Question
If given a variable (in the wild), how do we determine its type?
Use its declaration. Need to add this to the symbol table.
We can use a global variable to keep track of the symbol table:
Some things can go wrong. This doesn’t take scoping into account. Also need something for functions/declarations.
Consider the following code. Is there an error?
No. Duplicate variables in different procedures are okay.
Is the following an error?
Yes. The variable x is not in scope in wain.
Is the following an error?
Yes. We have multiple declarations of foo.
We resolve this with a separate symbol table per procedure. We also need a global symbol table.
Question
A symbol table is a map, what are we mapping each symbol to?
For type checking, we need to know the type of each symbol. In WLP4, a string could be sufficient, but you should use a data structure so you can add more later.
Question
What about procedures?
Procedures don’t have types, they have signatures.
In WLP4, all procedures return int, so we really just need the argument types: an array of types.
Computing the signature.
Simply need to traverse nodes in the parse tree of these forms.
- params → ε
- params → paramlist
- paramlist → dcl
- paramlist → dcl COMMA paramlist
This can be done in a single pass.
Consider
Global symbol table:
foo: [int], wain: [int*, int]
Local symbol tables:
foo: a: int, x: int
wain: a: int*, b: int
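One possible C++ shape for these tables; the type aliases and field names are illustrative, not prescribed by the course.

```cpp
#include <map>
#include <string>
#include <vector>

// Signature of a procedure: the parameter types, in order.
// (In WLP4 every procedure returns int, so no return type is needed.)
using Signature = std::vector<std::string>;

// One local symbol table per procedure: variable name -> type ("int" or "int*").
using LocalTable = std::map<std::string, std::string>;

struct SymbolTables {
    std::map<std::string, Signature> globals;   // procedure name -> signature
    std::map<std::string, LocalTable> locals;   // procedure name -> its variables
};

int main() {
    SymbolTables t;
    t.globals["foo"]  = {"int"};
    t.globals["wain"] = {"int*", "int"};
    t.locals["foo"]   = {{"a", "int"}, {"x", "int"}};
    t.locals["wain"]  = {{"a", "int*"}, {"b", "int"}};
    // Declaration check: error if a name is declared twice in one table;
    // use check: error if an ID is missing from the current procedure's table.
}
```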
Type errors
Question
What are type errors and how to find them?
Two separate issues:
- What are type errors? (Definition)
- How to find them? (Implementation)
Definition of Type (Errors)
Need a set of rules to tell us
- The type of every expression
- Whether an expression makes sense with the types of its subexpressions
- Whether a statement makes sense with the types of its subexpressions
Detection of Type (Errors)
There’s really only one algorithm with a tree: traverse the tree. Implement a (mostly) post-order traversal that applies defined rules based on which expressions it encounters.
Inference rules are Post rules (like in CS245)
If an ID is declared with a type, then it has that type.
Numbers have type int.
NULL is of type int*.
Inference rules for types
Inference rules are the true case. If no inference rule matches, that means the expression or statement doesn’t type check: type error.
Look for good, not for bad: errors should always be the “else” case.
Parentheses do not change the type
The address of an int is of type int*.
Dereferencing an int* gives an int.
If E has type int, then new int[E] is of type int*.
Arithmetic Operations
Multiplication
Division
Modulo
Addition
Subtraction
Procedure Calls:
The basic kind of statement is an expression statement. An expression statement is okay as long as the expression has a type. We will need rules for all the other statements too. Statements don't have a type, but they can be "well typed".
Lecture 17
Control statements:
The value of T
above should be a boolean. But WLP4 doesn’t have booleans.
Our grammar forces it to be a boolean expression, so we don’t need to check that.
But, we still need to check its subexpressions.
Inference rules for well-typed
An if statement is well-typed if and only if all of its components are well typed
A while statement is well-typed if and only if all of its components are well-typed
There is a final sanity check with the left- and right-hand sides of an assignment statement.
Given an expression, say , notice the left-hand side and the right-hand side represent different things.
The left-hand side represents a place to store data; it must be a location in memory. The right-hand side must be a value; that is, any well-typed expression.
Anything that denotes a storage location is an lvalue.
Consider the following two snippets of code
This is okay; the lvalue x
is a storage location.
This is not okay; the lvalue 5 is an integer and not a storage location.
For us, lvalues are any of variable names, dereferenced pointers and any parenthetical combinations of these. These are all forced on us by the WLP4 grammar so the checking is done for you.
The empty sequence is well-typed
Consecutive statements are well-typed if and only if each statement is well-typed
Procedures are well-typed if and only if the body is well-typed and the procedure returns an int
.
Type-checking recommendations - brush up on recursion. Everything from this point on is traversing a tree.
Example
Type-check a tree. We will use the code from last time.
Infer types for sub expressions using the rules.
Find three distinct one-character changes that make this code fail to type-check (while still passing parsing):
Change int a to int* a.
Change return a + b; to return *a + b;.
Change return a + b; to return &a + b;.
There are infinitely many equivalent MIPS programs for a single WLP4 program.
Question
Which should we output?
Correctness is most important. We seek a simplistic solution. Efficiency to compile and efficiency of the code itself.
Real compilers have an intermediate representation (IR) that's close to assembly code but (at least) has infinite registers. This IR is good for optimization. We don't do optimization, so we won't use an IR.
Our step of generating assembly will be (more-or-less) the “generate IR” step of a larger compiler. MIPS is our IR.
Recall our mips.twoints convention that registers $1 and $2 store the input values, and our usual convention to return in register $3.
Question
How did we know where a was stored? What if we had more arguments? What if we had done something complex with intermediary results?
The parse tree will be the same if we'd done return b instead of return a.
The parse tree isn’t going to be enough to determine the difference between the two pieces of code.
Question
How can we resolve this? How can we distinguish between these two codes?
We use the symbol table.
Symbol | Type | Location |
---|---|---|
a | | $1 |
b | | $2 |
We make the convention that $4 always contains the number 4.
Instead of the symbol table storing registers, it should store the offset from the stack pointer.
Symbol | Type | Offset |
---|---|---|
a | | 4 |
b | | 0 |

Offsets from the stack pointer will cause problems.
Variables also have to go on the stack but we don’t know what the offsets should be until we process all of the variables and parameters.
For example
Code generated
As we process the code, we need to be able to compute the offset as we see the values. Also, we need to handle intermediate values of complicated arithmetic expressions by storing on the stack.
Symbol | Type | Offset (from $30) |
---|---|---|
How then do we arrange it so that when we see the variable, we know what the offset is? Remember that the key issue here is that $30 (the top of the stack) changes.
Reference the offset from the bottom of the stack frame. We will store this value in $29. This is called the “frame pointer”.
If we calculate offsets from $29, they will be unchanged as the stack grows and shrinks.
Symbol | Type | Offset (from $29) |
---|---|---|
What about a more complicated program?
Question
How do we handle this?
When a and b were in registers, we could just subtract them. Now we need to load them, then subtract them.
Load them into registers? We’ll run out of registers again with more complicated behaviour.
We'll continue to use $5 for intermediate scratch values.
Question
Where does this approach break down?
Consider something like . Would need to load , load , load , compute , then compute the answer. This would require a third register.
Question
Where should we store these values instead?
On the stack again.
Abstraction: We’ll use some shorthand for our code
code(E) represents the generated code for E.
code(x), where x is a variable stored at offset N in the symbol table:
lw $3, N($29)
push($x): sw $x, -4($30); sub $30, $30, $4
pop($x): add $30, $30, $4; lw $x, -4($30)
Lecture 18
Let’s compute the MIPS code for
Solution
We can generalize this technique so we only need two registers for any computation. (Divide and conquer).
Singleton grammar productions are relatively straightforward to translate:
code(S → BOF procedures EOF): code(procedures)
code(expr -> term): code(term)
lvalues are odd. Recursion might do the wrong thing.
The basic idea of our code function is that it produces the code to put the value of the expression in $3
.
lvalues shouldn’t actually generate values.
If we have { int x = 0; x = 3; }, having the code for x in x = 3 generate 0 would be useless.
We could have the code function do something different for lvalues
code(lvalue -> anything) produces an address in $3 instead of its value.
Or, we could have the code function for expressions that include lvalues do something different based on the kind of lvalue.
code(stmt -> (lvalue -> ID) BECOMES expr SEMI) has to be distinct from code(stmt -> (lvalue -> STAR expr) BECOMES expr SEMI).
This code is less modular and less maintainable (not recommended).
We have two ways of outputting and one way of inputting: println, putchar, getchar.
Character I/O corresponds directly to memory-mapped I/O addresses:
println is more complex than getchar or putchar.
What do we generate for stmt → PRINTLN LPAREN expr RPAREN SEMI?
We had to write MIPS code to do this, but that would be a lot of code to generate each time.
A compiler mixes the code it outputs with code from a runtime environment
Definition
A runtime environment is the execution environment provided to an application or software by the operating system to assist programs in their execution. Such an environment may include things such as procedures, libraries, environment variables etc.
MERL files: MIPS Executable Relocatable Linkable, a format for object files. MERL lets us store, alongside the output code, the additional information needed by the loader and linker.
We will now need to do
gcc and clang are not compilers, they are compiler drivers. They call other programs to compile, assemble, and link code.
To use print, we need to add .import print at the beginning of the file.
After this we can use print
in our MIPS code. It will print the contents of $1
.
We will write Baby’s First Operating System
Where should this be stored? We could choose different addresses at assembly time, but how do we make sure they don't conflict?
A more flexible option is to make sure that code can be loaded anywhere.
Loader's job:
- Take a program P as input
- Find a location α in memory for P
- Copy P into memory, starting at α
- Return α to the OS
Baby’s First OS v2
Question
How did we assemble
.word label
?
It assembles to an address: the address of that label, assuming the program was loaded at address 0. The loader needs to fix this.
More problems
.word id
: need to add alpha to id
.word constant
: do not relocate
beq and bne (whether they use labels or not): do not relocate, since branch offsets are relative to the current instruction.
We translate assembly code into machine code (bits). Given: 0x00000018
, is this .word 0x18
or .word id
? We can’t know.
Thus, we need a way for our loader to know what does and what doesn’t need to be relocated. We need to remember which .words were labels.
In our MERL files we need the code, but also the location of any .word
ids.
MERL would be
Loading requires two passes:
Pass 1: Load the code from the file into the selected location in memory.
Pass 2: Using the relocation table, update any memory addresses that have relocation entries.
Even with this, it is possible to write code that only works at address 0:
We should never encode addresses as anything other than labels, so that the loader can update the references.
Lecture 19
Most of our statements have been completed (except for if and while).
We need to handle boolean tests for conditionals. Convention is to store 1 inside of $11
.
Code structure
Question
What is the code for the rule test → expr LT expr?
For test → expr GT expr, we change slt $3, $5, $3 to slt $3, $3, $5.
Translate test expr != expr.
Now for test → expr EQ expr?
The key idea is that == is just the negation of !=.
We get it by adding the line sub $3, $11, $3, which flips 1 to 0 and vice versa.
For test → expr LE expr and test → expr GE expr, we use the fact that a <= b is the same as !(a > b), and a >= b is the same as !(a < b).
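A sketch of how these tests might be emitted (our helper names, not official ones), assuming the usual push/pop pattern leaves the first operand in $5 and the second in $3, and that $11 holds 1. As discussed later, sltu replaces slt when the operands are pointers.

```cpp
#include <string>

// test -> expr1 LT expr2: result (0 or 1) ends up in $3.
std::string codeLt(bool pointers) {
    std::string slt = pointers ? "sltu" : "slt";   // unsigned compare for int*
    return slt + " $3, $5, $3\n";
}

// expr1 != expr2 is (expr1 < expr2) OR (expr2 < expr1); at most one is 1.
std::string codeNe(bool pointers) {
    std::string slt = pointers ? "sltu" : "slt";
    return slt + " $6, $3, $5\n"      // $6 = expr2 < expr1
         + slt + " $7, $5, $3\n"      // $7 = expr1 < expr2
         + "add $3, $6, $7\n";        // 1 iff the operands differ
}

// expr1 == expr2 is the negation of !=; $11 always holds 1.
std::string codeEq(bool pointers) {
    return codeNe(pointers) + "sub $3, $11, $3\n";
}
```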
Rule: statement → IF (test) {stmts 1} ELSE {stmts 2}
If we have multiple if statements, the label names will conflict.
We need a way of inventing totally unique label names.
We can keep track of how many if statements we have. Keep a counter ifcounter
.
Each time we have an if statement, increment this counter.
Use label names like else#
and endif#
where #
corresponds to the ifcounter
.
Rule: statement → WHILE (test) {statements}
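A sketch of both control-flow templates (ours), assuming code(test) leaves 1 in $3 when the test passes and that a shared counter keeps label names unique, as described above:

```cpp
#include <cstdio>
#include <string>

int labelCounter = 0;   // every if/while gets fresh labels from this counter

// statement -> IF (test) {stmts1} ELSE {stmts2}
void emitIf(const std::string &test, const std::string &thenPart, const std::string &elsePart) {
    int id = labelCounter++;
    std::printf("%s", test.c_str());             // $3 = 1 if the test passed
    std::printf("beq $3, $0, else%d\n", id);     // test failed: jump to the else part
    std::printf("%s", thenPart.c_str());
    std::printf("beq $0, $0, endif%d\n", id);    // unconditionally skip the else part
    std::printf("else%d:\n", id);
    std::printf("%s", elsePart.c_str());
    std::printf("endif%d:\n", id);
}

// statement -> WHILE (test) {stmts}
void emitWhile(const std::string &test, const std::string &body) {
    int id = labelCounter++;
    std::printf("loop%d:\n", id);
    std::printf("%s", test.c_str());
    std::printf("beq $3, $0, endWhile%d\n", id); // exit the loop when the test fails
    std::printf("%s", body.c_str());
    std::printf("beq $0, $0, loop%d\n", id);     // jump back and re-evaluate the test
    std::printf("endWhile%d:\n", id);
}
```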
Since we are generating MIPS code, we can also generate comments along with it. Debugging code generators is hard, so comments in the output help.
Recap
- $0 is always 0
- $1 and $2 are for arguments (in wain, the two parameters)
- $3 is always for output (results)
- $4 is always 4
- $5 is always for intermediate computations
- $6 and $7 may be used for intermediate computations
- $10 will store the address of print
- $11 is reserved for 1
- $29 is our frame pointer (fp)
- $30 is our stack pointer (sp)
- $31 is our return address (ra)
Prologue
At the beginning of the code, we
- Load 4 into $4 and 1 into $11
- Import print and store its address in $10
- Store the return address on the stack
- Initialize the frame pointer, hence creating a stack frame
- Store registers $1 and $2
Body
The body needs to store local variables in the stack frame, and contains the MIPS code corresponding to the WLP4 program.
Epilogue
We need to pop the stack frame and restore the previously saved registers.
We have reached pointers. We need to support all the following
- NULL
- Allocating and deallocating heap memory
- Dereferencing
- Address-of
- Pointer arithmetic
- Pointer comparisons
- Pointer assignments and pointer access
NULL cannot be the value 0x0, since that is a valid memory address. We want any attempt to dereference our NULL to crash, so we pick a NULL that is not word-aligned (not a multiple of 4). We can pick 0x1.
Question
What about dereferencing?
factor → STAR factor
The value of the inner factor is a pointer (otherwise we have a type error). We want to access the value it points to.
We load the pointer into $3. Since $3 then holds a memory address, we load the word at that address and store it back into $3.
Need to be careful of the difference between lvalues and pointers.
Recall that an lvalue is something that can appear as the LHS of an assignment rule. We can have a rule factor → AMP lvalue
. Suppose we have an ID value a
. How do we find out where a
is in memory?
We use our symbol table. We can load the offset first and then use this to find out the location.
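A sketch of both pointer cases just described (our helper names), where N stands for the variable's offset in the symbol table:

```cpp
#include <string>

// factor -> STAR factor: the inner factor's value (an address) is in $3;
// replace it with the word stored at that address.
std::string codeDeref(const std::string &codeInner) {
    return codeInner + "lw $3, 0($3)\n";
}

// factor -> AMP lvalue, where the lvalue is an ID at offset N from $29:
// compute the variable's address instead of loading its value.
std::string codeAddressOfId(int N) {
    return "lis $3\n.word " + std::to_string(N) + "\n"   // the offset itself
         + "add $3, $3, $29\n";                          // offset + frame pointer = address
}

// factor -> AMP (lvalue -> STAR expr): &*e is just e, so emit code(expr) only.
```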
Comparisons of pointers work the same as with integers with one exception. Pointers cannot be negative, so slt
is not what we want to use. We should use sltu
instead.
Given test → expr COMP expr
how can we tell which of slt
or sltu
to use? We check the type of exprs.
Recall for addition and subtraction we have several contracts. The code for addition will vary based on the type of its subexpressions. For int + int
or int - int
, we proceed as before. This leaves the contracts that use pointers, listed below; a sketch of the generated code for these cases follows the list.
Addition
- expr_1 → expr_2 + term where type(expr_2) == int* and type(term) == int
- expr_1 → expr_2 + term where type(expr_2) == int and type(term) == int*

Subtraction
- expr_1 → expr_2 - term where type(expr_2) == int* and type(term) == int
- expr_1 → expr_2 - term where type(expr_2) == int* and type(term) == int*
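A sketch of the code emitted for these contracts (ours): $4 holds 4, and the usual push/pop pattern leaves the left operand in $5 and the right operand in $3.

```cpp
#include <string>

// int* + int: scale the integer by 4 (one word) before adding.
std::string codePtrPlusInt() {
    return "mult $3, $4\nmflo $3\n"   // $3 = int operand * 4
           "add $3, $5, $3\n";        // pointer + scaled offset
}

// int + int*: same idea, but now the int is in $5 and the pointer in $3.
std::string codeIntPlusPtr() {
    return "mult $5, $4\nmflo $5\n"
           "add $3, $5, $3\n";
}

// int* - int: scale, then subtract.
std::string codePtrMinusInt() {
    return "mult $3, $4\nmflo $3\n"
           "sub $3, $5, $3\n";
}

// int* - int*: subtract the addresses, then divide by 4 to get a count of ints.
std::string codePtrMinusPtr() {
    return "sub $3, $5, $3\n"
           "div $3, $4\nmflo $3\n";
}
```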
Lecture 20
We need to handle calls such as new and delete. We can outsource this work to the runtime.
Prologue Additions
The command init initializes the heap. It must be called at the beginning. It takes a parameter in $2 and initializes the heap's data structures.
New
- Expects the number of words needed in $1
- Returns a pointer to the beginning of a block of that many words in $3 if successful (otherwise places 0 in $3)

Delete
- Requires that $1 contains the address of the block to be deallocated
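A sketch (ours) of how the generated code might call into these routines, assuming the runtime labels new and delete, NULL represented as 1 (held in $11), and new reporting failure by leaving 0 in $3:

```cpp
#include <string>

int deleteCounter = 0;   // for unique skip labels

// factor -> NEW INT LBRACK expr RBRACK: the number of ints requested is in $3.
std::string codeNew() {
    return std::string("add $1, $3, $0\n")             // new expects the count in $1
         + "sw $31, -4($30)\nsub $30, $30, $4\n"        // push($31): jalr clobbers it
         + "lis $5\n.word new\njalr $5\n"
         + "add $30, $30, $4\nlw $31, -4($30)\n"        // pop($31)
         + "bne $3, $0, 1\n"                            // on failure new leaves 0 in $3 ...
         + "add $3, $11, $0\n";                         // ... which we turn into NULL (value 1)
}

// statement -> DELETE LBRACK RBRACK expr SEMI: the pointer to free is in $3.
std::string codeDelete() {
    std::string skip = "skipDelete" + std::to_string(deleteCounter++);
    return "beq $3, $11, " + skip + "\n"                // deleting NULL is a no-op
         + "add $1, $3, $0\n"                           // delete expects the address in $1
         + "sw $31, -4($30)\nsub $30, $30, $4\n"
         + "lis $5\n.word delete\njalr $5\n"
         + "add $30, $30, $4\nlw $31, -4($30)\n"
         + skip + ":\n";
}
```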
We need to now deal with multiple function calls.
Question
Who should save which registers? The caller? The callee? What do functions need to update/initialize? How do we update our symbol table?
Question
What do we need to do for wain?
- Import print, init, new, delete
- Initialize $4, $10, $11
- Call init
- Save $1 and $2
- Reset the stack
- Return with jr $31
For general procedures we don't need any imports, but we need to update $29, save registers, restore the registers and the stack, and return with jr $31.
Question
Who is responsible for saving and restoring registers?
Definition
The caller is a function that calls another function
Definition
The callee is a function that is being called by another function
Our current convention:
- Caller needs to save $31. Otherwise we lose the return address (to, e.g., the loader) once we make our call with jalr.
- Callee saves the registers it will modify and restores them at the end. The caller shouldn't have to worry about which registers the callee might be using. This makes sense from a programming point of view as well.
- Note that we have only used registers from $1 to $7 (and registers $4, $10, $11 are constant) as well as registers $29, $30, and $31. $30 is preserved through symmetry: every push is matched by a pop.
Question
Who should save
$29
?
Assume that we will require that the callee saves $29. Thus the callee will initialize $29 first:
g: sub $29, $30, $4
and then save registers. Is this the right order to do things in?
If we save registers first, $29 is supposed to point to the beginning of the stack frame, but $30 has already moved to store all the registers, so $29 would end up pointing somewhere in the middle of the stack frame.
Having $29
in the middle is annoying since we will need it later. We choose to make $29
the bottom.
Therefore, we do:
This callee-save approach with $29
will work.
Caller-save approach: We could have the caller save $29
This is much easier (we are going to do this).
Must be careful where everything is relative to $29
.
Procedures:
We need to store the arguments to pass to a function.
For factor → ID(expr1, ..., exprn)
, we have
For procedure → int ID(params) {dcls stmts RETURN expr;}
we have
Question
When do we save registers? Before
code(dcls)
or after?
Saving them before is strange.
Stack

Pointer | Contents | Belongs to
---|---|---
$30 → | local vars of the callee | frame of the callee
 | saved regs of the callee | frame of the callee
$29 → | args of the callee | frame of the callee
 | saved $31 | frame of the caller
 | saved $29 | frame of the caller

Symbol Table Revisited
int g(int a, int b) { int c = 0; int d; }
Symbol table for g looks like:

Symbol | Type | Offset (from $29)
---|---|---
a | int | 8
b | int | 4
c | int | ??
d | int | ??
Let’s try pushing the registers after pushing the declarations.
For procedure -> int ID(params) {dcls stmts RETURN expr;}
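A sketch (ours, not the official solution) of what the procedure's code might look like under this ordering: the caller has already pushed the arguments, the callee sets $29, pushes its declarations, and only then saves the registers it will clobber.

```cpp
#include <string>

// procedure -> INT ID(params) {dcls stmts RETURN expr;}
std::string codeProcedure(const std::string &name,
                          const std::string &codeDcls,
                          const std::string &codeStmts,
                          const std::string &codeExpr,
                          int numDcls) {
    std::string out;
    out += name + ":\n";                           // procedure label (label naming is revisited later)
    out += "sub $29, $30, $4\n";                   // frame pointer for this procedure
    out += codeDcls;                               // push local variables (offsets 0, -4, ...)
    out += "sw $5, -4($30)\nsub $30, $30, $4\n";   // save registers we will modify
    out += "sw $6, -4($30)\nsub $30, $30, $4\n";
    out += "sw $7, -4($30)\nsub $30, $30, $4\n";
    out += codeStmts;
    out += codeExpr;                               // return value ends up in $3
    out += "add $30, $30, $4\nlw $7, -4($30)\n";   // restore saved registers (reverse order)
    out += "add $30, $30, $4\nlw $6, -4($30)\n";
    out += "add $30, $30, $4\nlw $5, -4($30)\n";
    for (int i = 0; i < numDcls; ++i)              // pop the local variables
        out += "add $30, $30, $4\n";
    out += "jr $31\n";                             // caller pops args, $31, and $29
    return out;
}
```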
New stack

Pointer | Contents | Belongs to
---|---|---
$30 → | saved regs of the callee | frame of the callee
 | local vars of the callee | frame of the callee
$29 → | args of the callee | frame of the callee
 | saved $31 | frame of the caller
 | saved $29 | frame of the caller

Symbol Table Revisited Revisited
int g(int a, int b) { int c = 0; int d; }

Symbol | Type | Offset (from $29)
---|---|---
a | int | 8
b | int | 4
c | int | 0
d | int | -4
Parameters should have positive offsets. Local variables should have non-positive offsets.
The symbol table should have 4 × (the number of params) added to each entry in the table.
This complicates pushing registers, because we’re now generating some code before we preserve register values.
Does this matter for us? No, because our declarations are forced to be simple. If the language allowed complex expressions then it would matter.
Labels can introduce another annoying problem
Consider the code
Question
What is the problem here?
We already have a label called print (from the runtime). Since the runtime's print is not a WLP4 procedure, it shouldn't interfere with a WLP4 procedure that happens to share its name.
We can ban WLP4 code that uses function names that match some of our reserved labels like new
, init
, etc. This is not very future proof.
Since MIPS labels don’t have to be identical to WLP4 procedure names, we can change them.
We will prepend an 'F' to the front of labels corresponding to procedures. Then, so long as we don't create any labels beginning with 'F' for any other purpose, it should be okay.
The print function above would correspond to a label Fprint
.
Revisiting Translation
For factor → ID(expr1, ..., exprn), we have:
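A sketch (ours) of the caller's side under the final convention: the caller saves $29 and $31, pushes the arguments, makes the call, then cleans up.

```cpp
#include <string>
#include <vector>

// factor -> ID(expr1, ..., exprn); each element of codeArgs leaves its value in $3.
std::string codeCall(const std::string &procLabel,
                     const std::vector<std::string> &codeArgs) {
    std::string push3 = "sw $3, -4($30)\nsub $30, $30, $4\n";
    std::string out;
    out += "sw $29, -4($30)\nsub $30, $30, $4\n";  // caller saves its frame pointer
    out += "sw $31, -4($30)\nsub $30, $30, $4\n";  // and its return address
    for (const std::string &arg : codeArgs)
        out += arg + push3;                        // evaluate and push each argument
    out += "lis $5\n.word " + procLabel + "\njalr $5\n";
    for (std::size_t i = 0; i < codeArgs.size(); ++i)
        out += "add $30, $30, $4\n";               // pop the arguments
    out += "add $30, $30, $4\nlw $31, -4($30)\n";  // restore $31
    out += "add $30, $30, $4\nlw $29, -4($30)\n";  // restore $29
    return out;                                    // the call's result is in $3
}
```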
Complete example, with procedures:
We have finished code generation.
Compiler Optimizations
The goal of optimizations is generally to make code run faster
We want to reduce code size and make code run faster. (For bonus on A8, we are only concerned with reducing code size).
Implementing these optimizations is not easy.
Constant Folding
If we want to generate the code for :
Our code generator could produce the following code for
If we notice that each element of the expression is a constant, we can add the constants at compile time and output the code for the final value (instead of doing it at runtime).
If the expression involved a variable, we would need to know the variable's value, which we cannot determine at compile time.
Constant Propagation
Sometimes the value of a variable is known at compile time:
We can replace x with its known value, so this is equivalent to:
We can then apply Constant Folding
Since x
is not used anywhere, we can eliminate the variable declaration entirely:
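For illustration (our own example, in C-style code like WLP4), propagation followed by folding and removal of the now-unused variable:

```cpp
// Before: the value of x is known at compile time.
int before() {
    int x = 5;
    return x + 3;
}

// After propagation:  return 5 + 3;
// After folding:      return 8;
// x is no longer used, so its declaration can be dropped entirely.
int after() {
    return 8;
}
```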
Lecture 21
Constant propagation is more difficult than constant folding.
We can only apply it if we know the variable’s value does not depend on the input during the part of the program we’re processing.
Common Subexpression Elimination
Even if the value of a is not known, there is a simplification we can make when generating code for a + a.
Here is the naïve code (assuming a is at offset N from $29):
Since we're adding the same variable twice, we can just do this:
We can do the same trick with larger expressions that repeat a subexpression, e.g.:
Question
Can we apply common subexpression elimination to this code?
No, CSE must not eliminate side effects.
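An illustrative example (ours): the repeated subexpression is computed once, and the transformation is only safe when that subexpression has no side effects.

```cpp
int before(int a, int b) {
    return (a + b) * (a + b);      // a + b is computed twice
}

int after(int a, int b) {
    int t = a + b;                 // compute the common subexpression once
    return t * t;
}

// Not safe: if f() prints or modifies state, f() + f() must still call f() twice,
// so common subexpression elimination cannot be applied to it.
```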
Dead Code Elimination
Sometimes the compiler can determine that certain code will never execute, and can eliminate this code
The code inside the innermost if can be ignored.
Deleting this code has a size benefit, but no real performance benefit.
Dead code elimination interacts with other optimizations.
Normally we can't apply constant propagation to the variable in the return.
After the dead branch is removed, constant propagation can be used on it as well.
DCE can allow constant propagation to occur. Conversely, constant propagation can allow the compiler to prove code is dead.
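An illustrative example (ours) of that interaction:

```cpp
int before(int n) {
    int x = 0;
    if (0 == 1) {          // condition is provably false
        x = n;             // so this assignment is dead code
    }
    return x + 1;          // with the branch present, x is not obviously constant here
}

// After dead code elimination removes the branch, x is 0 on every path,
// so constant propagation and folding turn the return into a constant:
int after(int n) {
    return 1;
}
```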
Register Allocation
We ran into the issue that for sufficiently complicated code, it is not possible to store all values in registers.
Our solution was to put everything on the stack because this makes generating code simpler and more consistent.
But using registers for storage is much faster than using RAM.
Real-world compilers try to use registers as much as possible.
A variable is live if the current value of the variable will be used at a later point in the program.
A variable should be in a register if and only if it is live.
If too many variables or values are live at the same time, we have to choose which ones to put in RAM vs registers.
- becomes live on line 1, and is last used on line 5
- becomes live on line 2, and is last used on line 6
- becomes live on line 4, and is last used on line 8
On lines 4 to 5, all three variables are live. If we only had two registers available, we would need to put one variable in RAM.
We can use live ranges to construct a graph indicating which ranges overlap, and use graph colouring algorithms to allocate registers (see MATH239).
If the live range graph can be coloured with k colours, where k is the number of available registers, we can allocate all variables to registers.
Graph colouring can be slow (NP-complete problem), so it is usually approximated.
If the address-of operator is used on a variable then this variable must go in RAM.
Significant gains are possible just by implementing a basic register allocator. Optimizations to eliminate pushes/pops or decrease the number of instructions for a push/pop are effective on A8.
A heuristic can be allocating variables and temporaries to registers on a “first-come, first-served” basis.
In this case we’d need to modify the offset table so that there are two kinds of variable locations: offsets from the frame pointer, or registers.
Allocate non-parameter local variables in registers whenever possible.
Strength Reduction
This optimization involves replacing costly operations with equivalent faster operations. For example, multiplication is slower than addition.
- For example, a * 2 can be replaced with a + a, which can then be optimized further using common subexpression elimination.
A more complex version involves optimizing loops which perform expensive operations involving the loop counter.
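Illustrative examples (ours) of both forms:

```cpp
// Simple form: multiplication by a small constant becomes addition.
int twice_before(int a) { return a * 2; }
int twice_after(int a)  { return a + a; }

// Loop form: the multiplication involving the loop counter is replaced by a
// running value that is updated with a cheaper addition each iteration.
int sum_before(int n) {
    int total = 0;
    for (int i = 0; i < n; i = i + 1) {
        total = total + i * 4;      // multiply every iteration
    }
    return total;
}

int sum_after(int n) {
    int total = 0;
    int i4 = 0;                     // tracks i * 4
    for (int i = 0; i < n; i = i + 1) {
        total = total + i4;
        i4 = i4 + 4;                // addition instead of multiplication
    }
    return total;
}
```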
Peephole Optimization
This optimization happens after code generation is finished. This is used in LLVM.
Instead of directly outputting the generated code, the code is placed in a data structure and subject to further analysis. The analysis tries to find sequences of instructions that can be replaced with simpler sequences.
For example
A peephole optimization could change this to add $7, $1, $0
. This might be easier than making the code generation itself smarter.
We use a sliding window.
Inlining Functions
We replace a function call with the body of the function itself
This is equivalent to:
This removes the overhead of a function call.
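An illustrative example (ours):

```cpp
int square(int x) { return x * x; }

int caller_before(int a) {
    return square(a + 1);           // call overhead: save registers, jump, return
}

int caller_after(int a) {
    int t = a + 1;                  // the call is replaced by the body of square
    return t * t;
}
```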
Tail Recursion
A recursive function call is in tail position if it is the last thing the function executes before returning.
In this case, what happens normally is:
- Recursive call happens, pushes local variables to the stack
- Recursive call finishes, pops from the stack, returns
- Original call finishes, pops from the stack, returns
Tail call optimization is based on the observation that in this situation, the recursive call can reuse the stack frame of the original call instead of pushing its own stack frame. This saves a lot of space.
Original call pops the reused stack frame.
The recursive call can be replaced by a jump back to the top of the function, i.e., a loop.
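An illustrative example (ours) of a call in tail position and the loop it can become:

```cpp
// The recursive call is the last thing sum does, so its frame can be reused.
int sum(int n, int acc) {
    if (n == 0) return acc;
    return sum(n - 1, acc + n);     // tail call
}

// Equivalent loop: the "call" just updates the parameters and jumps back.
int sum_loop(int n, int acc) {
    while (n != 0) {
        acc = acc + n;
        n = n - 1;
    }
    return acc;
}
```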
Lecture 22
Why do we split programs into multiple files?
Modularity, team development, faster build time
Question
How do we resolve situations where we have labels in different files?
One option is to cat all such files together and then assemble. Problems:
- Duplicate labels defined in different files
- Accidental use of labels which should be private
Can we assemble files first and then cat?
Almost. Only one piece of code can be at 0x0
at a time. These assembled files need to be MERL files, not just MIPS.
Concatenating two MERL files does not give a valid MERL file.
We still haven't resolved the issue of labels in different files.
When we encounter a .word
where the label is not in the file, we need to use a placeholder (an arbitrary value), and indicate that we cannot run this program until the value of the label is given.
You cannot run a.asm
without linking with b.asm
.
We need to extend our MERL format to indicate that a file cannot be run until it is linked with other files.
But sometimes we make typos. Consider
Question
Did we make a mistake? Did we mean
.word banana
, or did we mean for bananana to be provided by another MERL file?
How do we recognize such errors? Without any other changes, our assembler would have to assume that a label bananana exists in some other file and accept it.
.import id
is the directive that tells the assembler which symbols to link in. This will not assemble to a word of MIPS. Errors occur if the label id is not in the current file and there is no .import id
in the file.
We need to add entries in the MERL symbol table. Previously we used the code 0x1
for relocation entries, but this isn’t a relocation entry.
New format code: 0x11
for External Symbol Reference (ESR).
Question
What needs to be in an ESR entry?
- Where the symbol is being used
- The name of said symbol
Format:
Question
What if labels are duplicated?
Suppose we have c.asm
along with our two other files that has:
We want this label not to be exported; it should stay private to its own file.
.export label
will make label
available for linking with other files. As with .import
, it does not translate to a word in MIPS. It tells the assembler to make an entry in the MERL symbol table.
The assembler makes an ESD, or an External Symbol Definition, for these types of words. It follows this format:
Our linker now has everything it needs to do its job.
Linking Algorithm
Linking Example
Linked code (after a lot of work)
Lecture 23
Heap Management
In this course we have a library to deal with all of the memory management features, including init
, new
, and delete
.
This allows data to exist in memory beyond the current scope, that is, outside the boundaries of your stack frame.
Question
How do we manage this memory?
The memory not on the stack is either in code, or on the heap. The init
procedure initializes a heap for us to use.
This is much more problematic than a stack to take care of. Stacks are nice and ordered. Calls can be made to heaps using delete
or new
in arbitrary orders, so we can’t simply push and pop memory.
In our world:
Code |
---|
Read-only data |
Global data |
Heap (grows toward the stack) |
Stack (grows toward the heap) |
Code is just data. You can store data in your code. In C, it’s common to separate out static, global information from code, but it’s all just the stuff before the dynamic heap.
Question
How does a heap work?
We have a variety of implementations.
Example 1: No reclamation of Memory and Fixed Blocks
After init
, we get two pointers, one to the start of memory on the heap and one at the end.
Initialization is O(1). Allocation is also O(1). We never delete.
Clearly not the best choice. We will run out of memory quickly since we aren’t reusing reclaimed memory.
Example 2: Explicit Reclamation and Linked List of Fixed-Size Blocks
Keep the fixed size idea the same, but keep track of a free list (linked list of free memory blocks) and we can allocate from this linked list.
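A minimal sketch (ours) of this idea for fixed-size blocks: freed blocks are threaded onto a singly linked free list, and allocation pops from it. The sizes and names here are illustrative only.

```cpp
#include <cstddef>

struct FreeNode { FreeNode *next; };          // stored inside the free block itself

const std::size_t BLOCK_SIZE = 16;            // fixed block size (>= sizeof(FreeNode))
alignas(8) char heap[BLOCK_SIZE * 64];        // a toy heap of 64 blocks
FreeNode *freeList = nullptr;

void initHeap() {
    // Thread every block onto the free list.
    for (std::size_t i = 0; i < 64; ++i) {
        FreeNode *n = reinterpret_cast<FreeNode *>(heap + i * BLOCK_SIZE);
        n->next = freeList;
        freeList = n;
    }
}

void *allocBlock() {
    if (freeList == nullptr) return nullptr;  // out of memory
    FreeNode *n = freeList;
    freeList = freeList->next;                // pop the first free block
    return n;
}

void freeBlock(void *p) {
    FreeNode *n = reinterpret_cast<FreeNode *>(p);
    n->next = freeList;                       // push the block back onto the list
    freeList = n;
}
```

Note that the free list itself lives inside the free blocks, which is exactly the trick discussed later for the variable-sized case.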
Example 3: Variable-Sized Blocks
We once again used a linked list but here our linked list will store a number of bytes and the next node.
Init: Start with the entire heap being free.
Let’s say we want to allocate 50 bytes. What we will do is allocate 54 bytes.
The first 4 bytes are the size of the block (integer), and the rest is the requested bytes. We need this bookkeeping because delete doesn’t take a size. We return a pointer to the start of the 50 bytes.
Memory:
54 | start 50 | end 50 |
---|
Free List is one node with and (since we started with bytes at address ).
Next, we allocate 28 bytes. We allocated 32 bytes and return a pointer to the start of the 28 bytes.
54 | 32 | end 28 |
---|
Free list is one node with and
Freeing the 50 bytes results in
32 |
---|
Free list is pointing to .
Freeing the other 32 bytes results in a free list of pointing to pointing to
We can do some consolidation. Notice that and so the first two nodes in our linked list can collapse to a single node.
We can further consolidate since .
The biggest issue with this approach is fragmentation. Suppose we have 48 bytes and we make the following calls:
- Allocate
- Allocate
- Allocate
- Free
- Allocate
We can’t allocate , despite having bytes free.
We've been showing the free list as a separate data structure, but we're the ones allocating data structures. So how do we allocate space for the free list itself?
The free list goes in the space itself. It is free, so the user doesn’t care what we put there.
Lecture 24
Dealing with Fragmentation
Heuristics:
- First fit: Put the block in the first free spot that is large enough
- Best fit: Put the block in an exact match (or as close to it as possible, so there is less waste)
- Worst fit: Exact match if possible; otherwise put the block in the largest available space
Problem: Best- and worst-fit involve looking over the whole free list (so they are slow).
Other ideas include dmalloc
and the binary buddy system.
Binary Buddy System
Start with 512 bytes of heap memory.
Suppose we try to allocate some bytes; we need an extra word for bookkeeping. Say the total fits in a block of size 32 (the smallest sufficient power of two). We split memory until we obtain such a block and reserve the entire block.
256 | 256 |
---|
128 | 128 | 256 |
---|
64 | 64 | 128 | 256 |
---|
32 | 32 | 64 | 128 | 256 |
---|
Binary buddy still creates fragmentation, just of a different sort. If most allocations are close to nice powers of two, very little space is wasted.
Both the size and location of any block can be encoded using a list of “left” and “right” actions.
Binary Buddy Code
Start with a 1. The code for the whole of the heap is just 1.
For each block division, append a 0 if the left block is selected, or a 1 if the right block is selected.
Since the leading bit is always 1, we can tell the length of the code by scanning for the first 1.
Since each appended bit halves the block size, even very short codes pin down both the size and the location of a block.
One word is plenty for a code.
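A sketch (ours) of decoding such a code word, assuming a 512-byte heap starting at offset 0, a leading 1 bit, and 0/1 meaning left/right at each split:

```cpp
#include <cstdio>

const unsigned HEAP_SIZE = 512;

// Given a buddy code (e.g. 0b1 = whole heap, 0b10 = left half),
// recover the block's size and starting offset within the heap.
void decode(unsigned code, unsigned *size, unsigned *offset) {
    if (code == 0) return;                   // codes always have a leading 1
    int top = 31;
    while (((code >> top) & 1u) == 0) --top; // find the leading 1
    *size = HEAP_SIZE;
    *offset = 0;
    for (int bit = top - 1; bit >= 0; --bit) {
        *size /= 2;                          // each split halves the block
        if ((code >> bit) & 1u)              // 1 = take the right half
            *offset += *size;
    }
}

int main() {
    unsigned size, offset;
    decode(0b110, &size, &offset);           // right half, then its left half
    std::printf("size=%u offset=%u\n", size, offset);   // prints: size=128 offset=256
    return 0;
}
```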
Some languages take care of deallocation.
Second example
In order for automatic memory management to happen, the compiler and the allocator need to coordinate in some way.
At the minimum, the compiler needs to tell the allocator when something goes out of scope.
Technique 1: Reference Counting
For each heap block, keep track of the number of pointers that point to it.
Must watch every pointer and update reference counts each time a pointer is reassigned. The compiler needs to call some procedure provided by the allocator every time a pointer goes into or out of scope.
If a block's reference count reaches 0, reclaim it.
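A sketch (ours) of the bookkeeping; the compiler arranges for incRef/decRef to be called whenever a pointer is copied, reassigned, or goes out of scope.

```cpp
#include <cstdlib>

struct Block {
    int refCount;
    // ... the user's data follows ...
};

Block *newBlock() {
    Block *b = static_cast<Block *>(std::malloc(sizeof(Block)));
    b->refCount = 1;                // the creating pointer holds the first reference
    return b;
}

void incRef(Block *b) {
    if (b) b->refCount++;
}

void decRef(Block *b) {
    if (b && --b->refCount == 0)
        std::free(b);               // no more pointers to this block: reclaim it
}

// A pointer assignment p = q becomes, conceptually: incRef(q); decRef(p); p = q;
// Cycles defeat this scheme: two blocks pointing at each other keep each
// other's count at 1 even when nothing else can reach them.
```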
Question
What issues are there to this?
If a block points to another block and vice versa, but nothing else points to either, then the cluster is unreachable and should be reclaimed. But each block still holds a reference to the other, so neither count ever reaches 0.
Question
If reference counting is so bad, why do people use it?
It’s very straightforward to implement, so it’s an easy way to get some automatic memory management easily.
But it’s expensive and flawed.
Question
Is it possible to fix this issue with reference counting?
Short answer: no. Long answer: what if we delayed counting references, so that we would never get these problematic clusters in the first place?
The remaining techniques are classed as garbage collection.
Technique 2: Mark and Sweep
Scan the entire stack (and global variables) and search for pointers. Mark the heap blocks that have pointers to them. If these contain pointers, continue following.
Then scan the heap, reclaim any blocks that aren’t marked. Boils down to a graph traversal problem.
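A sketch (ours) of that traversal, assuming the allocator keeps a list of all heap objects and knows where the pointer fields are — which is exactly the information the compiler must supply, as noted below.

```cpp
#include <vector>

struct Object {
    bool marked = false;
    std::vector<Object *> fields;    // the pointer fields inside this object
};

std::vector<Object *> allObjects;    // every block the allocator has handed out

// Mark phase: follow pointers starting from the roots (stack and globals).
void mark(Object *o) {
    if (o == nullptr || o->marked) return;
    o->marked = true;
    for (Object *f : o->fields) mark(f);   // keep following pointers
}

// Sweep phase: anything unmarked is unreachable and can be reclaimed.
void sweep() {
    std::vector<Object *> live;
    for (Object *o : allObjects) {
        if (o->marked) { o->marked = false; live.push_back(o); }
        else delete o;
    }
    allObjects = live;
}

void collect(const std::vector<Object *> &roots) {
    for (Object *r : roots) mark(r);
    sweep();
}
```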
The compiler needs to tell the allocator
- Where the pointers are in the stack
- Where pointers are in the heap
This means that the compiler and allocator need to agree on both stack layout and object layout.
This is a much greater degree of co-design than is needed for reference counting (so mark and sweep is rarely bolted onto a language post-hoc).
Technique 3: Copying Collector
Split the heap into two halves, say A and B.
Allocate memory in A. Perform a mark-and-copy: mark by copying live objects from A into B.
After the copy, B has all living objects stored contiguously.
Once finished copying, begin allocating in B (the halves reverse roles).
Three major benefits
- Allocation is always extremely fast (we just fill the current half in order, bumping a pointer)
- The “sweep” phase is free
- We have no fragmentation in memory
If most objects aren't long-lived, the copy phase doesn't actually have much work to do.
Since copying an object from one half to the other moves it, the compiler needs to be prepared for the allocator to modify live memory (pointers must be updated to the new locations).
In practice, most objects are short-lived (good for copy), but those objects that aren’t short lived are nearly immortal.
Most practical garbage collectors are generational collectors: they use copying for the young generation, but manage the old generation with mark-and-sweep.
With generational, we benefit from the free sweep for young objects, and the lack of copying for old objects.