Professor: Semih Salihoğlu | Term: Winter 2025
Lecture 1
Simplified Software Stack of Data-Intensive App
flowchart TB
APP <--> DBMS
DBMS <--> OS
OS <--> id[(Disks)]
Question
Why do app developers need a DBMS?
Productivity and safety
Productivity related features:
- Logical (High-level) Data Model
Example enterprise: Amazon
Apps: mobile checkout; product & shipment tracking & logging; manager dashboards for analytics
Amazon needs to record
- Customers: names, addresses, ages
- Product: names, descriptions, prices
- Orders: who, what, when
Naive approach: use the file system directly.
This has many problems (Physical Record Design):
Question
How to serialize records in the files?
Example:
Alice, 18
Bob, 21
Option 1: Special char
Alice, 18 \n
Bob, 21 \n
Could do Huffman encoding
Option 2: Variable-length encoding
5 Alice 18 3 Bob 21 ...
Option 3: Column-Oriented
Alice | Bob |
---|---|
18 | 21 |
Allows for better compression (easier to compress data of the same type)
This stuff has nothing to do with the app logic. Many design options, difficult for developers.
Logical (High-level) Data Model: Instead of interacting with data as bits and bytes allows you to interact with data using a higher-level logical data structure.
Second Productivity Benefit: High-level Query Language
Question
App: Who are the top paying customers?
order.txt
oID | cID | pID |
---|---|---|
01 | c1 | BookA |
02 | c2 | WatchA |
03 | c1 | BookB |
cust.txt
cID | name | age |
---|---|---|
c1 | Alice | 18 |
c2 | Bob | 21 |
product.txt
pID | price |
---|---|
BookA | 20 |
WatchA | 50 |
Need an algorithm to find the top-paying customers. One option is a sequential nested-loop join (over order and product) followed by a hash-map aggregate.
Another option: build a hash map from product to price first, then a hash map accumulating each customer's total sum. Slightly better than the previous option since each file is not opened multiple times.
This is still too low-level and detached from the app logic; it is not the right abstraction for accessing the data.
Now assume the previous files are now tables.
SELECT cID, sum(price) as totalSum
FROM orders, products
WHERE orders.pID = products.pID
GROUP BY cID
ORDER BY totalSum DESC
LIMIT 10
Much more high-level.
Safety Features
- Integrity constraints protect against data corruption due to application bugs (e.g. primary keys)
- Concurrent access to data
- Protection against hardware failure
Information Systems
- Databases: "dumb", but very fast, scalable, easy to use
- Symbolic AI: intelligent
Tldr
DBMSs abstract away a lot of hard work that has nothing to do with the specific apps themselves.
Lecture 2
Data Model: A formalism (notation) to describe 3 aspects of a database.
- Structure
- Constraints
- Operations
First Data Model in History: Network Model (Charles Bachman)
- Structure
Records (nodes) with fields of given data types, each record storing logically related information together.
UserRecord
uid | name | age | pop |
---|---|---|---|
142 | Bart | 10 | 0.9 |
123 | Milhouse | 10 | 0.2 |
Group
gid | name |
---|---|
abc | A Book Club |
dps | Dead Putting Society |
Member
uid | gid |
---|---|
142 | dps |
123 | abc |
Constraint: Keys: each record needs a field that identifies it.
Operations:
getRecordByKey(147, UserRecord)
record.getNext(members) --> record
Criticisms:
- Tuple at a time
- Physical dependence
Databases should be modelled to follow set theory.
Relational Model
Structure: A relational database consists of a set of tables/relations, where columns have names called attributes. Attributes have values from a particular domain.
Attribute names within a relation are unique; relations are sets, so columns are unique.
No duplicate tuples: two rows in a relation in the relational model cannot be the same, where "same" means agreeing on every attribute.
Schema/Metadata
- Specifies the logical structure of the data
- Attributes & attributes types/domains
Instance
- Represents the actual data content
- Changes rapidly, but conforms to the schema.
Constraints (external to the database; they are application-specific assumptions): some constraints are very common and have been given names; others can be defined.
Tuple-level (domain restrictions; i.e. age cannot be negative) (not the focus)
Key constraints: By definition no two tuples in a relation can be the same on every attribute.
Definition
Super Key: A set of attributes on which no two tuples can agree.
Definition
Candidate Key: A minimal super key.
Definition
Primary Key: One of the candidate keys that the database designer identifies.
Example
Expedia Flights
Name | tripID | tripStartDateTime |
---|---|---|
Alice | 123 | 01/01 12:00pm |
Bob | 123 | 01/01 12:00pm |
Carl | 456 | 02/02 4pm |
Alice | 789 | 7/15 5pm |
We are assuming that names are unique. (Alice is the same Alice in both instances)
Name is not a key: the same person can book multiple trips, so a name repeats across rows (Alice appears twice).
tripID is not a key: multiple people booked on the same trip share the same tripID.
tripStartDateTime is obviously not a key.
(Name, tripID) is a super key: a person appears at most once per trip.
(Name, tripStartDateTime) is a super key: a person cannot be on multiple flights at the same time, so the pair is unique.
Both are also candidate keys (since they are minimal).
(Name, tripID, tripStartDateTime) is a super key but not a candidate key (since it is not minimal).
Primary Key has to be a candidate key.
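In SQL (a sketch using the Expedia example; the table name and column types are assumptions), the designer declares the chosen candidate key with PRIMARY KEY:
CREATE TABLE FlightBookings (
  name              varchar(30),
  tripID            INT,
  tripStartDateTime TIMESTAMP,
  PRIMARY KEY (name, tripID)  -- one of the two candidate keys
);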
Definition
Foreign Key Constraint: a specific instance of "referential constraints".
An attribute of relation $R$ references the primary key of relation $S$.
That is, the primary key of one relation appears as an attribute of another relation.
Users
uid | name | age |
---|---|---|
142 | Bart | 10 |
123 | Millhouse | 10 |
857 | Joe | 8 |
Members
uid | gid |
---|---|
123 | gov |
857 | abc |
Group
gid | name |
---|---|
gov | Student Council |
abc | Book Club |
Every gid that appears in Members must appear in Group; every uid that appears in Members must appear in Users.
uid and gid in Members are both foreign keys.
Operations: Relational Algebra
Operators take relations (sets of tuples) as input:
- Standard set operations: union ($\cup$), difference ($-$)
- Operators that remove parts of a relation: selection, projection
- Relation-widening operators: Cartesian product, renaming
Core Operator 1: Selection/Filter: $\sigma_p(R)$. The predicate $p$ can contain constants, attributes of $R$, and comparisons.
For example, $\sigma_{age < 10}(\text{Users})$ results in the following:
uid | name | age |
---|---|---|
857 | Joe | 8 |
$\sigma_{name = 'Bart'}(\text{Users})$ results in:
uid | name | age |
---|---|---|
142 | Bart | 10 |
Core Operator 2: Projection: $\pi_L(R)$. The output keeps only the columns of $R$ that appear in $L$.
$L$ is the projection list: a set of attributes that will be in the output.
$\pi_{uid, name}(\text{Users})$ results in:
uid | name |
---|---|
142 | Bart |
123 | Millhouse |
857 | Joe |
$\pi_{age}(\text{Users})$ results in:
age |
---|
10 |
8 |
Duplicate rows are removed (by definition).
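A SQL analogue (a sketch): SQL tables are bags, so matching projection's set semantics requires DISTINCT:
SELECT DISTINCT age FROM Users;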
Core Operator 3: Cartesian Product: $R \times S$.
The schema of $R \times S$ contains every attribute in $R$ and every attribute in $S$.
If an attribute $A$ appears both in $R$ and $S$, we rename the copies as $R.A$ and $S.A$ in the output.
Example
Users $\times$ Members
Users.uid | name | age | Members.uid | gid |
---|---|---|---|---|
142 | Bart | 10 | 123 | gov |
142 | Bart | 10 | 857 | abc |
123 | Millhouse | 10 | 123 | gov |
123 | Millhouse | 10 | 857 | abc |
857 | Joe | 8 | 123 | gov |
857 | Joe | 8 | 857 | abc |
There should be $3 \times 2 = 6$ rows since we are taking the Cartesian product (each row paired with each row).
Derived Operator 1: Theta Join: $R \bowtie_p S = \sigma_p(R \times S)$.
Take the Cartesian product but keep only the tuples that satisfy $p$.
Derived Operator 2: Natural Join: $R \bowtie S$.
Join $R$ and $S$ on equality of their common attributes and then keep only one copy of each common attribute.
Users $\bowtie$ Members
User.uid | name | age | Members.uid | gid |
---|---|---|---|---|
123 | Millhouse | 10 | 123 | gov |
857 | Joe | 8 | 857 | abc |
We then remove one of the duplicate columns; the remaining uid no longer needs disambiguation.
name | age | uid | gid |
---|---|---|---|
Millhouse | 10 | 123 | gov |
Joe | 8 | 857 | abc |
This operator is shorthand for $\pi_L(R \bowtie_p S)$, where $p$ equates each pair of columns common to $R$ and $S$, and $L$ is the union of the column names from $R$ and $S$ (with the duplicate common columns dropped).
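A SQL sketch of the theta-join definition, spelled out as a filter over a Cartesian product:
SELECT *
FROM Users, Members              -- Cartesian product
WHERE Users.uid = Members.uid;   -- the join predicate p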
Lecture 3
Core Operator 4: Union
Input: two tables $R$ and $S$.
Notation: $R \cup S$. $R$ and $S$ must have identical schemas.
Output:
- Same schema as $R$ and $S$.
- Contains all rows in $R$ and all rows in $S$ (duplicate rows removed)
Core Operator 5: Difference
Notation: $R - S$. $R$ and $S$ must have identical schemas.
Output:
- Has the same schema as $R$ and $S$
- Contains all rows in $R$ that are not in $S$
Derived Operator 3: Intersection
Notation: $R \cap S$. $R$ and $S$ must have identical schemas.
Output:
- Has the same schema as $R$ and $S$
- Contains all rows that are in both $R$ and $S$
Shorthand for $R - (R - S)$: $R - S$ finds the tuples in $R$ not in $S$, and removing those tuples from $R$ leaves exactly the tuples in both.
This is also equivalent to $R \bowtie S$ (a natural join over identical schemas) and to $S - (S - R)$.
Core Operator 6: Renaming
Input: A table $R$ (or an expression)
Notation: $\rho_S(R)$, or $\rho_{S(A_1, \ldots, A_n)}(R)$
Purpose: "Rename" a table and/or its columns
Output: A table with the same rows as $R$, but named differently.
Member
uid | gid |
---|---|
123 | gov |
857 | abc |
M1 = $\rho_{M1}$(Member)
uid | gid |
---|---|
123 | gov |
857 | abc |
This does not modify the database (it is a renamed table as a copy of the original).
Example
Renaming: IDs of users who belong to at least two groups: $\pi_{M1.uid}(\sigma_{M1.uid = M2.uid \wedge M1.gid \neq M2.gid}(\rho_{M1}(\text{Member}) \times \rho_{M2}(\text{Member})))$
Example
Find the names of Users whose ages are greater than the age of the User with ID 857.
Users
uid | name | age | pop |
---|---|---|---|
102 | Bert | 10 | 0.8 |
123 | Millhouse | 10 | 0.8 |
857 | Lisa | 8 | 0.3 |
Members
uid | gid |
---|---|
102 | gov |
123 | abc |
857 | gov |
Groups
gid | name |
---|---|
gov | Student Gov |
abc | Book Club |
$\pi_{u1.name}(\sigma_{u1.age > u2.age \wedge u2.uid = 857}(\rho_{u1}(\text{Users}) \times \rho_{u2}(\text{Users})))$
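The same query in SQL (a sketch; u1 and u2 are arbitrary alias names):
SELECT u1.name
FROM Users u1, Users u2
WHERE u2.uid = 857 AND u1.age > u2.age;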
Key Constraints
Example
$R(A, B)$: suppose $A$ is a key.
In relational algebra: $\sigma_{R1.A = R2.A \wedge R1.B \neq R2.B}(\rho_{R1}(R) \times \rho_{R2}(R)) = \emptyset$
Example
Additional Constraints
Tuple-level constraints: the age of Users cannot be negative:
$\sigma_{age < 0}(\text{Users}) = \emptyset$
Relation-level constraints: key constraints, e.g. uid should be unique in the Users relation.
DB-level constraints: foreign key constraints.
- Another example: a user cannot be a member of 3 groups or more.
Name | Notation | Core/Derived | Monotone |
---|---|---|---|
Selection | $\sigma_p(R)$ | Core | Yes |
Projection | $\pi_L(R)$ | Core | Yes |
Cartesian product | $R \times S$ | Core | Yes |
Join, Natural Join | $R \bowtie_p S$, $R \bowtie S$ | Derived | Yes |
Union | $R \cup S$ | Core | Yes |
Intersection | $R \cap S$ | Derived | Yes |
Difference | $R - S$ | Core | Monotone w.r.t. $R$, non-monotone w.r.t. $S$ |
Example
IDs of groups Lisa belongs to: $\pi_{gid}(\sigma_{name='Lisa'}(\text{Users}) \bowtie \text{Members})$
Example
IDs of groups that Lisa does not belong to: $\pi_{gid}(\text{Groups}) - \pi_{gid}(\sigma_{name='Lisa'}(\text{Users}) \bowtie \text{Members})$
Example
Find the User(s) with pop $\geq$ all other users: $\text{Users} - \pi_{u1.*}(\sigma_{u1.pop < u2.pop}(\rho_{u1}(\text{Users}) \times \rho_{u2}(\text{Users})))$, where $\pi_{u1.*}$ projects u1's attributes.
Definition
Monotone Operators: An operator is monotone if inserting tuples into its operands cannot remove tuples from its output.
Any combination of monotone operations gives a monotone relational algebra expression.
If some old output rows may become invalid the operator is non-monotone.
The difference operator is non-monotone. Take $R - S$: if we add something to $S$, it may change the output of $R - S$ (removing a tuple).
Question
Why is the difference operator needed for "highest"?
Composition of monotone operators produces a monotone query.
- Old output rows remain correct when more rows are added to the input
The highest query is not monotone. The current highest becomes invalidated when the new highest joins.
Relational Calculus (Optional)
Relational Algebra: Procedural language
- Queries are expressed by applying a sequence of operations to relations
Relational Calculus: Declarative language
- A logical formalism in which queries are expressed as formulas of first-order logic. CS245
Theorem
Codd’s Theorem: Relational Algebra and Relational Calculus are essentially equivalent in terms of expressive power.
Example
Use first-order logic formulae to specify properties of the query answer. Who are the most popular? E.g. $\{u \mid u \in \text{Users} \wedge \neg\exists u' \in \text{Users}: u'.pop > u.pop\}$
Relational algebra is equivalent to safe relational calculus. A query is safe if, for all database instances conforming to the schema, the query result can be computed using only constants appearing in the database instance or the query itself.
Relational algebra does not have recursion.
SQL: Structured Query Language (supported by most DBMS)
Data-definition language (DDL): Define/modify schemas, delete relations.
Data-manipulation language (DML): Query information, and insert/delete/modify tuples
Lecture 4
DDL:
CREATE TABLE Users(uid INT, name varchar(30), age INT, pop FLOAT);
DROP TABLE Users;
DML: SFW statements (also called SPJ)
SELECT A_1, A_2, \ldots, A_n
FROM R_1, R_2, \ldots, R_m
WHERE condition;
Corresponds (but is not equivalent) to $\pi_{A_1, \ldots, A_n}(\sigma_{condition}(R_1 \times \cdots \times R_m))$.
Example
Names of Users, younger than 18
SELECT name
FROM Users
WHERE age < 18
Example
What year was Lisa born in?
SELECT (2025 - age)
FROM Users
WHERE name = 'Lisa'
Example
Join: IDs of groups Lisa belongs to?
SELECT gid
FROM Users, Members
WHERE name = 'Lisa' AND Users.uid = Members.uid
Example
Aliasing: Pairs of users who belong to at least one common group.
Aliasing applies to both expressions in SELECT and tables in FROM.
SELECT DISTINCT m1.uid, m2.uid
FROM Members AS m1, Members AS m2 -- cartesian product
WHERE m1.gid = m2.gid and m1.uid != m2.uid
This does not account for “duplicate” tuples (123, 857) and (857, 123).
SELECT DISTINCT m1.uid, m2.uid
FROM Members AS m1, Members AS m2
WHERE m1.gid = m2.gid and m1.uid < m2.uid
SQL relations are by default bags and not sets, i.e. they can contain duplicate tuples. To remove duplicates, we use SELECT DISTINCT.
The AS keyword is optional; the following is equivalent.
SELECT DISTINCT m1.uid, m2.uid
FROM Members m1, Members m2
WHERE m1.gid = m2.gid and m1.uid < m2.uid
LIKE for string pattern matching:
SELECT * -- shortcut for all columns
FROM Users
WHERE name LIKE '%l%' -- % matches 0 or more arbitrary chars
WHERE name LIKE '%l' -- names that end with l
WHERE name LIKE 'l%' -- names that start with l
WHERE name LIKE '_l%' -- _ matches exactly one arbitrary char
-- all names whose second letter is l
Set Operations: By default SQL assumes bags semantics. i.e. base relations and outputs of operations return bags.
For Set operations: there are two versions.
UNION, INTERSECT, EXCEPT
These adopt set semantics.
UNION ALL, INTERSECT ALL, EXCEPT ALL
These assume bag semantics.
Fruits1
name |
---|
orange |
apple |
apple |
apple |
grape |
Fruits2
name |
---|
orange |
apple |
apple |
((SELECT * FROM Fruits1)
UNION ALL
(SELECT * FROM Fruits2))
The output is:
name |
---|
orange |
orange |
apple |
apple |
apple |
apple |
apple |
grape |
grape |
((SELECT * FROM Fruits1)
UNION
(SELECT * FROM Fruits2))
Output:
name |
---|
orange |
apple |
grape |
INTERSECT ALL
of $R$ and $S$: for each tuple $t$, $t$ appears in the output $\min(\text{multiplicity of } t \text{ in } R,\ \text{multiplicity of } t \text{ in } S)$ times.
INTERSECT
of $R$ and $S$: turn $R$ and $S$ into sets first, then apply set intersection.
((SELECT * FROM Fruits1)
INTERSECT ALL
(SELECT * FROM Fruits2))
Output:
name |
---|
orange |
apple |
apple |
((SELECT * FROM Fruits1)
EXCEPT ALL
(SELECT * FROM Fruits2))
Tuples that appear in Fruits1 but not Fruits2: multiplicities are subtracted (and never go negative).
name |
---|
grape |
apple |
EXCEPT ALL
of $R$ and $S$: each tuple $t$ appears in the output $\max(\text{multiplicity of } t \text{ in } R - \text{multiplicity of } t \text{ in } S,\ 0)$ times.
EXCEPT
of $R$ and $S$: turn $R$ and $S$ into sets first, then take the set difference.
Subqueries/Query Composability
In relational algebra and SQL, outputs of queries are relations, so they can be used as inputs to other queries.
Pokes
uid1 | uid2 | timestamp |
---|---|---|
123 | 857 | Monday 12:00 |
857 | 457 | Tuesday |
857 | 123 | Friday |
Table Subqueries:
Example
Names of users who have poked others more than they were poked themselves
SELECT DISTINCT name
FROM Users, ((SELECT uid1 FROM Pokes)
             EXCEPT ALL
             (SELECT uid2 FROM Pokes)) AS T
WHERE Users.uid = T.uid1
Could also make the subquery DISTINCT, but this version applies the final DISTINCT instead.
Scalar Subqueries
If a subquery returns a single value, i.e. a single row with a single column, then it can be used as a literal value in any expression (e.g. in WHERE, SELECT, etc.).
SELECT *
FROM Users
WHERE age = (SELECT age
FROM Users
WHERE uid = 123);
Select all users whose age is the same as the user with uid 123.
IN Subqueries
If a subquery returns a single column, it can be used in a boolean expression using IN.
SELECT *
FROM Users
WHERE age IN (SELECT age
FROM Users
WHERE name = 'Alice');
”Return me all the users whose ages are the same as some Alice.”
SELECT *
FROM Users
WHERE age NOT IN (SELECT age
FROM Users
WHERE name = 'Alice');
Exists Subqueries
Check whether a subquery result is empty or not; can be used in boolean expressions.
SELECT *
FROM Users
WHERE EXISTS (SELECT *
FROM Members
WHERE Members.uid = Users.uid)
All users who are members of some group (the query returns all attributes of each such user).
Quantified Subqueries
Universal quantifier ($\forall$): ALL
SELECT *
FROM Users
WHERE pop >= ALL(SELECT pop FROM Users);
Returns the users whose pop is $\geq$ every pop value in the subquery result, i.e. the most popular user(s).
Existential quantifier ($\exists$): ANY
SELECT *
FROM Users
WHERE NOT (pop < ANY(SELECT pop FROM Users));
Lecture 5
Revisit SFW Semantics
SELECT [DISTINCT] E1, E2, ..., En
FROM R1, R2, ..., Rm
WHERE <condition>
For each tuple $t$ in $R_1 \times \cdots \times R_m$: if the condition is true on $t$, evaluate each $E_i$ (each of which returns a scalar value) and output one tuple.
SELECT name, EXISTS (SELECT age FROM Users u1 WHERE u1.uid = 143)
FROM Users
WHERE age > 18
Returns tuples of (name, boolean).
Aggregates
Standard aggregation functions: COUNT, SUM, AVG, MIN, MAX
Aggregating over entire relations
SELECT count(*), avg(pop)
FROM Users
WHERE age > 18
Output contains one tuple.
SELECT COUNT(DISTINCT age) -- counts distinct age values that appear in the Users table
FROM Users
SELECT COUNT(DISTINCT uid)
FROM Members
-- number of users who have registered to at least one group
Members table contains [uid, gid].
Grouping
SELECT age, avg(pop) -- bad query: age yields many tuples but avg(pop) yields one
FROM Users
(3) SELECT age, avg(pop)
(1) FROM Users
(2) GROUP BY age
Users
uid | age | pop |
---|---|---|
143 | 10 | 1.0 |
457 | 8 | 0.2 |
854 | 10 | 0.2 |
555 | 8 | 0.2 |
Group 1: 143, 854
Group 2: 457, 555
Compute the average popularity for each age group.
Query returns the following table:
age | avg(pop) |
---|---|
10 | 0.6 |
8 | 0.2 |
avg(DISTINCT pop) may result in a different value if there are duplicate pop values within the same age group.
Notes:
- Queries with aggregate functions but no GROUP BY treat the entire relation as a single group.
- Every expression in SELECT has to either appear in GROUP BY or be an aggregate function.
HAVING: syntactic sugar to apply a predicate over each group before outputting.
List the average popularity of each age group, but only if the group contains more than 100 people (i.e. if the group is large enough).
SELECT age, avg(pop), count(*) as cnt
FROM Users
GROUP BY age
HAVING cnt > 100 -- only output if this is true
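An equivalent formulation without the alias (standard SQL allows aggregate functions directly in HAVING):
SELECT age, avg(pop)
FROM Users
GROUP BY age
HAVING count(*) > 100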
Ordering
Logical evaluation order so far: (1) FROM, (2) WHERE, (3) GROUP BY, (4) SELECT, (5) HAVING.
ORDER BY: comes after the optional GROUP BY and HAVING, and sorts the output table according to some expressions.
SELECT uid, name, pop
FROM Users
ORDER BY pop DESC, name ASC
By default, the order is ASC (ascending); DESC gives descending.
LIMIT: specifies the number of tuples to return; comes after an optional ORDER BY. It is often useful in conjunction with ORDER BY to answer top-k or bottom-k queries.
SELECT
FROM
WHERE
GROUP BY
HAVING
ORDER BY
LIMIT
SELECT name, pop
FROM Users
ORDER BY pop DESC
LIMIT 3
Gives name and popularity of 3 most popular users.
OFFSET k lets you skip the first k rows.
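For example (a sketch): skip the 3 most popular users and return the next 3:
SELECT name, pop
FROM Users
ORDER BY pop DESC
LIMIT 3 OFFSET 3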
NULL: How to represent information that is not yet known or not yet applicable
Option 1: Use a special value
- integers: a reserved value such as -1
- strings: 'N/A'
Not a very practical idea in general.
Option 2: a special is_valid column per nullable attribute
name | name_is_valid | age | age_is_valid |
---|---|---|---|
Bart | true | 0 | false |
ruins the schema
Option 3: Decompose tables: one table per attribute
Users
uid | name | age |
---|---|---|
Situation: Bart’s age is 10 and Nelson’s age is not known.
UserAge
uid | age |
---|---|
143 | 10 |
UserName
uid | name |
---|---|
143 | Bart |
555 | Nelson |
complicates things
Special SQL solution for this situation: NULL
uid | name | age | pop |
---|---|---|---|
555 | Nelson | NULL | NULL |
143 | Bart | 10 | 0.8 |
Rules about NULL values:
Any comparison of any sort with a NULL value evaluates to NULL: e.g. NULL = NULL is NULL, and 1 < NULL is NULL.
In WHERE and HAVING, if a boolean expression evaluates to NULL, we treat it as false.
SELECT count(*)
FROM Users
WHERE pop=pop
For tuples with a NULL pop, pop = pop evaluates to NULL and is treated as false, so this query counts only the users with a non-NULL pop.
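To test for NULL explicitly, SQL provides IS NULL / IS NOT NULL (an = comparison would itself evaluate to NULL):
SELECT * FROM Users WHERE pop IS NULL;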
Boolean truth table with NULLs (N = NULL):
p | q | p AND q | p OR q |
---|---|---|---|
T | T | T | T |
T | F | F | T |
T | N | N | T |
F | T | F | T |
F | F | F | F |
F | N | F | N |
N | T | N | T |
N | F | F | N |
N | N | N | N |
p | NOT p |
---|---|
T | F |
F | T |
N | N |
Rules continued:
- Aggregation functions ignore NULL values, except count(*).
SELECT avg(pop) FROM Users -- Compute the average of non-NULL popularity values.
uid | name | pop |
---|---|---|
- | - | 1.0 |
- | - | 0.8 |
- | - | NULL |
The above query will return $(1.0 + 0.8) / 2 = 0.9$.
This may not be equal to
SELECT sum(pop)/count(*) FROM Users -- sum sums non-nulls. count counts all tuples
This would evaluate to $(1.0 + 0.8) / 3 = 0.6$.
SELECT age, avg(pop)
FROM Users
GROUP BY age
DuckDB will group the NULL ages together and take the average of their pop values.
uid | name | pop | age |
---|---|---|---|
- | - | 1.0 | NULL |
- | - | 0.8 | 10 |
- | - | NULL | 8 |
- | - | 0.2 | NULL |
- | - | 0.5 | 8 |
age | avg(pop) |
---|---|
NULL | 0.6 |
10 | 0.8 |
8 | 0.5 |
SELECT sum(pop)/count(*) FROM Users
WHERE pop is not NULL
Lecture 6
OUTER JOINS: count the number of members in each group, including groups with no members.
Users
uid | name | age | pop |
---|---|---|---|
Member
uid | gid |
---|---|
143 | gov |
857 | dps |
143 | dps |
Groups
gid | name |
---|---|
xyz | secret society |
gov | student gov |
dps | dead poet society |
SELECT Groups.gid, count(*)
FROM Members, Groups
WHERE Members.gid = Groups.gid
GROUP BY Groups.gid
The above doesn't work: groups with no members (like xyz) produce no join rows, so they disappear from the output.
Left/Right/Full Outer Joins
$R$ LEFT OUTER JOIN $S$:
$R \bowtie S$, unioned with all tuples in $R$ that did not participate in the join (i.e. did not produce any outputs), padded with NULLs for each attribute coming from $S$.
Groups LEFT OUTER JOIN Members:
gid | name | uid | gid |
---|---|---|---|
gov | student gov | 143 | gov |
dps | dead poet | 857 | dps |
dps | dead poet | 143 | dps |
xyz | secret society | null | null |
$R$ RIGHT OUTER JOIN $S$ keeps dangling tuples from $S$ (appended with NULLs for attributes from $R$).
FULL OUTER JOIN keeps dangling tuples from both $R$ and $S$.
SELECT Groups.gid, count(*)
FROM Members RIGHT OUTER JOIN Groups ON Members.gid = Groups.gid
GROUP BY Groups.gid
However, we probably want the count for memberless groups to be 0 instead of 1 (count(*) counts the NULL-padded row):
SELECT Groups.gid, count(uid)
FROM Members RIGHT OUTER JOIN Groups ON Members.gid = Groups.gid
GROUP BY Groups.gid
SQL Syntax: two different decisions determine your join:
1. Join predicate: natural (equates common attributes, discards rows that don't match) vs theta (keeps only rows that satisfy a predicate $p$; the default). If you want a natural join, you must use the NATURAL JOIN keyword; a plain JOIN ... ON is a theta join.
2. Dangling tuples: outer join (keeps unmatched tuples) vs inner join (removes unmatched tuples; the default).
For each group, output the names of the members. If a group has no members, still output with null values.
SELECT *
FROM (Users NATURAL JOIN Members) RIGHT OUTER JOIN Groups ON Members.gid = Groups.gid
INSERT/DELETE/UPDATE
INSERT INTO Members VALUES (789, 'gov')
INSERT INTO Users(uid, name) VALUES (789, 'Alice')
INSERT INTO Members (SELECT uid, 'dps' AS gid -- see table below
FROM Users)
Constant expression
uid | gnames |
---|---|
143 | dps |
857 | dps |
789 | dps |
Pick every uid in the Users table, and pair it with the dps gid in Members.
DELETE
DELETE FROM Members -- removes all tuples
DELETE FROM Members
WHERE uid = 143
UPDATES
UPDATE Users SET name = 'Bob' -- will set every user's name to Bob
UPDATE Users SET name = 'Bob'
WHERE name = 'Bart' -- set all Barts to Bob
UPDATE Users
SET pop = (SELECT avg(pop)
FROM Users) -- the subqueries in updates run on the older version of the tables
Constraints: declared as part of table definitions.
CREATE TABLE Users (
uid INT,
name varchar(30) NOT NULL,
SIN INT,
PRIMARY KEY(name, SIN))
Referential Integrity Constraints
CREATE TABLE Members (
uid INT REFERENCES Users(uid),
gid varchar(30) REFERENCES Groups(gid)
)
CREATE TABLE MemberBenefits(
uid INT,
gid varchar(30),
benefit1 varchar(100),
FOREIGN KEY (uid, gid) REFERENCES Members(uid, gid)
)
Note on terminology: in SQL, the referenced attributes do not have to be the primary keys of the referenced table. They can also be attributes that have uniqueness constraints.
Members is the referencing table (uid is the referencing attribute); Users is the referenced table (uid is the referenced attribute).
Users
uid | name |
---|---|
143 | joe |
Members
uid | gid |
---|---|
143 | dps |
Question
What happens if we insert (555, 'dps') into Members?
It is rejected: there is no user with uid 555.
Question
What about deletions from Users?
Users
uid | name |
---|---|
Members
uid | gid |
---|---|
143 | dps |
Several options:
- Reject the deletion
- Cascade (also delete the referencing tuples in Members)
- Set the referencing attribute to NULL (if the referencing attribute can store NULLs)
Lecture 7
Database Modifications under Referential Integrity Constraints
Question
What happens when the referencing and referenced tables are modified?
Insertions into Users and deletions from Members need no checks.
Insertions into Members: need to check that the inserted tuple is not dangling (the referenced tuples must exist).
Deletions from Users: need to check whether there are any referencing tuples in Members; if so, there are three options:
- By default, the deletion will be rejected
- ON DELETE CASCADE: cascade the delete to the referencing tuples
- ON DELETE SET NULL: set the referencing (now dangling) attributes to NULL (only if the referencing attributes can be NULL)
(Updates = delete + insert; we can still cascade, e.g. ON UPDATE CASCADE.)
CREATE TABLE Members (
uid INT REFERENCES Users(uid) ON DELETE CASCADE
)
Tuple- or attribute-level constraints: CHECK predicates that the system enforces on every tuple. If the predicate evaluates to NULL, it is treated as not violating the constraint.
CREATE TABLE Users(
age INT CHECK (age > 10)
)
Users
uid | name | age |
---|---|---|
142 | Bart | 11 |
857 | Millhouse | 12 |
We cannot add tuples with age 10 or below: for each inserted tuple, the check predicate runs.
CREATE TABLE Products (
price INT,
discPrice INT,
CHECK(discPrice < price)
)
SQL Standard allows more “general check” constraints
CREATE TABLE Members (
uid INT,
gid varchar(30),
CHECK (uid IN (SELECT uid
               FROM Users))
)
This is allowed by the standard, but not recommended: because the constraint references Users, it has to be rechecked anytime Users is modified. Very expensive and hard to enforce.
Naming constraints: you can name your constraints as follows.
CREATE TABLE Products (
price INT CONSTRAINT highPriceConst CHECK(price > low)
)
We can drop constraints easily this way.
Schema Changes: ALTER TABLE
- ALTER TABLE tablename ADD COLUMN colName colType
- ALTER TABLE tablename DROP COLUMN colName
- ALTER TABLE tablename RENAME COLUMN oldName TO newName
- ALTER TABLE tablename ALTER COLUMN colName newDataType (existing values need to be castable to the new data type)
- ALTER TABLE tablename ADD CONSTRAINT rk_users FOREIGN KEY (uid) REFERENCES Users(uid)
- ALTER TABLE tablename DROP CONSTRAINT highPriceConst
Definition
VIEWS: virtual derived tables that are defined as arbitrary SQL queries (by default completely virtual, but part of the database schema)
- Base tables
- Views: virtual tables
- Temporary tables
CREATE VIEW XYZGroupMembers(uid, name, age) AS
(SELECT Users.uid, name, age
FROM Users, Members
WHERE Users.uid = Members.uid and Members.gid = 'xyz')
DROP VIEW XYZGroupMembers
Selecting all the users in the xyz
group. We can treat this as if it was a base table in our queries (but it is completely virtual).
SELECT * from XYZGroupMembers
Pros: simplifies your queries, e.g. can reduce the sub-queries you have to write. Often, many users of the database only care about certain derived tables rather than the raw base tables. Good for authorization.
Definition
Materialized View: the tuples of the view are computed and stored in the database.
Pros: Very good idea because no query is faster than a simple lookup in an already computed table.
Cons: Hard to keep them fresh as base tables change.
Incremental view maintenance: how to automatically keep views updated.
A very challenging problem.
Question
Should views be updatable?
Generally no, but the SQL Standard defines a set of restricted views (basically over a single table) that can be updatable; these are generally supported by systems.
CREATE VIEW AvgPopTable(avgpop) AS
(SELECT avg(pop)
FROM Users)
-- Single column, single value.
If we materialize this:
CREATE MATERIALIZED VIEW AvgPopTable(avgpop) AS
(SELECT avg(pop)
FROM Users)
UPDATE AvgPopTable SET avgPop = 0.5; -- ambiguous how to propagate to the base tables -> not allowed
Definition (WITH)
Temp Tables: Defined within the scope of a single query, not part of the database schema
WITH tmp(uid) AS (
SELECT uid FROM Users
WHERE age > 10
)
SELECT gid FROM Members, tmp
WHERE Members.uid = tmp.uid
Note: You need to use them as tables and not string substitutions
WITH tmp as (...)
SELECT *
FROM Members
WHERE Members.uid IN tmp;
-- THIS IS NOT VALID SYNTAX
WITH tmp as (...)
SELECT *
FROM Members
WHERE Members.uid IN (SELECT *
FROM tmp);
Lecture 8
Motivating example of using indexes.
SELECT * FROM Users WHERE name = 'Bart';
Question
Can we go directly to rows with name=‘Bart’ instead of scanning the entire table?
An index is an auxiliary persistent data structure that helps with efficient searches.
CREATE [UNIQUE] INDEX indexname ON tablename(col1, ..., coln);
With UNIQUE, the DBMS will enforce that (col1, ..., coln) is a key of tablename.
An index on R.A can speed up accesses of the form:
- R.A = value
- R.A > value
An index on (R.A, R.B) can speed up:
- R.A = value AND R.B = value
- R.A = value AND R.B > value
The ordering of index columns is important: entries are sorted on the first/major column, with ties broken by the second/minor column.
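A sketch (the index, table, and column names here are hypothetical):
CREATE INDEX idx_ab ON R(A, B);
-- can serve: WHERE A = 5, and WHERE A = 5 AND B > 10
-- cannot serve efficiently: WHERE B = 10 (B is the minor column)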
Programming Applications with SQL
Challenge of using SQL in a real app:
- Not intended for general-purpose computation
Solutions
- Augment SQL with constructs from general-purpose programming languages
- Use SQL together with general-purpose programming languages
- Through an API
- Embedded SQL e.g. in C
- SQL generating approaches: Web Programming Frameworks (e.g., Django)
- Augmenting SQL: SQL/PSM
An ISO standard to extend SQL to an advanced programming language
Several systems adopt this partially (e.g. MySQL, PostgreSQL)
PSM = Persistent Stored Modules
- Working with SQL through an API
E.g.: Python psycopg2, JDBC, ODBC
The application program sends SQL commands to the DBMS at runtime and gets back a "cursor" that can iterate over result tuples.
- Based on the SQL/CLI standard
Results are converted to objects in the application program.
Functionalities provided:
- Connect/Disconnect to a DBMS ⇒ get a connection object
- Execute SQL queries
- Iterate over result tuples
conn.set_session(autocommit=True)
bar = input('Enter the bar to update: ').strip()
beer = input('Enter the beer to update: ').strip()
price = float(input('Enter the new price: '))
try:
    cur.execute('''
        UPDATE Serves
        SET price = %s
        WHERE bar = %s AND beer = %s''', (price, bar, beer))
if cur.rowcount != 1:
print('{} row(s) updated: correct bar/beer?'\
.format(cur.rowcount))
except Exception as e:
print(e)
A lot of overhead.
Prepared statements: Example
cur.execute('''
    PREPARE update_price AS            -- prepare once (in SQL); name the prepared plan
    UPDATE Serves SET price = $1       -- note the $1, $2, ... notation for
    WHERE bar = $2 AND beer = $3''')   -- parameter placeholders
while True:                            # input bar, beer, price ...
    cur.execute('EXECUTE update_price(%s, %s, %s)',
                (price, bar, beer))    # execute many times; check result ...
The school probably had
cur.execute("SELECT * FROM Students " + \
"WHERE (name = " + name + "')")
This is an SQL injection attack.
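To see why (the malicious input below is a hypothetical illustration, in the spirit of the classic "Bobby Tables" example): if the user enters
Robert'); DROP TABLE Students; --
as the name, the concatenated string becomes
SELECT * FROM Students WHERE (name = 'Robert'); DROP TABLE Students; --')
and the injected DROP TABLE statement executes along with the query.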
Pros of augmenting SQL:
- More processing features for DBMS
- More application logic can be pushed closer to data
Cons of augmenting SQL:
- SQL is already too big
- Complicates optimization and makes it impossible to guarantee safety
- Embedding SQL in a host language
Can be thought of as the opposite of SQL/PSM
Extends a host language e.g., C or Java, with SQL-based features
Can compile the host language together with SQL statements and catch SQL errors at application compile time
- Web Programming Frameworks
A web development “framework” e.g., Django or Ruby on Rails
Very frequent approach to web apps that need a DB.
For most parts, no explicit writing of SQL is needed.
SQL Recursion: one of the main shortcomings of relational algebra is the lack of recursion.
Need to extend SQL when we need to perform some recursive computation
Flights
fr | to | cost |
---|---|---|
A | C | … |
A | B | … |
B | D | … |
D | F | … |
C | E | … |
C | D | … |
E | F | … |
Question
Find all pairs of cities $(a, b)$ such that there is a direct or indirect flight path from $a$ to $b$.
This is a transitive closure query.
If you knew the "depth" of your flight graph, one approach would be to write a query for each depth and union them:
(SELECT fr, to FROM Flights)
UNION
(SELECT F1.fr, F2.to
FROM Flights F1, Flights F2
WHERE F1.to = F2.fr)
The second part of the union handles paths of length 2. We would need to repeat this for lengths 3, 4, etc.
This won’t work if you don’t know the depth, and often we won’t know the depth. Also very error prone code if depth is large.
SQL’s solution: Recursive Common Table Expressions (CTEs) that are “fixed points” of recursively defined temporary relations using the WITH RECURSIVE
clause.
In math, the fixed point of a function $f$ starting from $x_0$ is the value that the sequence $x_0, f(x_0), f(f(x_0)), \ldots$ converges to.
For example, with $f(x) = \frac{1}{2}(x + a/x)$ for a positive number $a$, starting from an arbitrary positive number $x_0$, this converges to $\sqrt{a}$.
WITH RECURSIVE T AS Q
means $T$ is the fixed point of $Q$ starting from $T_0 = \emptyset$, where $Q$ refers to $T$.
$T_1 = Q(T_0)$, $T_2 = Q(T_1)$, ... until $Q(T_i) = T_i$: that fixed point is $T$.
Overall syntax:
WITH RECURSIVE T AS Q -- the recursive definition; Q refers to T, e.g. Q = SELECT * FROM T, Flights ...
Q*                    -- the final query, which also refers to T, e.g. Q* = SELECT * FROM T
WITH RECURSIVE AP(fr, to) AS (
SELECT AP.fr, F.to
FROM AP, Flights F
WHERE AP.to = F.fr
)
This is wrong. The fixed point of a recursive relation is defined as the fixed point starting from the empty set: $AP_0$ is the empty set, and every subsequent $AP_i$ remains empty because the join of the empty $AP$ with Flights is empty.
Fix: write $Q$ as a union of a base query $Q_b$ and a recursive query $Q_r$:
WITH RECURSIVE AP(fr, to) AS (
(SELECT fr, to FROM Flights) -- Qb
UNION
(SELECT AP.fr, F.to
FROM AP, Flights F
WHERE AP.to = F.fr))
It is very easy to write queries that won't converge. In practice, it is the user's responsibility to ensure that recursive relations converge.
Question
When do recursive relations converge?
Lecture 9
SQL Recursive Relations
A recursive relation $T$ is defined as the fixed point of a query $Q(T)$, starting from $T_0 = \emptyset$.
Example: all pairs of cities $(a, b)$ such that $a$ can reach $b$ through 1 or more flights.
WITH RECURSIVE AP(fr, to) AS (
--Qb
(SELECT DISTINCT fr, to FROM Flights)
UNION
--Qr
(SELECT DISTINCT AP.fr, F.to
FROM AP, Flights F
WHERE AP.to = F.fr))
--the entirety of the above is Q(AP)
The recursive definition is followed by a final query that uses AP.
$AP_1$ (this is just Flights):
fr | to |
---|---|
A | B |
A | C |
B | D |
C | D |
C | E |
D | F |
E | F |
F | G |
This is all pairs reachable with 1 hop.
$AP_2 -$ Flights (every pair that is reachable with exactly 2 hops):
fr | to |
---|---|
A | D |
B | F |
A | E |
C | F |
D | G |
E | G |
$AP_2$ is all pairs reachable with $\leq 2$ hops.
$AP_3 -$ Flights is every pair that is reachable with 2 or 3 hops.
Together, $AP_3$ is all pairs reachable with $\leq 3$ hops.
$AP_4 -$ Flights is every pair that is reachable with 2, 3, or 4 hops.
This continues until no new pairs appear.
SELECT fr
FROM AP
WHERE to='F'
Not every $Q$ will converge:
WITH RECURSIVE T AS (
(SELECT * FROM R)
UNION
(SELECT SUM(n) FROM T) -- n: T's single numeric column; the aggregate keeps producing new values as T grows
)
Question
What guarantees convergence?
In general, if $Q$ is monotone in $T$, and $Q$ is limited to core relational algebra, the iteration will converge.
Key takeaway: Something being monotone (with some caveats) is a sufficient condition, but not necessary. That is, we can write non-monotone queries that also converge.
It is your responsibility to ensure that your recursive queries converge. Systems will not help much.
Recall: $Q$ is monotone in a relation $T$ if, when we compute $Q(T)$ and then add tuples to $T$ (i.e. $T \subseteq T'$), then $Q(T) \subseteq Q(T')$.
Argument for why $Q$ being monotone in $T$ guarantees convergence (if $Q$ is limited to core relational algebra):
Any tuple that is output by $Q$ contains values that came from the base relations, which are finite.
Therefore $Q$'s output can contain only a finite number of tuples.
So if $Q$ is monotone in $T$, after some finite number of iterations it has to converge.
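A compact restatement (sketch): with $T_0 = \emptyset$ and $T_{i+1} = Q(T_i)$, monotonicity gives the chain
$$\emptyset = T_0 \subseteq T_1 \subseteq T_2 \subseteq \cdots$$
and since every tuple's values come from the finite base relations, the chain is bounded above by a finite set, so at some step $T_{i+1} = T_i$: a fixed point.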
WITH RECURSIVE T AS (
(SELECT * FROM R)
UNION
(SELECT * FROM T)
)
Don't do this.
WITH RECURSIVE T AS (
(SELECT * FROM R)
UNION
(SELECT * FROM T)
UNION
(SELECT x+1 FROM T)
)
This is no longer finite: x+1 keeps manufacturing values that are not in the base relations.
Linear vs non-linear recursion
Linear: $Q$ refers to $T$ once.
Non-linear: $Q$ refers to $T$ more than once.
WITH RECURSIVE AP(fr, to) AS (
(SELECT fr, to FROM Flights)
UNION
(SELECT AP1.fr, AP2.to
FROM AP AP1, AP AP2
WHERE AP1.to = AP2.fr))
$AP_1$ = Flights.
$AP_2$ adds pairs of exactly 2 hops.
$AP_3$ adds pairs of exactly 2, 3, or 4 hops (the maximum path length doubles each iteration).
In fact, the SQL Standard does not allow non-linear recursive queries, and disallows many operations, such as aggregation, that might lead to non-monotonicity.
Mutually recursive relations:
WITH RECURSIVE
T1 AS Q1,
T2 AS Q2,
...
Tn AS Qn
Natural: n = 1, 2, ..., 100
WITH RECURSIVE
Even AS (SELECT n FROM Natural
         WHERE n - 1 IN (SELECT n FROM Odd)),
Odd AS ((SELECT n FROM Natural WHERE n = 1)
        UNION
        (SELECT n FROM Natural
         WHERE n - 1 IN (SELECT n FROM Even)))
Question
If we add to R’ = R \cup \Delta ROUT\Delta OUT$?
When is linearly recursive
Advantage of linear recursion is if is monotone, a system can only use at each step of the fixed point computation instead of , to compute and determine if converged.
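A sketch of the resulting evaluation strategy (known as semi-naive evaluation), writing $\Delta T_i = T_i - T_{i-1}$ for the newly derived tuples:
$$T_{i+1} = T_i \cup Q_r(\Delta T_i)$$
Each iteration feeds only the new tuples back into the recursive query $Q_r$ instead of recomputing $Q_r$ over all of $T_i$, and the computation converges when $\Delta T_i = \emptyset$.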
Lecture 10
Theory of Normal Forms:
Goal: how do we formally tell apart good database designs from bad ones?
Inst Dep
iID | name | salary | depName | bldg | budget |
---|---|---|---|---|---|
111 | Alice | 5k | CS | DC | 20k |
222 | Bob | 4k | Phy | Ph | 30k |
333 | Carl | 7k | CS | DC | 20k |
111 | Alice | 5k | Phy | Ph | 30k |
Informally: bad designs have avoidable redundancy.
Why avoid redundancy:
- Hard to maintain the database and keep it consistent
- The database size goes up and the system slows down.
Note: to understand redundancy, you need information about application constraints, and application constraints are external information.
Upshots:
flowchart LR
x[DB] ---> a[DB Design Theory]
y[Ext App Constraints - FD] ---> a[DB Design Theory]
a ---> b[Good vs Bad Design - normal form]
flowchart LR
a[DB bad design] ---> b[Decomposition Tool]
c[App Constraints] ---> b[Decomposition Tool]
b ---> e[Good Design]
Desiderata for decomposition:
- Lossless: if $R$ is decomposed into $R_1, R_2$, we want $R_1 \bowtie R_2 = R$.
iID | Name | Salary | Dept |
---|---|---|---|
111 | Alice | 5k | CS |
222 | Bob | 4k | Phy |
333 | Carl | 3k | CS |
111 | Alice | 5k | Phy |
depName | Bldg | Budget |
---|---|---|
CS | DC | 20k |
Phy | Ph | 30k |
This is lossless.
- Locality of constraints: if the app has a constraint $C$ that we could check directly on $R$, then after decomposition we ideally still want to check $C$ on a single relation.
Constraint: each student can only have advisor(s) from one department.
sID | iID | idepName |
---|---|---|
s1 | i1 | d1 |
s2 | i2 | d2 |
sid | iID |
---|---|
s1 | i1 |
s2 | i2 |
iID | idep |
---|---|
i1 | d1 |
i2 | d2 |
We can’t check directly when the relation is split.
Expressing app constraints: functional dependencies generalize uniqueness constraints.
Definition
A FD $X \to Y$, where $X$ and $Y$ are sets of attributes, holds in $R$ if for each possible $X$ value, there is only one possible $Y$ value.
More formally: if $t_1[X] = t_2[X]$ then $t_1[Y] = t_2[Y]$.
You can express key constraints as FDs.
depName $\to$ (bldg, budget)
(iID, depName) $\to$ (name, salary, bldg, budget)
If $X \to Y$ holds and $X \cup Y$ is all the attributes of $R$, then $X$ is a super key of $R$.
Implied FDs: Armstrong's Axioms:
- Reflexivity: if $Y \subseteq X$ then $X \to Y$ holds trivially.
- Augmentation: if $X \to Y$, then $XZ \to YZ$.
- Transitivity: if $X \to Y$ and $Y \to Z$, then $X \to Z$.
Derived rules based on Armstrong's axioms:
4. Decomposition: if $X \to YZ$, then $X \to Y$ and $X \to Z$.
Proof: assume $X \to YZ$. By reflexivity, $YZ \to Y$; by transitivity, $X \to Y$ (and similarly $X \to Z$).
5. Union: if $X \to Y$ and $X \to Z$, then $X \to YZ$.
6. Pseudo-transitivity: if $X \to Y$ and $YW \to Z$, then $XW \to Z$.
Closure of a set of FDs $F$ ($F^+$): all FDs implied by $F$.
Example:
- iID $\to$ (name, depName)
- depName $\to$ bldg
These are both in $F$. By decomposition, iID $\to$ depName; combined with depName $\to$ bldg, we get iID $\to$ bldg (by transitivity), which is in $F^+$.
InstProj
iID | name | projID | projName | projDep | hours | funds |
---|---|---|---|---|---|---|
- iID $\to$ name
- projID $\to$ (projName, projDep)
- (iID, projID) $\to$ hours
- (projDep, hours) $\to$ funds
Question
Prove (iID, projID) $\to$ funds
Proof: (iID, projID) $\to$ hours is given. projID $\to$ projDep (decomposition of FD 2), so (iID, projID) $\to$ projDep. By union, (iID, projID) $\to$ (projDep, hours). Since (projDep, hours) $\to$ funds, transitivity gives (iID, projID) $\to$ funds.
Algorithm for computing $F^+$ (see slides): set $F^+ := F$ and, in iterations, using one or two FDs in $F^+$ and Armstrong's axioms, try to derive new FDs until convergence.
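A worked example of the closely related attribute-closure computation, $X^+$ (the set of attributes determined by $X$): take $X = \{iID, projID\}$ under the InstProj FDs above.
Start: $X^+ = \{iID, projID\}$
Apply iID $\to$ name: add name
Apply projID $\to$ (projName, projDep): add projName, projDep
Apply (iID, projID) $\to$ hours: add hours
Apply (projDep, hours) $\to$ funds: add funds
$X^+$ now contains every attribute of InstProj, so (iID, projID) is a super key.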
Boyce - Codd Normal Form (BCNF):
Definition
Given FDs $F$ and a relation $R$, $R$ is in BCNF if, for every FD $X \to Y \in F^+$ such that $R$'s attributes contain $X \cup Y$, one of the two following conditions holds:
- $X \to Y$ is trivial, i.e. $Y \subseteq X$, or
- $X$ is a super key of $R$
Consider the following FDs on InstDep
- iID $\to$ (name, salary)
- depName $\to$ (bldg, budget)
Question
Is InstDep in BCNF?
No, because iID $\to$ (name, salary) holds and iID is not a super key.
R1
iID | name | salary | depName |
---|---|---|---|
R2
depName | bldg | budget |
---|---|---|
R1 is still not in BCNF: iID $\to$ (name, salary) holds and iID is not a super key of R1. Decompose it further into
iID | name | salary |
---|---|---|
and
iID | depName |
---|---|
Lecture 11
iid | name | salary | depName | bldg | budget |
---|---|---|---|---|---|
111 | Alice | 5k | CS | DC | 20k |
111 | Alice | 5k | Phy | Phy | 30k |
222 | Bob | 4k | CS | DC | 20k |
333 | Carl | 3k | CS | DC | 20k |
iID $\to$ (name, salary)
depName $\to$ (bldg, budget)
Greedy Top-Down BCNF Algorithm:
Input: a relation $R$ and FDs $F$.
rels = $\{R\}$
1. Find a FD $X \to Y$ violating BCNF in some relation $R_i \in$ rels.
2. Split $R_i$ into $R_{i1} = X \cup Y$ and $R_{i2} = X \cup (\text{attrs}(R_i) \setminus Y)$ ($X$ is the shared set of attributes). rels = (rels $\setminus \{R_i\}$) $\cup \{R_{i1}, R_{i2}\}$.
3. Repeat until no such FD exists.
Note 1: Need to consider $F^+$ and not just $F$, because there can be non-trivial implied FDs that violate BCNF.
Note 2: BCNF decompositions are not unique.
Note 3: Lossless. At each split, $X$ is the set of shared attributes and $X$ is a key of $R_{i1}$. Therefore the join of $R_{i1}$ and $R_{i2}$ on $X$ gives back exactly $R_i$.
Note 4: Not dependency preserving.
DeptAdvisor
FDs
- (sID, depName) $\to$ iID. In English: a student can have at most one advisor from each department.
- iID $\to$ depName
sID | depName | iID |
---|---|---|
s1 | CS | i1 |
s1 | Phy | i2 |
Can break this down into
iID | depName |
---|---|
i1 | CS |
i2 | Phy |
sID | iID |
---|---|
s1 | i1 |
s1 | i2 |
s2 | i2 |
Can’t tell the functional dependencies from these separated tables.
BCNF is not dependency preserving for two reasons:
- The BCNF definition is too strict.
- The BCNF decomposition algorithm, which is top-down, can break FDs. We need a new decomposition algorithm.
Note that it is fine for some functional dependencies to not be locally preserved as long as they are implied by other locally preserved functional dependencies.
DepAdvisor
FDs
- iID $\to$ depName
- sID $\to$ iID
- Transitive: sID $\to$ depName
sID | iID | depName |
---|---|---|
We can break this down into:
iID | depName |
---|---|
and
sID | iID |
---|---|
Restriction: let $F$ be a set of FDs in $R$. The restriction of $F$ to $R_1$ is the set of FDs $X \to Y$ in $F^+$ such that $X \cup Y \subseteq$ attributes($R_1$).
Dependency Preservation: let $R_1, \ldots, R_n$ be a decomposition of $R$. Let $F$ be a set of functional dependencies in $R$, and let $F_i$ be the restriction of $F$ onto $R_i$, for $i = 1, \ldots, n$. The decomposition is dependency preserving if $(F_1 \cup \cdots \cup F_n)^+ = F^+$.
Definition (3NF)
A relation $R$ is in 3NF, given a set $F$ of FDs, if for every FD $X \to Y$ in $F^+$ one of 3 conditions holds:
- $X \to Y$ is trivial
- $X$ is a super key
- every attribute in $Y \setminus X$ is part of some candidate key.
DeptAdv
FDs
- (sID, depName) $\to$ iID
- iID $\to$ depName
sID | depName | iID |
---|---|---|
S1 | CS | 111 |
S1 | Phy | 222 |
S2 | CS | 111 |
This is not in BCNF, because iID $\to$ depName holds and iID is not a super key.
However, it is in 3NF: depName is part of the candidate key (sID, depName).
3NF Decomposition Algorithm: a bottom-up algorithm.
Minimal/Canonical cover of FDs $F$: $F'$ is minimal if $F'^+ = F^+$ and:
- the right-hand side of every FD consists of a single attribute, and
- no FD, and no attribute on the left-hand side of an FD, can be removed without changing the closure (i.e. there is no $Z \subset X$ such that replacing $X \to A$ by $Z \to A$ preserves $F^+$).
Let $F'$ be a minimal cover of $F$. For each $X \to A \in F'$, create a relation with attributes $X \cup \{A\}$.
Fix 1: if the result does not contain a relation whose attributes contain a candidate key of $R$, add a relation that consists of a candidate key of $R$.
Fix 2: if there are any redundant relations (one relation's attributes contained in another's), remove them.
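A worked sketch on DeptAdvisor (assuming the FDs above): a minimal cover is {(sID, depName) $\to$ iID, iID $\to$ depName}. Creating one relation per FD gives R1(sID, depName, iID) and R2(iID, depName). R1 contains the candidate key (sID, depName), so Fix 1 adds nothing; R2's attributes are a subset of R1's, so Fix 2 removes R2. The result is DeptAdvisor itself, which, as shown above, is in 3NF and preserves both FDs.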
4NF: MVDs are generalizations of FDs.
An MVD $X \twoheadrightarrow Y$ holds in $R$ if, given tuples $t_1, t_2$ with $t_1[X] = t_2[X]$, there exists a tuple $t_3$ such that $t_3[X] = t_1[X]$, $t_3[Y] = t_1[Y]$, and $t_3[Z] = t_2[Z]$, where $Z$ is the set of all remaining attributes.
iID | email | dep |
---|---|---|
111 | alice@gmail | CS |
111 | alice@hotmail | CS |
111 | alice@yahoo | Phy |
111 | alice@gmail | Phy |
Lecture 12
MVDs: an MVD $X \twoheadrightarrow Y$ holds in $R$ if, given some value for $X$, the set of $Y$ values that appear with it is "conditionally independent" of the values of the remaining attributes.
Claim: Every FD is also an MVD
iid | email | dept |
---|---|---|
111 | alice@gmail | CS |
111 | alice@hotmail | Phy |
111 | alice@gmail | Phy |
111 | alice@hotmail | CS |
iid $\twoheadrightarrow$ email holds. The relation is in BCNF but not in 4NF.
Definition
4NF: $R$ is in 4NF if, for every MVD $X \twoheadrightarrow Y$, either $X$ is a key or the MVD is trivial.
iid | email |
---|---|
iid | dpt |
---|---|
Each has only a trivial MVD, because the set of remaining attributes $Z$ is empty.
Entity/Relationship DB Design Model:
Humans think of the world in object-oriented manner instead of flat tables
- Objects/entities/things
- Relationships between objects
- Attributes on relationships and objects
Upshots:
flowchart TD
a[DB Design] --> b[DB Logical Record Model]
b --> c[Physical Record Design]
flowchart TD
a[E/R] --- b[Relations]
a --- c[Network Records]
Ingredients of the E/R Model: Relationships between entity sets are always explicitly drawn
- Entity Sets
- Relationship sets
- Attributes on entities and relationships
Attributes are drawn as ovals attached to entity sets and relationship sets.
$n$-ary relationships:
Rules:
Primary keys: for entity sets, they are explicitly defined by underlining some attributes.
For relationship sets: the union of the primary keys of the participating entity sets.
Consequence: Cannot have parallel relationships between two objects/entities.
The “trick” to model parallel relationships is to turn them into entity sets (reification).
Multiplicities of binary relationship sets
Many-to-many:
Each $a \in A$ can be related to any number of $b$'s, and each $b \in B$ can be related to any number of $a$'s.
Many-to-one:
Each $a \in A$ can be related to 0 or 1 $b$'s; each $b \in B$ can be related to any number of $a$'s.
Each user can be part of at most one group.
One-to-one:
Each $a \in A$ can be related to 0 or 1 $b$'s, and vice versa.
In the original E/R model, there is a general multiplicity constraint.
E.g., each User can be in 2-5 groups; each Group can have between 1-100 Users.
Weak Entity Sets
Primary key for isIn: (bID, rNo), which is already the primary key of Rooms.
Option 1: remove bID from Rooms. Not ideal, because Rooms would then need a new primary key attribute.
Option 2: remove isIn completely. Not ideal, because it makes the relationship between Rooms and Buildings implicit.
Weak Entity Set: captures such scenarios, where a "weak" entity set derives part of its identifier from the dominant entity set it depends on.
Primary key of a weak entity set is the union of attributes of its dominant entity set and the discriminator attributes of the weak entity set.
IsA relationship: The tool to construct class hierarchies.
Aggregations: Tool to link entities to relationships (covered above).
Questions like "what are my entities?" and "which are weak?" are more of an art.
Lecture 13
ISA Relationships:
2 constraints:
- Overlapping vs Disjoint
Overlapping: A superclass entity can be a member of multiple subclasses
Disjoint: A superclass entity can be a member of at most 1 subclass
- Partial vs Total
Partial: a superclass entity does not have to be a member of any subclass
Total: Every superclass entity has to be a member of some subclass entity.
Translating ER Diagram into Relational DB Schemas
Not every piece of information in an ER diagram may be translatable, e.g. general multiplicities.
Sometimes there are multiple options: there is no right/wrong answer, only tradeoffs.
(Strong) Entity Set (ES): Direct translation, where each strong ES is its own relation with the same attributes and primary keys. For composite attributes, take the lowest-level attribute.
Relationship Sets (RS): IsMember(uid, gid, fromDate)
(PK is {uid, gid})
By default (when each participating entity set is in a many-to-many multiplicity), a new table with attributes: the PKs of the participating entity sets plus any attributes of the relationship set
PK: the PKs of the participating entity sets
FKs: to the PKs of the participating entity sets
However, if one of the participating entity sets ($E$) has a many-to-one multiplicity, then you can simplify the PK of IsMember to just the PK of $E$.
Users(uid*, name)
Group(gid*, gname)
IsMember(uid*, gid, fromDate)
Or, folding the many-to-one relationship into the entity table: Users(uid*, name, gid, fromDate)
(* = primary key)
Roles:
Person(pid, name)
ParentOf(parent_id, child_pid)
Only affect the names of the attributes in the translated relation.
Weak ES (WES)
Make them a table of their own
- Attributes: PK of its dominating ES + its discriminator attributes + any other attributes
- PK: PK of its dominating ES + its discriminators
- FK: to the PK of its dominating ES
Building(name*, year)
Rooms(b_name*, rnumber*, capacity)
FK between b_name
in Rooms
and name
in Building
Seats(b_name*, rnumber*, snumber*, size) (discriminator: snumber)
FK between b_name, rnumber
in Seats
to b_name, rnumber
in Rooms
Question
How about supporting relationships?
We can forget about them: the FK to the dominating ES already captures them.
ISA Relationships: Upshots: There are multiple options, each offering different tradeoffs. Use your judgement but some options may not work in some cases.
Option 1: Entity-in-all-superclasses:
Each entity set translates to a separate table. Info about a single entity can appear in multiple tables.
Students(sid*, name)
Undergrads(sid*, major)
Grads(sid*, advisor)
Option 2: Entity-in-most-specific subclass:
Only have tables for the most-specific subclasses. This can only work for total ISA relationships.
Undergrads(sid*, name, major)
Grads(sid*, name, advisor)
Pro: An entity’s attributes are always in one table (if the relationship is disjoint)
However, not a very elegant solution. Removes the abstraction of a Student.
Option 3: Single top-most table:
Students(sid*, name, major, advisor)
Pro: simple in terms of number of relations in the schema.
Cons: setting yourself up for NULL values (don't want this)
We can also land between these three options:
Students(sid*)
Undergrads(sid*, name, major)
Grads(sid*, name, advisor)
Aggregation
Question
How do we translate the aggregation to a table?
You don't translate it into its own table.
Question
How to translate the eval_for RS?
Same as translating any RS, except that in place of a PK for the aggregation you just use the PK of projGuide:
eval_for(eID*, iID*, pID*, sID*, rating)
Lecture 14
High-level DBMS Architecture
Transformation of Query’s structure
- Text
- An in-memory representation
- Logical plan
- Physical plan
The parser takes the query and turns it into an abstract syntax tree based on the SQL grammar, performing syntactic checks. This is similar to the process in CS241.
We take this element of parsing into a tree from linguistics.
After we create the tree, we bind.
The binder binds type information to columns and expressions. Uses a system component called Database Catalog. This is where we can see type errors.
We then move to a normalizer. This converts comparisons, performs constant arithmetic, simplifies boolean expressions, etc.
Then we move to the logical plan translator. This maps the normalized abstract syntax tree to an initial unoptimized logical query plan, made of high-level operators.
flowchart TD
a[Project cid] --- b[Filter name=BookA]
b --- c[Join C.cid=O.cid]
c --- d[Join P.pid=O.pid]
c --- e[Scan Customer Table]
d --- f[Scan Table Product]
d --- g[Scan Table Order]
The optimizer then takes this initial plan and uses a set of rules and estimated costs to optimize it. For example, we may push down the filter.
flowchart TD
a[Project cid] --- b[Join C.cid=O.cid]
b --- c[Join P.pid=O.pid]
b --- f[Scan Table Customer]
c --- e[Scan Table Order]
c --- d[Filter name=BookA]
d --- g[Scan Table Product]
Next step is the physical plan translator. This translates the logical operators into physical operators. Logical operators are placeholders to describe overall algorithm, and physical operators are classes in the implementation language of the DBMS that have functions that operate on tuples.
flowchart TD
a[Project cid] --- b[HashJoin L.$3=R.$3]
b --- c[HashJoin L.$3=R.$2]
b --- d[IndexScan Product name=BookA]
c --- e[FileScan Order]
c --- f[FileScan Customer]
The query executor then executes physical plans. It could be single-threaded, but all modern systems parallelize queries: they implement a task-scheduling mechanism in which a pool of worker threads execute plans in parallel and coordinate.
Storage Manager and Physical Organization Designs
DBMSs store data persistently on durable storage devices: hard disks, flash disks, or more recently non-volatile memory. Hence our drawing of DBMSs as disks.
But query processing is computation and all computation ultimately happens in memory.
RDBMSs store data in table/index files on disk. Several physical operators access data in files: most importantly scans, but sometimes also joins.
File accesses happen via classes implemented in StorageManager component.
Physical Operator/Storage interface
class ScanTable : public PhysicalOperator {
  string tableFilename;
  DBFileHandle* dbFH; // interface to the table's file
  int nextRecordID;
  // Assume the parent operator has access to outTuple as well
  Tuple* outTuple;

  void init() {
    dbFH = new DBFileHandle(tableFilename); // e.g. order.db
    nextRecordID = 0;
  }

  // each op in the tree calls its child to get the next tuple
  Tuple* getNextTuple() {
    if (nextRecordID < dbFH->maxRecordID) {
      dbFH->readRecord(outTuple, nextRecordID);
      nextRecordID++;
      return outTuple;
    }
    return nullptr; // no more tuples
  }
};
Functions such as readRecord in DBFileHandle-like classes that form the operator/storage interface know about physical data organizations:
- Record layout in files
- Page/Block layout of records in files
Question
Why is there another unit of data called page/block?
Question
Potential goals of DBMS implementors when designing physical layouts?
Question
Common record and page layouts?
Potential goals of physical data design
- Minimize I/O
- Pack records into a small number of pages
- Keep related data close in storage devices
- Cache pages of data that is likely to be used
- Maximize sequential reads/writes from storage devices
- Simplicity/speed of accessing values
The operating system reads/writes data from and to disk in pages, so the DBMS should also read/write data in pages.
Random I/O is slow; sequential I/O is much faster. Reading 1GB of 64-bit ints randomly vs sequentially in RAM can show an ~80x difference.
Question
Why is Random I/O Slow on Disks?
Time to read $k$ bytes: seek time + rotational delay + transfer time. Explained further in "What Every Programmer Should Know About Memory".
Lecture 15
Thank you to https://eugenezhang.me/ for Lecture 15 notes.
Row-Oriented: $N$-ary Storage Model (NSM)
We assume:
- Each page is of fixed size (4, 8, or 16KB)
- Each record consists of fixed-size fields, of total size $s$
- No compression (or, if there is compression, it does not violate the fixed-size assumption)
Customer
cid | name | isGoldMember | total |
---|---|---|---|
uint64 | varchar(60) | bool | |
8 bytes | 60 bytes | 1 byte | $s = 69$ bytes |
The advantage of making each field the same size is that we can use simple arithmetic to find the beginning of each record.
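For instance (a sketch; the 4KB page size is an assumed example, with $s = 69$ bytes from the Customer table above): a page holds $\lfloor 4096 / 69 \rfloor = 59$ records, so record $i$ lives on page $\lfloor i / 59 \rfloor$ at byte offset $(i \bmod 59) \times 69$ within that page.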
Pros: Good for queries that access/write all or many fields
SELECT * FROM Customer WHERE cID = 203
INSERT INTO Customer VALUES (505, 'Dan', 1) -- only one I/O needed, good throughput
Cons: queries that access/write few fields still force you to read the entire table with full scans (only 1 of the 69 bytes per record is used to check isGoldMember).
SELECT Count(*) FROM Customer WHERE isGoldMember = true
DBMS have a buffer manager (BM) component that keeps a set of file pages in memory to reduce I/O.
Goal of BM: keep pages that are likely to be accessed in memory (i.e. LRU).
Column Record Design
cid column
101 | 203 | 107 |
---|---|---|
Name column
Alice | Bob | Carl |
---|---|---|
isGoldMember column
1 | 0 | 1 |
---|---|---|
Pros:
- Good for queries that access few columns
- Easier to apply compression
- Easier to insert/remove columns from records
For SELECT *, we must do 3 I/Os (one each for cid, name, isGoldMember).
Cons: Queries that access many columns
Variable-Length Fields
More complex than fixed-length fields. This assumes row-oriented NSM storage.
Option 1: delimiters. Simple if records have only 1 variable-length field; it gets more complicated if there are multiple variable-length fields.
Option 2: overflow fields: turn variable-length fields into fixed-length (8-byte) pointers to an overflow area.
Every field is then fixed-length (so no page-level offsets are needed), but reading a variable-length field costs an extra pointer dereference.
Question
So what do we do with
NULLS
?
Several options, but the common strategy is to store null bits (a bitmap marking which fields are NULL).
Lecture 16
Indices: the primary mechanism to provide sorted access to records in a relation. They speed up lookups and range queries, but can help several other queries as well. They are persistent data structures stored alongside the table files to speed up search.
WHERE R.A = v -- lookup on v
WHERE R.A > 50
WHERE R.A > 50 AND R.A < 100
SELECT X
FROM R, S -- merge join operations that access data through indices
WHERE R.A = S.B -- could benefit from indices on R.A and S.B
SELECT age, avg(salary)
FROM Employee
GROUP BY age
An index on (salary, age) (salary is the major column, age the minor), or on (age, salary), could serve this query (assuming a row-oriented DBMS).
Upshot: 5 designs that progressively improve on each other and lead to B+ trees.
Question
How can you provide sorted access to the tuples of a relation?
Naïve approach: keep the relation pages in sorted order and sequential on disk, maintaining the order by shifting records.
Con: Very expensive in terms of I/O to keep pages sorted.
2nd Approach: Single-level dense index
Relation pages are not sorted. A single level of dense index pages is kept sorted and maintained by shifting. Dense: every key appears in the index.
3rd Approach: Single-level sparse index with overflows.
Could be a practical implementation if the application bulk-ingests a lot of data and the relations remain relatively stable.
Upon insertions we put records into overflow pages.
There are two problems.
Problem 1: Index pages can still be quite large.
Problem 2: If the assumptions fail, can create very long overflow pages that make the approach degenerate to the naïve approach.
4th Approach: Multiple levels of sparse indices.
In multi-level sparse indices, instead of storing a flat list of (key, ptr), (key, ptr), … pages, we can organize the sparse index pages themselves as a tree:
B+-trees (5th Approach)
Clustered: the dense layer consists of the relation pages themselves, kept in sorted order. Unclustered: the dense layer is a dense index layer on top of unsorted relation pages.
Common properties of both clustered and unclustered:
- d-ary balanced tree (high fan-out d)
- (key, pointer) pairs are kept in each node
- multi-level sparse indices on top of some dense layer.
Clustered B+-tree
The relation pages are not necessarily stored consecutively on disk; that is why they are chained (each page points to the next).
Whereas an unclustered B+-tree has (starting from the bottom) relation pages, a dense index level, then sparse levels.
Lecture 17
Difference between B-trees and B+-trees
A B-tree stores pointers to records in non-leaf nodes too (and does not repeat these search keys in other non-leaf or leaf nodes).
Pro: those records can be accessed with fewer I/Os.
Cons: Storing more data in non-leaf nodes decreases fan-out and increases height, so records in leaves require more I/Os to access. The vast majority of the records live in leaves.
CREATE INDEX EmployeeAge ON Employee(age)
creates an unclustered B+-tree index.
RDBMS either provide:
- No support for clustering records based on the index (Postgres)
- Automatically cluster a table based on its PK index, which is built automatically (SQL Server and MySQL)
Postgres has a CLUSTER command: clusters pages according to an index, but clustering is not maintained upon updates.
Takeaway: No way to create explicit “clustered indices” except in some systems on the PKs of a table.
Every index created with CREATE INDEX
command is unclustered.
Primary keys are clustered, secondary indices are unclustered.
There are two classes of indices overall.
Tree-based: Can do both lookups and range queries.
Examples include B+-trees, B-trees, and radix trees.
Hash-based: Can only do lookups; cannot do range queries. In practice, they must handle collisions.
Query Processing
SELECT R.X
FROM R, S
WHERE R.A = S.B
QP Paradigms:
Interpreted vs Compiled
Materialized vs Pipelined
Materialized: All ops are "blocking": each consumes its whole input and materializes its output to disk. Simple to implement but now obsolete.
Pipelined: When possible, ops take 1 or more tuples at a time, process them, and pass them to parent ops. This is much more efficient (it leverages the memory hierarchy and fast repeated access to the same data), but not always possible: some ops, like sort, are inherently blocking.
Volcano vs Block-at-a-time
Query Processing: Algs
I/O Model of Computation: cost metric is I/O and not runtime.
T(R): # tuples in R.
B(R): # of blocks/pages of R.
t: an individual tuple of R.
M: # of available memory blocks.
Setting is B(R) > M, i.e., tables do not fit in memory.
Criticisms
- Does not capture runtime
- Does not capture the difference between sequential and random I/O (for simplicity)
Full Table Scan
Can be used as the only operator for simple queries.
SELECT R.X
FROM R
WHERE R.A >= 10 or R.X <= 50
I/O cost: B(R).
Index-scan cost depends on many factors; will discuss later. But a lookup on R.A = x
when A
is a key and there is a B+-tree index on R.A
is roughly h I/Os, where h is the height of the tree (e.g., with fan-out around 100 and 10^8 tuples, h is about 4).
Nested Loop Join
One of the most generic join algorithms that can evaluate arbitrary theta joins
for each tuple r in R: for each tuple s in S: check if (r, s) passes the join predicate
R is called the outer table; S is called the inner table.
I/O's: B(R) + T(R) · B(S)
We can improve upon this with a Block-Nested-loop join operator.
Block Nested Loop Join
for each block b_R of R: for each block b_S of S: take the cartesian product of b_R and b_S and check if those tuple pairs pass the predicate
I/O's: B(R) + B(R) · B(S)
Uses 3 blocks of memory (one for R, one for S, one for output) and "wastes" the remaining M - 3 blocks.
Can optimize by scanning many blocks of R and streaming S: read into memory as many blocks of R as possible (M - 2), stream S one block at a time, and join every S tuple with all R tuples in memory.
I/O's: B(R) + ⌈B(R)/(M - 2)⌉ · B(S)
Lecture 18
At a high level, the majority of DBMS operators require sorting or hashing of input tables. A DBMS that has these core operators implemented well has a solid query processing foundation.
Merge Sort Operator
Sort progressively larger runs (2, 4, 8, … blocks) by merging consecutive "runs".
Phase 0: Read M blocks of R at a time, sort them, and write out a level-0 run.
Phase 1: Merge M - 1 level-0 runs at a time, and write out a level-1 run.
Phase 2: Merge M - 1 level-1 runs at a time, and write out a level-2 run.
The final phase produces one sorted run.
I/O Cost Analysis
Phase 0: 2B(R) I/Os (each block is read and written once). There are ⌈B(R)/M⌉ level-0 sorted runs.
Phase i: Merge level-(i-1) runs, M - 1 at a time, and write out a level-i run: 2B(R) I/Os as well.
Total I/O: 2B(R) × the number of phases. The number of phases is ⌈log_{M-1}(⌈B(R)/M⌉)⌉ + 1.
Total I/O: 2B(R) · (⌈log_{M-1}(⌈B(R)/M⌉)⌉ + 1)
As M increases, the cost decreases.
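A minimal sketch of the algorithm, with Python lists standing in for disk-resident blocks and runs (real systems of course read and write actual pages):

```python
import heapq

def external_merge_sort(blocks, M):
    assert M >= 3  # need at least M - 1 = 2 runs per merge to make progress
    # Phase 0: load M blocks at a time, sort, emit a level-0 run.
    runs = []
    for i in range(0, len(blocks), M):
        chunk = [t for b in blocks[i:i + M] for t in b]
        runs.append(sorted(chunk))
    # Phases 1..k: merge M - 1 runs at a time until one run remains.
    while len(runs) > 1:
        next_runs = []
        for i in range(0, len(runs), M - 1):
            group = runs[i:i + M - 1]
            next_runs.append(list(heapq.merge(*group)))
        runs = next_runs
    return runs[0] if runs else []
```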
ORDER BY, set union, difference, intersection, and join can all use sorting.
Sort-merge Join
Sort R and S by their join attributes; then merge. The I/Os depend on how many tuples match. In the common case, the cost is dominated by sorting R and S, but in the worst case (everything joins with everything) the merge degenerates to roughly B(R) · B(S).
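A sketch of the merge step on already-sorted inputs (equality join on positions r_key and s_key); duplicate keys on both sides produce the cross product of the matching groups, which is what causes the worst case above:

```python
def merge_join(R, S, r_key, s_key):
    out, i, j = [], 0, 0
    while i < len(R) and j < len(S):
        if R[i][r_key] < S[j][s_key]:
            i += 1
        elif R[i][r_key] > S[j][s_key]:
            j += 1
        else:
            key = R[i][r_key]
            i2 = i
            while i2 < len(R) and R[i2][r_key] == key:
                i2 += 1                      # group of matching R tuples
            j2 = j
            while j2 < len(S) and S[j2][s_key] == key:
                j2 += 1                      # group of matching S tuples
            for r in R[i:i2]:
                for s in S[j:j2]:
                    out.append(r + s)        # cross product of the groups
            i, j = i2, j2
    return out
```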
Hashing-based Operators
Hashing is a common primitive used in many operators (e.g. DISTINCT
).
Hash Join (for equality joins):
Observation: Let h be a hash function mapping values to {1, …, k}. Then if tuples r and s are such that r.A = s.B, then h(r.A) = h(s.B).
Question
Why is this useful?
We can partition R by hashing its tuples on A into R_1, …, R_k.
We can partition S by hashing its tuples on B into S_1, …, S_k.
For each i, pull R_i and S_i into memory and run an in-memory join algorithm. If we use an in-memory hash join as well, the system will pick the smaller table as the hash build side and the larger one as the probe side.
Analysis: If each pair R_i and S_i fits into memory, the total cost excluding output writing is 3(B(R) + B(S)): one read to partition, one write of the partitions, one read to join.
If the assumption fails and some R_i, S_i do not fit into memory, systems try to repartition with another hash function, or they fail over to another join algorithm.
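A minimal sketch of the partitioning idea (often called Grace hash join), with in-memory lists standing in for partition files on disk:

```python
def hash_join(R, S, r_key, s_key, k):
    h = lambda v: hash(v) % k
    R_parts = [[] for _ in range(k)]
    S_parts = [[] for _ in range(k)]
    for r in R:
        R_parts[h(r[r_key])].append(r)   # partitioning pass over R
    for s in S:
        S_parts[h(s[s_key])].append(s)   # partitioning pass over S
    out = []
    for Ri, Si in zip(R_parts, S_parts): # only matching partitions can join
        if len(Ri) <= len(Si):
            build, probe, bk, pk, r_is_build = Ri, Si, r_key, s_key, True
        else:
            build, probe, bk, pk, r_is_build = Si, Ri, s_key, r_key, False
        table = {}
        for b in build:                  # build on the smaller side
            table.setdefault(b[bk], []).append(b)
        for p in probe:                  # probe with the larger side
            for b in table.get(p[pk], []):
                out.append(b + p if r_is_build else p + b)
    return out
```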
Query optimization: Optimizer is the “brain” of a DBMS
Given a query
SELECT cid
FROM Cust c, Order o, Product p
WHERE c.cid = o.cid AND o.pid = p.pid AND p.name = 'BookA'
Many logical plans correctly evaluate this query.
flowchart TD
a[Project c.cid] --- b[Filter p.name='bookA']
b --- c[Join o.cid = c.cid]
c --- d[Join o.pid = p.pid]
d --- e[Product]
d --- f[Order]
c --- g[Customer]
flowchart TD
a[Project c.cid] --- b[Join o.pid = p.pid]
b --- c[Join o.cid = c.cid]
b --- d[Filter p.name='bookA']
c --- e[Order]
c --- f[Customer]
d --- g[Product]
Each logical plan can be mapped to multiple physical plans. The goal is to pick the best plan.
General optimization paradigms:
- Mix of rule-based/cost-based optimization/transformations
- Rule-based mapping
Rule-based transformations: Rules that transform Exp1 → Exp2 based on relational-algebraic laws that guarantee Exp2 is "equivalent" to Exp1 (i.e., Exp2 always gives the same answers as Exp1, often as a db-agnostic law).
R1:
R2: Merging/splitting consecutive σ's (filters)
R3: Merging/splitting of π's (projections)
R4: Pushing down filters: σ_p(R ⋈ S) ≡ σ_p(R) ⋈ S when p references only attributes of R
Filters cannot always be pushed through every operation.
For example, with a left outer join, a filter on the preserved side R can be pushed down,
but pushing a filter on the null-supplying side S below the join does not preserve equivalence.
Lecture 19
R5: Pushing down projections: π_L(σ_p(R)) ≡ π_L(σ_p(π_{L ∪ L'}(R))),
where L' contains the attributes of R accessed by p that are not already in L.
Example
Cost-based Optimization: 3 components of cost-based optimizations.
- An algorithm to enumerate plans (specifically different join orders)
For example, dynamic programming or greedy based algorithms.
- Cost function: cost(P) takes a plan P and outputs a number as a proxy for how expensive P is
We usually assume that the cost function is additive over the operators of the plan.
- Cardinality estimation function: Given estimated inputs to an operator, estimates the output size of the operator; this is a part of any cost function.
Given each enumerated logical/physical plan P, the cost is the system's estimate of how good/bad P is. We pick the minimum-cost plan.
Naturally, the cost definition is broken into per-operator costs,
i.e., Cost(P) = Σ over operators op in P of cost(op).
Example cost metrics or components:
- # I/Os a plan will make
- # tuples that will pass through operators
- Runtime of the algorithms the plan runs
- Combination of above
For any reasonable metric, we need to estimate cardinality. This is a notoriously difficult problem.
Question
How do we estimate cardinality of tables processed by each op?
We need a cardinality estimation technique.
Given a database and a sub-query Q (a relational algebra expression), what is |Q|, the number of tuples Q outputs?
E.g.:
- |σ_{A=v}(R)|?
- |R ⋈ S|?
Widely adopted join order optimizer:
Recall: first, enumerate a logical plan space.
A widely used optimization algorithm is dynamic programming (cf. CS341).
Restrictions and Assumptions:
Q is a "binary chain query", i.e., it looks like R_1 ⋈ R_2 ⋈ … ⋈ R_n, where each join predicate connects consecutive relations.
Restrict plans to not contain cartesian products.
Thought experiment: Let P_opt be the lowest-cost plan to join R_1, …, R_n.
There are several possible high-level structures P_opt can have: its final join splits the chain at some position i.
Case i (for i = 1, …, n - 1): P_opt combines a plan for R_1 ⋈ … ⋈ R_i with a plan for R_{i+1} ⋈ … ⋈ R_n.
Claim: If P_opt splits at i, then the sub-plan that it uses must be the lowest-cost plan to join R_1, …, R_i, because of the "additive" nature of the cost function we are assuming. If this were not true, we could replace the sub-plan with the optimal plan for R_1 ⋈ … ⋈ R_i and obtain a cheaper plan, which is a contradiction (since P_opt is optimal).
Algorithm outline:
Base case: the optimal plan to join each single R_i is just a scan of R_i.
Inductive step: let P(i, j) be the optimal plan to join the sub-chain R_i ⋈ … ⋈ R_j; compute it by trying every split point k with i ≤ k < j and combining P(i, k) with P(k + 1, j).
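A minimal sketch of this dynamic program for chain queries; `card(i, j)` is a hypothetical cardinality-estimation callback, and the additive cost model is the one assumed above:

```python
from functools import lru_cache

def best_plan(n, card):
    """Optimal join order for the chain R_1 ... R_n; card(i, j) estimates
    the output cardinality of joining the sub-chain R_i..R_j."""
    @lru_cache(maxsize=None)
    def solve(i, j):                 # (cost, plan string) for R_i..R_j
        if i == j:
            return (0, f"R{i}")
        best = None
        for k in range(i, j):        # split: (R_i..R_k) JOIN (R_k+1..R_j);
            lc, lp = solve(i, k)     # only contiguous splits, so no
            rc, rp = solve(k + 1, j) # cartesian products are enumerated
            c = lc + rc + card(i, j) # additive cost: pay output size here
            if best is None or c < best[0]:
                best = (c, f"({lp} JOIN {rp})")
        return best
    return solve(1, n)

# e.g. best_plan(4, lambda i, j: 10 ** (j - i)) returns a cost and a plan tree.
```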
Lecture 20
Example Statistics-based Estimation Techniques
Q: σ_{A=v}(R)
Suppose the following information is available:
Size of R: T(R), and the number of distinct values in column A: V(R, A).
Assumptions:
- Values of A are uniformly distributed over its V(R, A) distinct values
Selectivity factor of A = v is 1/V(R, A), so the estimate is T(R)/V(R, A).
Example:
If T(R) = 1000 and V(R, A) = 50, the estimate is 1000/50 = 20.
Q: σ_{A=v AND B=u}(R)
Additional assumption: A and B are independent.
Reduce the total size by all the selectivity factors: T(R) · (1/V(R, A)) · (1/V(R, B)). Directly derived from standard probability rules (cf. STAT230).
Q: σ_{A≠v}(R)
Selectivity factor of A ≠ v is (1 - selectivity factor of A = v).
Q: σ_{A=v OR B=u}(R). Can we just add the two selectivity factors?
NO! Tuples satisfying both A = v and B = u are counted twice. Use inclusion-exclusion: sf(A=v) + sf(B=u) - sf(A=v) · sf(B=u).
Range predicates, e.g., σ_{A>v}(R):
Case 1: Suppose the DBMS knew the actual values of A. Then range queries are a generalization of equality lookups: sum the selectivities of the qualifying values.
Case 2: We don't know the actual values.
Then there is not enough information; we can just pick a magic constant, e.g., 1/3.
With more information:
- Largest value: max(A)
- Smallest value: min(A)
sf(A > v) ≈ (max(A) - v) / (max(A) - min(A)). In practice, sometimes the second-highest and second-lowest values are used, to be robust to outliers.
Q: |R ⋈_{A=B} S|?
Assumption: Containment of value sets:
Every tuple in the smaller relation (the one with fewer distinct values for the join attribute) joins with a tuple in the other relation.
That is, if V(R, A) ≤ V(S, B), then every A value appears among the B values; the estimate is T(R) · T(S) / max(V(R, A), V(S, B)).
Not true in general, but it holds in the common case of foreign-key joins.
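A minimal sketch collecting these estimation formulas; T, V, and the example numbers are placeholder statistics, not values from the lecture:

```python
def eq_sf(V, attr):                  # selectivity of attr = v (uniformity)
    return 1.0 / V[attr]

def and_sf(sfs):                     # independence: multiply the factors
    p = 1.0
    for s in sfs:
        p *= s
    return p

def or_sf(s1, s2):                   # inclusion-exclusion, not s1 + s2
    return s1 + s2 - s1 * s2

def join_size(T_R, T_S, V_RA, V_SB):
    # containment of value sets: divide by the larger distinct count
    return T_R * T_S / max(V_RA, V_SB)

# e.g. with T(R) = 1000 and V = {"A": 50}, the estimate for A = v is
# 1000 * eq_sf({"A": 50}, "A") = 20, matching the worked example above.
```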
Transactions: Mechanism to obtain two separate guarantees from the DBMS.
- Atomicity and Durability: an all-or-nothing visibility and durability guarantee when users want to modify multiple data elements at the same time.
Example
Bank money transfer from account1 to account2
UPDATE Accounts SET balance=balance-100
WHERE aID = 'acc1'
UPDATE Accounts SET balance=balance+100
WHERE aID = 'acc2'
SELECT * FROM Accounts
WHERE aID='acc1' or aID='acc2'
We want to make sure either both of them happened, or none of them happened.
BEGIN TRANSACTION
UPDATE Accounts SET balance=balance-100
WHERE aID = 'acc1'
UPDATE Accounts SET balance=balance+100
WHERE aID = 'acc2'
COMMIT TRANSACTION
--now an atomic unit
SELECT * FROM Accounts
WHERE aID='acc1' or aID='acc2'
You can need the same guarantee even if you issue a single SQL statement:
UPDATE Customer
SET membership='Gold'
WHERE CID IN (SELECT cID
FROM Orders
WHERE price >= 20)
-- don't want to favour some customers
- Isolation under concurrency: users want two features when developing apps that will run concurrently on the same database. When multiple transactions T_1, …, T_n are running, we want the database state at the end to be "equivalent" to the state as if the T_i ran in some serial order.
- They want “genuine” concurrency. They don’t want other apps blocking them
- They want to write apps “under the illusion” that their app is the only app accessing the database
User 1
UPDATE Accts SET balance=balance-100 WHERE aID = 'acc1'
UPDATE Accts SET balance=balance+100 WHERE aID = 'acc2'
User 2
SELECT *
FROM Accts
WHERE aID = 'acc1' or aID = 'acc2'
aID | bal |
---|---|
acc1 | 1000 |
acc2 | 500 |
or
aID | bal |
---|---|
acc1 | 900 |
acc2 | 600 |
A simple model of transactions.
Simplifying assumptions:
- No failures or aborts
- Just updates (no inserts or deletes)
- No SQL; just simple read and write operations
- Concurrency is achieved by context switching (cf. CS350)
Ingredients:
- DB consists of data items (x, y, …) that are stored on disk (the source of truth)
- 2 operations:
- read(x): copies the disk version of x into a local buffer
- write(x): writes the local copy of x back to disk
- Before a transaction writes x, we assume it performs an arbitrary computation on x before writing (and we cannot predict the written value)
Note: a transaction can write to x without reading it.
read(x) |
x=100 |
read(y) |
x=y-50 |
write(x) |
We will omit the local computations (x = 100 and x = y - 50).
This table becomes: read(x), read(y), write(x).
Execution history of a set of transactions
Example
The above is a possible interleaved ordering of the ops of T_1, …, T_n, without re-ordering any ops within any T_i.
An interleaving that re-orders ops within a transaction is not valid.
Question
Given an execution history H, is it "equivalent" to a serial execution history H_s?
Lecture 21
Possible definitions of equivalence:
Option 1: The values read in H and H_s are guaranteed to be the same, and the order of the writes to any data item in H and H_s is also the same.
Option 2: Same read guarantee, but only the final writes need to be the same.
Option 3: Only the final writes are the same.
We will go with option 1.
Serial history: No interleaving of operations from different transactions
Initially, x and y hold some starting values; T_1 and T_2 each write x then y.

(not serial, since operations interleave between transactions)

T_1 | T_2 |
---|---|
w_1(x=4) | |
| w_2(x=4) |
w_1(y=5) | |
| w_2(y=5) |

(serial)

T_1 | T_2 |
---|---|
w_1(x=4) | |
w_1(y=5) | |
| w_2(x=4) |
| w_2(y=5) |
H is equivalent to H_s if the values read by each T_i in H and H_s are the same, and the order of the write operations on each item is the same.
Conflict: Two operations o_1 and o_2 conflict if
- o_1 and o_2 belong to different transactions
- o_1 and o_2 operate on the same data item
- At least one of them is a write
Therefore there are two types of conflicts: read-write and write-write.
Conflict equivalence: Two histories H_1 and H_2 are "conflict equivalent" if
- They contain the same transactions
- The ordering of each pair of conflicting operations is the same in H_1 and H_2
If a history H is conflict equivalent to some serial history H_s (they operate over the same set of transactions and all conflicting pairs appear in the same order), then H is conflict serializable.
Theorem/Intuition
If H_1 and H_2 are conflict equivalent, then we can take H_1 and, through a sequence of swaps of consecutive non-conflicting operations, make it exactly equal to H_2.
Conflict serializability: An execution history H is conflict serializable if H is conflict equivalent to some serial execution of the same transactions in H.
Question
How to check if an execution history is serializable?
Check whether it is conflict equivalent to some serial execution.
A history that matches no serial order is not conflict serializable. Comparing against every serial order is an easy way to check serializability only if the number of transactions is small (there are n! serial orders).
More practical test of (conflict) serializability: the serialization graph G(H):
- One node per transaction T_i in H
- An edge T_i → T_j whenever an operation of T_i conflicts with a later operation of T_j in H
Intuition: Given H, if H is conflict serializable and the edge T_i → T_j exists in G(H), then in the serial order that H is equivalent to, T_i must execute before T_j.
Theorem
If G(H) is acyclic, then H is (conflict) serializable, because H can be made equivalent to any topological order of G(H).
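A minimal sketch of this test: build the edges from conflicting operation pairs and check for a cycle (here by repeatedly removing nodes with no incoming edges):

```python
def is_conflict_serializable(history):
    """history: list of (txn, op, item) with op in {"r", "w"}, in order."""
    edges = set()
    for i, (t1, op1, x1) in enumerate(history):
        for (t2, op2, x2) in history[i + 1:]:
            if t1 != t2 and x1 == x2 and "w" in (op1, op2):
                edges.add((t1, t2))          # conflict: t1's op precedes t2's
    nodes = {t for t, _, _ in history}
    while nodes:                             # peel off nodes with no incoming edge
        free = [n for n in nodes
                if not any((a, n) in edges for a in nodes if a != n)]
        if not free:
            return False                     # a cycle remains: not serializable
        nodes -= set(free)
    return True

# e.g. [("T1","r","x"), ("T2","w","x"), ("T2","w","y"), ("T1","w","y")]
# yields edges T1 -> T2 and T2 -> T1, a cycle: not conflict serializable.
```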
Question
How does a system ensure that it generates serializable execution histories?
Locking (cf. CS350).
Rule:
- To read a data item x, a transaction must acquire a shared lock on x (lock-S(x))
- To write to x, a transaction must acquire an exclusive lock on x (lock-X(x))

 | S | X |
---|---|---|
S | Y | N |
X | N | N |
Naïve locking: grab a lock on x right before the first operation on x, and release it immediately after the last operation on x.
Fail
This does not work:

T_1 | T_2 |
---|---|
acquire lock(x) | |
unlock(x) | |
| acquire lock(x) |
| unlock(x) |
| acquire lock(y) |
| unlock(y) |
acquire lock(y) | |
unlock(y) | |

Wrapping each operation with a lock does not solve our issue: T_1 accesses x before T_2 but y after T_2, so the resulting history is not serializable.
2PL: All lock requests of a transaction T must happen before all unlock requests of T.
Theorem
Strict 2-Phase Locking guarantees that, if there are no deadlocks, the generated execution history is serializable.
Strict 2-phase locking: all exclusive lock releases happen at commit time.
T_1 | T_2 |
---|---|
lock(x) | |
lock(y) | lock(x) (blocked) |
unlock(x) | lock(x) granted |
unlock(y) | |
| lock(y) |
commit | |
| commit |
T_1 | T_2 |
---|---|
lock(x) | |
| lock(y) |
lock(y) (blocked) | |
| lock(x) (blocked) |

This is a deadlock.
Lecture 22
Guaranteeing serializability has performance overheads.
Question
Why does 2PL lead to serializable execution histories?
Claim: any history H generated under 2PL is conflict equivalent to executing the transactions serially in the order of the lock points of the transactions (the moment each transaction acquires its last lock).
Why? For the operations of T_i before the lock point of T_i, any such operation cannot conflict with ops from a transaction with an earlier lock point, because at that time all of these transactions are still in their locking phases, so they would have blocked T_i's operations.
For operations of T_i in its unlock phase, a similar argument applies: T_i still holds the lock needed for any such operation, so any conflicting operation of another transaction would be blocked until T_i releases it.
Another observation: the claim about 2PL is not that every serializable history can be generated by some execution of 2PL; 2PL may block an interleaving that would nonetheless have been serializable:

T_1 | T_2 |
---|---|
lock-X(x) | |
| lock-X(x) (blocked) |
SQL Transaction isolation levels:
Serializability is the strongest guarantee a DBMS can give. This strength comes at a performance cost.
SQL offers weaker isolation levels for transactions:
SET TRANSACTION ISOLATION LEVEL REPEATABLE READ;
Every BEGIN TRANSACTION command begins at the repeatable read level.
Isolation Levels in SQL Standard |
---|
Read Uncommitted |
Read Committed |
Repeatable Read |
Serializable |
As we go down the list, the isolation guarantees get stronger. As we go up, performance improves.
Isolation levels are set per transaction.
Question
What does it mean for a transaction T to be serializable/read uncommitted, etc.?
Each transaction gets a per-transaction guarantee for the read operations it makes.
For serializability in particular, you can assume the following happened: if T is set to serializable, some set of transactions executed before T (not necessarily in a serializable manner) and left the database in a state D; you can then assume, as a user, that T ran on D completely in isolation.
READ UNCOMMITTED: can read dirty data, i.e., items written by uncommitted transactions.
READ COMMITTED: does not read dirty data, but reads of the same item may not be repeatable (if another transaction commits a change in between, reading the same item again may give a different value).
REPEATABLE READ: no unrepeatable reads, but phantom reads may appear (insertions/deletions committed in between can make the same query return a different set of rows).
SERIALIZABLE: no dirty reads, every read is repeatable, and no scan of any relation can return phantoms.
Atomicity: the "nothing" behaviour if a transaction fails or rolls back.
Consistency: if every transaction is written such that running it on a consistent database D leads to a consistent D', then serializability guarantees that the database remains consistent under concurrency.
Isolation: serializability, to the degree supported by the system.
Durability: the "all" behaviour; if a transaction commits successfully, then all operations in the transaction will be reflected in the database even if the system crashes.
Atomicity and Durability are achieved through logging and handled by the recovery manager.
The DBMS maintains, on disk, both the data pages and a log of records such as:

Log |
---|
<T, start> |
<T, x, prevVal, newVal> |
<T, commit/abort> |

We need a log to bookkeep the changes that transactions are making (we need a mechanism to undo operations). Logging has to be conservative, i.e., write-ahead: log records are written before data items are changed on disk.
Lecture 23
Write-ahead logging: the log record of any update on x must be written to disk before the update to x is written to disk.
Log Records |
---|
<T, start> |
(Update record) <T, x, prevVal, newVal> |
(Compensation record) <T, x, oldVal> |
<T, abort/commit> |
Log |
---|
<T_1, start> |
<T_1, A, 800, 700> |
<T_2, start> |
<T_2, C, 1000, 1200> |
CRASH |

Disk | |
---|---|
A | 700 |
C | 1000 |

Here T_1's update to A reached disk before the crash, but T_2's update to C did not; neither transaction committed.
Undo Algorithm: for any data item x that has been updated by an uncommitted transaction, undo the updates.
Log |
---|
<T, start> |
<T, B, 50, 60> |
<T, commit> |
CRASH |

Disk | |
---|---|
B | 50 |

Here T committed, but its update to B never reached disk; the committed change must be redone from the log.
Force policy: any update happens directly to the disk version of data items.
No-force policy: the DBMS first writes changes to the in-memory copy of pages, and only later to disk.
Let U be the set of uncommitted and aborted transactions.
Scan the log in reverse order from the end:
for each log record rec, in reverse order:
if rec is an update record <T, x, prevVal, newVal> such that T is in U:
- Write a compensation record <T, x, prevVal>
- Update x to prevVal on disk
else if rec is <T, start> and T is in U:
- Write <T, abort> to the log
- Remove T from U
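A minimal sketch of this undo pass, with a dict standing in for the disk pages and tuples for log records; the record shapes follow the formats listed above:

```python
def undo(log, disk):
    """log: list of records, oldest first, e.g. ("start", "T1"),
    ("update", "T1", "A", 800, 700), ("commit", "T1")."""
    committed = {r[1] for r in log if r[0] == "commit"}
    started = {r[1] for r in log if r[0] == "start"}
    U = started - committed                  # uncommitted/aborted transactions
    for rec in reversed(list(log)):          # scan backwards from the end
        if rec[0] == "update" and rec[1] in U:
            _, txn, item, prev_val, _new_val = rec
            log.append(("comp", txn, item, prev_val))  # compensation record
            disk[item] = prev_val                      # restore the old value
        elif rec[0] == "start" and rec[1] in U:
            log.append(("abort", rec[1]))
            U.discard(rec[1])

# e.g. log = [("start","T1"), ("update","T1","A",800,700),
#             ("start","T2"), ("update","T2","C",1000,1200)]
# with disk = {"A": 700, "C": 1000}: undo restores A to 800 and
# appends compensation and abort records, as in the table below.
```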
Log |
---|
<T_1, start> |
<T_1, A, 800, 700> |
<T_2, start> |
<T_2, C, 1000, 1200> |
CRASH |
<T_2, C, 1000> |
<T_2, abort> |
<T_1, A, 800> |
<T_1, abort> |

Disk | |
---|---|
A | 800 |
C | 1000 |
Redo must happen first in the forward direction. Then we do undo in the backward direction.
Checkpointing shortens the amount of log that needs to be undone or redone when a failure occurs.
Courses to take after CS348
- CS448: database implementation (programming heavy)
- CS451: more on data systems
Question
What is RDF?
First, motivation.
Example
Amazon product catalogue
An Apple Watch is a smart watch, a watch, a wearable, and an electronic device.
A normal watch is a wearable and a watch, but not electronic or smart.
We need a more object-oriented/graph-based model. This will include type hierarchies.
RDF Component 1: Resources and Internationalized Resource Identifiers (IRIs)
Resources are objects to represent entities, types and properties.
All objects have an IRI.
RDF Component 3: Triples
Triples are subject-verb-object (subject-predicate-object) sentences.
(Levis511) (is a type of) (LooseJeans) follows the format subject verb object:
<gc:Levis511, rdf:type, gc:LooseJeans>
Everything is a triple in RDF.
We can model irregular properties across resources (good flexibility).
SELECT ?p
WHERE {
?p rdf:type gc:Jeans
}
Inference rule: from (x, rdf:type, C) and (C, rdfs:subClassOf, D), infer (x, rdf:type, D).
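A minimal sketch of triples as Python tuples plus this inference rule; the subclass edge used here is an illustrative assumption about the catalogue:

```python
# Triples mirror the gc: catalogue example above.
triples = {
    ("gc:Levis511", "rdf:type", "gc:LooseJeans"),
    ("gc:LooseJeans", "rdfs:subClassOf", "gc:Jeans"),  # assumed hierarchy edge
}

def infer_types(triples):
    """Apply (x rdf:type C) + (C rdfs:subClassOf D) => (x rdf:type D)
    until no new triples are produced."""
    triples = set(triples)
    changed = True
    while changed:
        changed = False
        new = {(x, "rdf:type", d)
               for (x, p1, c) in triples if p1 == "rdf:type"
               for (c2, p2, d) in triples
               if p2 == "rdfs:subClassOf" and c2 == c}
        if not new <= triples:
            triples |= new
            changed = True
    return triples

# After inference, the query "?p rdf:type gc:Jeans" matches gc:Levis511.
```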
OWL inference example
SELECT ?m
WHERE {
gc:Levis511 md:merchant ?m
}