ensure what you write is correct and PRECISE
"...which is different from most modern commercial database." "A column may be in many projections. So it has K -safety property. "
The bag of tricks
Row vs Col page structure
Row pages (NSM, N-ary storage model): row data fills the page from the top down, pointers to the rows fill it from the bottom up
[[header] [row 1], [row 2], ... [row 2 ptr] [row 1 ptr]]
Col pages (DSM, decomposition storage model)
[[header] [rid1, val1], [rid2, val2], ... ]   // sub-relation for column 1
PAX: hybrid. Each page contains K rows but is laid out internally column-by-column (better for the CPU cache)
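To make the three layouts concrete, here is a minimal sketch in C for fixed-size rows of an assumed (id, price, qty) schema; the struct names, field types, and ROWS_PER_PAGE are illustrative, not any system's actual page format:

/* Illustrative page layouts; not C-Store's or PAX's actual on-disk format. */
#include <stdint.h>

#define ROWS_PER_PAGE 64

/* NSM (row store): whole rows stored contiguously. */
struct nsm_page {
    uint16_t nrows;
    struct { int32_t id; double price; int32_t qty; } rows[ROWS_PER_PAGE];
};

/* DSM (column store): each page holds (record id, value) pairs for ONE column. */
struct dsm_price_page {
    uint16_t nrows;
    struct { int32_t rid; double price; } entries[ROWS_PER_PAGE];
};

/* PAX: one page still holds K complete rows, but grouped column-by-column
 * ("minipages"), so scanning one column touches contiguous memory. */
struct pax_page {
    uint16_t nrows;
    int32_t id[ROWS_PER_PAGE];
    double  price[ROWS_PER_PAGE];
    int32_t qty[ROWS_PER_PAGE];
};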
10k machines * 100 Hz = 1,000,000 readings/sec
  * # sensors * # data centers
bulk writes, few updates, read-mostly
unnormalized retailer schema
store_id, store_name, store_owner, store_addr1, ...,
emp_id, emp_name, emp_age, emp_salary, ...,
prod_id, prod_name, prod_price, ...,
...
normalized “star” schema
fact(store_id, emp_id, prod_id, cust_id, ...)
store(store_id, name, owner, addr1, ...)
...
Projection: multiple columns,
sorted on a subset of those columns (the sort key). A row's position in the sort order is its storage key.
EMP1(name, age | age) # sorted on age
The columns for EMP1:
(name, storage_key)
(age, storage_key)
// the storage key is not actually stored; it is just the value's position within the column
// used for joining between projections
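A tiny sketch of the idea, with made-up data: the projection is a set of parallel arrays sorted on the sort key, and the storage key is simply the array index.

#include <stdio.h>

int main(void) {
    /* EMP1(name, age | age): parallel arrays sorted on age. */
    const char *name[] = {"bob", "ann", "joe"};   /* name column */
    int         age[]  = {22, 31, 45};            /* age column (sort key) */

    /* The storage key is never materialized; reconstructing the row with
     * storage key 1 just means reading index 1 in every column. */
    int sk = 1;
    printf("sk=%d -> (%s, %d)\n", sk, name[sk], age[sk]);
    return 0;
}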
Join Indexes match the same records across projections
Suppose I have the following query and projections
SELECT a, b FROM data

C1(a | a)   # M partitions
C2(b | b)   # N partitions
C1_i is the i'th partition of C1
J_i(sid, storage_key)
  * has the same number of rows as C1_i
  * sid, storage_key of the matching row in C2
Join indexes are expensive to maintain under updates! Try to avoid needing them by having each column appear in many projections (a reconstruction sketch follows below).
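As a rough illustration (not C-Store's actual code), here is reconstruction of (a, b) rows from the two projections above using a join index; a single segment is assumed, so the segment id is omitted and only the storage key is kept, and all data is made up.

#include <stdio.h>

int main(void) {
    /* Underlying rows: (1,40), (2,10), (3,30), (4,20). */
    int a[] = {1, 2, 3, 4};          /* C1(a|a): sorted on a */
    int b[] = {10, 20, 30, 40};      /* C2(b|b): sorted on b */
    int J[] = {3, 0, 2, 1};          /* J[i] = storage key in C2 of C1's row i */
    int n = 4;

    for (int i = 0; i < n; i++)
        printf("(%d, %d)\n", a[i], b[J[i]]);   /* emits the original rows */
    return 0;
}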
few distinct values, sorted on this col:
  (val, start_idx, end_idx)   // one run-length triple per run
few distinct values, not sorted on this col:
  (val, bitmap)  e.g., (99, 0111100010100)
many distinct values, sorted: delta-encode, e.g., 10, 12, 13, 14 --> 10, 2, 1, 1
  reduces bits per value, then compress again
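A minimal sketch of two of these encodings (run-length triples and delta encoding); struct names and data are made up for illustration.

#include <stdio.h>

/* Run-length triple for a sorted column with few distinct values. */
struct rle { int val, start_idx, end_idx; };

int main(void) {
    /* Column 7 7 7 9 9 encodes as two runs. */
    struct rle runs[] = { {7, 0, 2}, {9, 3, 4} };
    (void)runs;

    /* Delta encoding for a sorted column with many distinct values:
     * store the first value, then differences from the previous value. */
    int col[] = {10, 12, 13, 14};
    int delta[4];
    delta[0] = col[0];
    for (int i = 1; i < 4; i++)
        delta[i] = col[i] - col[i - 1];

    for (int i = 0; i < 4; i++)
        printf("%d ", delta[i]);     /* prints: 10 2 1 1 */
    printf("\n");
    return 0;
}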
The write store (WS) is also column-oriented: tries to avoid building two different engines.
two indexes: one on the storage key (sk), the other on the sort key
EMP(age, name | age):
  btree index on sk  --> age
  btree index on sk  --> name
  btree index on age --> smallest sk of the run
BerkeleyDB + lots of memory
Isn’t storing all these projections blowing up disk costs by several X?
select avg(price) from data where symbol = 'GM' AND date = xxx
Row store: read and send entire tuples throughout the plan
AVG price
    |
SELECT date = xxx
    |
SELECT sym = 'GM'
    |
(GM, 1, ..., xx, ...)
    |
[Symbol, price, nshares, exchange, date, ...]
 [GM    1  ...  xx]
 [GM    2  ...  xx]
 [GM    3  ...  xx]
 [AAPL  4  ...  xx]
 ...
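A tiny sketch of this tuple-at-a-time style in C, with a made-up schema and data: every operator sees whole tuples even though only symbol, date, and price matter.

#include <stdio.h>
#include <string.h>

struct tuple { char symbol[8]; double price; int nshares; char exchange[8]; int date; };

int main(void) {
    struct tuple data[] = {
        {"GM",   1, 100, "NYSE", 20240102},
        {"GM",   2, 200, "NYSE", 20240102},
        {"GM",   3, 300, "NYSE", 20240102},
        {"AAPL", 4, 400, "NASD", 20240102},
    };
    int n = 4, cnt = 0;
    double sum = 0;

    for (int i = 0; i < n; i++) {                        /* scan whole tuples */
        if (strcmp(data[i].symbol, "GM") != 0) continue; /* SELECT sym = 'GM' */
        if (data[i].date != 20240102) continue;          /* SELECT date = xxx */
        sum += data[i].price;                            /* AVG price */
        cnt++;
    }
    printf("avg = %f\n", cnt ? sum / cnt : 0.0);
    return 0;
}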
Columnar using row store
due to per-record overhead and the cost of constructing tuples, this can be slower than a vanilla row store
Avg price
    |
SELECT date = xx
    |
SELECT sym = GM
    |
(GM, 1, xx)
    |
Construct (Symbol, Price, Date)
   /        |        \
[Symbol] [price] [nshares] [exchange] [date] ...
 GM       1                            xx
 GM       2                            xx
 GM       3                            xx
 AAPL     4                            xx
CStore w/ late materialization
send compressed bitstrings through query plan
Avg price
    |
(1, 2, 3)
    |
Lookup price
    |
(1,1,1,0)
    |
          AND
         /    \
(1,1,1,0)      (1,1,1,1)
    |               |
SELECT sym = GM   SELECT date = xx
    |               |
[Symbol] [price] [nshares] [exchange] [date] ...
 GM       1                            xx
 GM       2                            xx
 GM       3                            xx
 AAPL     4                            xx
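A minimal sketch of late materialization, assuming plain byte-per-position bitmaps and made-up data: each predicate runs over its own column, the bitmaps are ANDed, and price values are fetched only for the surviving positions.

#include <stdio.h>
#include <string.h>

int main(void) {
    const char *symbol[] = {"GM", "GM", "GM", "AAPL"};
    double      price[]  = {1, 2, 3, 4};
    int         date[]   = {20240102, 20240102, 20240102, 20240102};
    int n = 4;

    unsigned char sym_bm[4], date_bm[4], both[4];
    for (int i = 0; i < n; i++) sym_bm[i]  = strcmp(symbol[i], "GM") == 0;  /* (1,1,1,0) */
    for (int i = 0; i < n; i++) date_bm[i] = date[i] == 20240102;           /* (1,1,1,1) */
    for (int i = 0; i < n; i++) both[i]    = sym_bm[i] & date_bm[i];        /* AND */

    double sum = 0; int cnt = 0;
    for (int i = 0; i < n; i++)          /* lookup price only for matches */
        if (both[i]) { sum += price[i]; cnt++; }

    printf("avg = %f\n", cnt ? sum / cnt : 0.0);
    return 0;
}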
CStore w/ compression
Avg price
    |
(1, +1, +1)
    |
Lookup price
    |
(3x1, 1x0)
    |
          AND
         /    \
(3x1, 1x0)     (4x1)
    |               |
SELECT sym = GM   SELECT date = xx
    |               |
[Symbol] [price] [nshares] [exchange] [date] ...
 3xGM     1                            4x"xx"
 AAPL    +1
         +1
         +1
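As an illustration of operating directly on compressed data (not C-Store's actual operators), here is a sketch that ANDs two run-length-encoded bitmaps without expanding them.

#include <stdio.h>

struct run { unsigned char bit; int len; };   /* e.g. {1,3} means "1 1 1" */

/* AND two RLE bitmaps, emitting an RLE result; returns number of output runs. */
int rle_and(const struct run *a, int na, const struct run *b, int nb, struct run *out) {
    int i = 0, j = 0, ai = a[0].len, bj = b[0].len, no = 0;
    while (i < na && j < nb) {
        int take = ai < bj ? ai : bj;            /* overlap of the current runs */
        unsigned char bit = a[i].bit & b[j].bit;
        if (no > 0 && out[no - 1].bit == bit)
            out[no - 1].len += take;             /* extend previous output run */
        else
            out[no++] = (struct run){bit, take};
        ai -= take; bj -= take;
        if (ai == 0 && ++i < na) ai = a[i].len;
        if (bj == 0 && ++j < nb) bj = b[j].len;
    }
    return no;
}

int main(void) {
    struct run sym[]  = {{1, 3}, {0, 1}};        /* 3x1, 1x0  (sym = GM)  */
    struct run date[] = {{1, 4}};                /* 4x1       (date = xx) */
    struct run both[8];
    int n = rle_and(sym, 2, date, 1, both);
    for (int k = 0; k < n; k++)
        printf("%dx%d ", both[k].len, both[k].bit);   /* prints: 3x1 1x0 */
    printf("\n");
    return 0;
}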
~100x faster than a traditional row-store RDBMS on these workloads; now everyone uses column stores.
MonetDB and MonetDB/X100: two awesome systems. The X100 CIDR paper is great.
BAT (Binary Association Table) column-at-a-time physical algebra
sum(c * (1 - d)),  sum(c * (1 - d) * e)

tmp1 = -(1, d)
tmp2 = *(c, tmp1)
tmp3 = sum(tmp2)
tmp4 = *(tmp2, e)
tmp5 = sum(tmp4)
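A C sketch of this column-at-a-time style, assuming full-column primitives that each materialize a temporary array, mirroring the tmp1..tmp5 plan above; data and N are made up.

#include <stdio.h>

#define N 4

static void sub_const(double out[], double k, const double in[]) {
    for (int i = 0; i < N; i++) out[i] = k - in[i];
}
static void mul(double out[], const double a[], const double b[]) {
    for (int i = 0; i < N; i++) out[i] = a[i] * b[i];
}
static double sum(const double a[]) {
    double s = 0;
    for (int i = 0; i < N; i++) s += a[i];
    return s;
}

int main(void) {
    double c[N] = {10, 20, 30, 40};
    double d[N] = {0.1, 0.2, 0.3, 0.4};
    double e[N] = {1, 2, 3, 4};
    double tmp1[N], tmp2[N], tmp4[N];

    sub_const(tmp1, 1.0, d);        /* tmp1 = 1 - d            */
    mul(tmp2, c, tmp1);             /* tmp2 = c * (1 - d)      */
    double tmp3 = sum(tmp2);        /* tmp3 = sum(c * (1 - d)) */
    mul(tmp4, tmp2, e);             /* tmp4 = tmp2 * e         */
    double tmp5 = sum(tmp4);        /* tmp5 = sum(tmp2 * e)    */

    printf("%f %f\n", tmp3, tmp5);
    return 0;
}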
Consider the implementation of a less-than selection: the branch-free version below avoids branch mispredictions when the matching positions are unpredictable.
/* Select positions where col[i] < val; returns number of matches. */
int branching_lt(int n, int *res, const int *col, int val) {
    int k = 0;
    for (int i = 0; i < n; i++)
        if (col[i] < val)
            res[k++] = i;
    return k;
}

/* Branch-free version: always write the candidate position, then advance
 * the output cursor only when the predicate holds. */
int nonbranching_lt(int n, int *res, const int *col, int val) {
    int k = 0;
    for (int i = 0; i < n; i++) {
        res[k] = i;
        k += (col[i] < val);
    }
    return k;
}