I am learning about join algorithms in regards to relational data query processing. The simple case is the nested loop join:
function nestedJoin(R, S, compare) {
const out = []
for (const r of R) {
for (const s of S) {
if (compare(r, s)) {
out.push([ r, s ])
}
}
}
return out
}
Where compare
would compare the join attribute.
The case I'm wondering about is the index join. Copying from that cheat sheet into JS sort of, we have:
function indexJoin(R, S) {
const out = []
for (const r of R) {
const X = findInIndex(S.C, r.c)
for (const s of X) {
out.push([ r, s ])
}
}
return out
}
But what is that findInIndex(S.C, r.c)
? What is being passed into it (S.C
)? And how does it work?
The join indices paper says this:
With no indices, two basic algorithms based on sorting [4] and hashing [5, lo] avoid the prohibitive cost of the nested loop method.
A join index is a binary relation. It only contains pairs of surrogates which makes ( it small. However, for generality, we assume that it does not always fit in RAM. Therefore, a join index must be clustered. Since we may need fast access to JI tuples via either r values or s values depending on whether there are selects on relations R or S, a JI should be clustered on (r, s). A simple and uniform solution is to maintain two copies of the JI, one clustered on r and the other clustered on s. Each copy is implemented by a W-tree an efficient variation of the versatile B-tree [l, 71.
So if it were a B+tree, what would the key and value be used in each B+tree, and how would you use the keys and values (in what order do you plugin keys and get values)? Also, cannot the "join index" just be implemented something like this if it were in JavaScript?
const joinIndex = {}
function join(r, s) {
const rest = joinIndex[r.c] = joinIndex[r.c] ?? {}
rest[s.c] = true
}
function findInIndex(leftKey) {
return Object.keys(joinIndex[leftKey])
}
Please show how the join algorithm would be implemented, either with my approach or the B+tree approach. If it is the B+tree approach, you don't need to implement a B+tree, but just explain how you would plug things in/out of the 2 B+trees to explain the algorithm with more clarity.