-4

According to this SO answer, we have:

SELECT DISTINCT s.id
FROM students s
LEFT JOIN grades g ON g.student_id = s.id
WHERE g.student_id IS NOT NULL

Which uses a LEFT JOIN. But it can be implemented with a "semijoin" like this:

SELECT s.id
FROM  students s
WHERE EXISTS (SELECT 1 FROM grades g
              WHERE g.student_id = s.id)

How do you write these two queries in JavaScript, given two collections students and grades? The first one I think is implemented with a "nested loop join" (or could be optimized with an index join data structure, but I don't know how to implement that just yet).

const query1Results = nestedJoin(students, grades, (s, g) => {
  return Boolean(g.student_id && g.student_id === s.id)
}).map(([s]) => s.id)

function nestedJoin(R, S, compare) {
  const out = []
  
  for (const r of R) {
    for (const s of S) {
      if (compare(r, s)) {
        out.push([ r, s ])
      }
    }
  }

  return out
}

How would you do the second one though?

const query2Results = semiJoin(students, grades, (s, g) => {
  return g.student_id === s.id
}).map(([s]) => s.id)

function semiJoin(R, S) {
  const out = []
  
  for (const r of R) {
    for (const s of S) {
      if (compare(r, s)) {
        out.push([ r, s ])
      }
    }
  }

  return out
}

Basically, I implemented it the same way.

const students = [
  { id: 1 },
  { id: 2 },
  { id: 3 },
  { id: 4 },
  { id: 5 },
  { id: 6 },
]
const grades = [
  { id: 1, student_id: 1 },
  { id: 2, student_id: 1 },
  { id: 3, student_id: 1 },
  { id: 4, student_id: 1 },
  { id: 5, student_id: 1 },
  { id: 6, student_id: 1 },
  { id: 10, student_id: 2 },
  { id: 20, student_id: 2 },
  { id: 30, student_id: 2 },
  { id: 40, student_id: 2 },
  { id: 50, student_id: 2 },
  { id: 60, student_id: 2 },
  { id: 100, student_id: 3 },
  { id: 200, student_id: 3 },
  { id: 300, student_id: 3 },
  { id: 400, student_id: 3 },
  { id: 500, student_id: 3 },
  { id: 600, student_id: 3 },
]

const expectedOutput = [
  1, 2, 3
]
Lance
  • 75,200
  • 93
  • 289
  • 503
  • 2
    May not be related to your problem, but `WHERE g.student_id IS NOT NULL` defeats the purpose of the left outer join and makes it essentially an inner join in disguise. – sticky bit Jan 23 '22 at 22:28
  • 1
    please add the ddt to the question. do you have some data as well? what goes wron with your code? – Nina Scholz Jan 23 '22 at 22:29
  • All I would like to know is, what is a semijoin in reality. – Lance Jan 24 '22 at 00:45
  • 3
    "All I would like to know is, what is a semijoin in reality." Unclear, and if so why isn't it the question in your post? "write these two queries in JavaScript" & "implement" & "do the" Unclear. If the queries return the same function of input, why do you need multiple JS versions? Again, what exactly is the 1 (clear, specific researched non-duplicate) question? What characterizes an answer to your post? PS Please clarify via edits, not comments. PS x left join y on c where y.a is not null is x inner join y on c. Left join is unneeded & misleading in the 1st query. PS [mre] PS `@` – philipxy Jan 24 '22 at 23:17

2 Answers2

1

Some preliminary remarks:

  • The first query has a NOT NULL condition in its WHERE clause which you haven't specifically translated in your JS code. In fact, such a condition annihilates the special effect of the outer join, making it act as an inner join. So really the first query is:

    SELECT DISTINCT s.id
    FROM students s
    INNER JOIN grades g ON g.student_id = s.id
    
  • The desired output should really be a 2-dimensional array, because in general an SQL statement can select multiple expressions. So each output row could be an array in itself. In this case, I would expect the output to be [[1], [2], [3]]

  • It is not really possible to have an exact translation of how a query is executed by a database engine, because database engines typically inspect tables and indexes to decide how exactly the query will be executed, and some of this may happen in real time, meaning that the same query may even be executed differently when the number of records has changed significantly in one of the involved tables.

    The way the query will be executed (order of accessing the tables, which join to perform first, which index to use or not use, ...) is called the execution plan. This is also in line with the original spirit of the SQL language: it was supposed to let the user say what they want, and not how to do it.

Having said that, this question made me think of popular libraries that allow one to express queries with some hybrid syntax, where SQL clauses can be chained together via method calls.

So for instance, your two queries could then be written like this:

res = select("s.id")
    .distinct()
    .from(students, "s")
    .from(grades, "g")
    .where("g.student_id", "=", "s.id")
    .exec();

And:

res = select("s.id")
    .from(students, "s")
    .whereExists(
        select(1)
        .from(grades, "g")
        .where("g.student_id", "=", "s.id")
    )
    .exec();

Here is an implementation in JavaScript that allows the above expressions to work. It only supports a tiny subset of SQL, just enough to be able to tackle the constructs used in these two SQL statements:

class Query {
    constructor() {
        this.tables = {};
        this.selections = [];
        this.conditions = [];
        this.exists = [];
        this.isDistinct = false;
    }
    static parseField(field) {
        try { // Try to interpret the argument as a literal value (like 1)
            let value = JSON.parse(field);
            return {value};
        } catch(e) { // Otherwise interpret it as table.field
            let [table, column] = field.split(".");
            return {table, column};
        }
    }
    distinct() {
        this.isDistinct = true;
        return this;
    }
    from(records, alias) {
        this.tables[alias] = records;
        return this;
    }
    where(fieldA, operator, fieldB) {
        let {table: tableA, column: columnA} = Query.parseField(fieldA);
        let {table: tableB, column: columnB} = Query.parseField(fieldB);
        this.conditions.push({tableA, columnA, operator, tableB, columnB});
        return this;
    }
    whereExists(subquery) {
        this.exists.push({subquery, not: false});
        return this;
    }
    whereNotExists(subquery) {
        this.exists.push({subquery, not: true});
        return this;
    }
    exec(parentRecords={}) {
        // Perform cartesian product (by lack of index)
        let max = Object.values(this.tables).reduce((size, {length}) => size * length, 1);
        let results = [];
        for (let i = 0; i < max; i++) {
            let k = i;
            // Select a record from each table
            let records = {...parentRecords}; // Allow access to parent query
            for (let alias in this.tables) {
                let recId = k % this.tables[alias].length;
                k = (k - recId) / this.tables[alias].length;
                records[alias] = this.tables[alias][recId];
            }
            // Apply conditions
            let include = this.conditions.every(({tableA, columnA, operator, tableB, columnB}) => {
                let a = records[tableA][columnA];
                let b = records[tableB][columnB];
                if (operator == "=") return a === b;
                if (operator == "<") return a < b;
                if (operator == ">") return a > b;
                // ...etc
            }) && this.exists.every(({subquery, not}) => { // Where (NOT) EXIST
                return not === !subquery.exec(records).length;
            });
            if (include) {
                // Apply SELECT
                results.push(this.selections.map(({value, table, column}) => value ?? records[table][column]));
            }
        }
        // Apply DISTINCT
        if (this.isDistinct) {
            results = [...new Map(results.map(result => [JSON.stringify(result), result])).values()];
        }
        return results;
    }
}

function select(...fields) {
    let query = new Query;
    query.selections = fields.map(Query.parseField);
    return query;
}

// Example run

const students = [
  { id: 1 },
  { id: 2 },
  { id: 3 },
  { id: 4 },
  { id: 5 },
  { id: 6 },
]
const grades = [
  { id: 1, student_id: 1 },
  { id: 2, student_id: 1 },
  { id: 3, student_id: 1 },
  { id: 4, student_id: 1 },
  { id: 5, student_id: 1 },
  { id: 6, student_id: 1 },
  { id: 10, student_id: 2 },
  { id: 20, student_id: 2 },
  { id: 30, student_id: 2 },
  { id: 40, student_id: 2 },
  { id: 50, student_id: 2 },
  { id: 60, student_id: 2 },
  { id: 100, student_id: 3 },
  { id: 200, student_id: 3 },
  { id: 300, student_id: 3 },
  { id: 400, student_id: 3 },
  { id: 500, student_id: 3 },
  { id: 600, student_id: 3 },
]


let res1 = select("s.id")
        .distinct()
        .from(students, "s")
        .from(grades, "g")
        .where("g.student_id", "=", "s.id")
        .exec();

console.log(res1);

let res2 = select("s.id")
        .from(students, "s")
        .whereExists(
            select(1)
            .from(grades, "g")
            .where("g.student_id", "=", "s.id")
        )
        .exec();

console.log(res2);
trincot
  • 317,000
  • 35
  • 244
  • 286
  • So cool! How about, where does semijoin fit in? I was asking this was to figure out what semijoin is, since it is "necessary" in distributed databases. Was trying to see what it looked like in JavaScript. What you did is helpful showing the different ways to select, does it relate to semijoin? – Lance Jan 27 '22 at 01:08
  • 1
    The characteristics of semijoin are used here: it involves another query, that is executed with a given context (`parentRecords`), but does not contribute fields to be selected in the final query. For example, in the first query, you could have included `grades.id` in the `select` clause, which is not possible in the second one (hence: it's "only" a "semi" join). – trincot Jan 27 '22 at 06:58
-1

From the U-SQL definition for semijoin...

Semijoins are U-SQL’s way [to] filter a rowset based on the inclusion of its rows in another rowset

This really does depend on your table design and indexes but assuming you have one on grades.student_id, the JS equivalent might look like

const students = [{"id":1},{"id":2},{"id":3},{"id":4},{"id":5},{"id":6}]
const grades = [{"id":1,"student_id":1},{"id":2,"student_id":1},{"id":3,"student_id":1},{"id":4,"student_id":1},{"id":5,"student_id":1},{"id":6,"student_id":1},{"id":10,"student_id":2},{"id":20,"student_id":2},{"id":30,"student_id":2},{"id":40,"student_id":2},{"id":50,"student_id":2},{"id":60,"student_id":2},{"id":100,"student_id":3},{"id":200,"student_id":3},{"id":300,"student_id":3},{"id":400,"student_id":3},{"id":500,"student_id":3},{"id":600,"student_id":3}]

// index
const idxGradesStudentId = new Set(grades.map(({ student_id }) => student_id))

function semiJoin(R, S) {
  // WHERE EXISTS
  return R.filter(({ id }) => idxGradesStudentId.has(id))
    .map(({ id }) => id) // SELECT
}

console.log(semiJoin(students, grades))

So rather than a nested loop, this uses the index on grades.student_id (represented by a Set) to evaluate the EXISTS clause (Set.prototype.has()) which has the benefit of O(1) lookup time complexity.


Without indexes, you can still perform an exists operation (via Array.prototype.some()) but it will obviously be less performant.

function semiJoin(R, S) {
  return R.filter(({ id }) =>
    S.some(({ student_id }) => student_id === id)
  ).map(({ id }) => id)
}
Phil
  • 157,677
  • 23
  • 242
  • 245
  • How does it differ from the nestedJoin, other than the fact that you used an index (I could have used an index and that would be called an "index join" perhaps, not sure). – Lance Jan 23 '22 at 22:54
  • Do you know how to answer [that index join q](https://stackoverflow.com/questions/70820053/how-to-implement-an-index-join-data-structure-in-javascript)? – Lance Jan 23 '22 at 22:56
  • 1
    It differs by only iterating the `R` collection. There's no nested looping – Phil Jan 23 '22 at 23:00
  • @Lance can you tell me how this answer does not satisfy this statement... _"Semijoins are [a] way filter a rowset based on the inclusion of its rows in another rowset"_ – Phil Jan 24 '22 at 23:24
  • @Lance I've added an example without an index – Phil Jan 25 '22 at 03:38