Spark replacement for EXISTS and IN

Question

I am trying to run a query that uses the EXIST clause:

select <...>    
  from A, B, C
where
  A.FK_1 = B.PK and
  A.FK_2 = C.PK and
  exists (select A.ID from <subquery 1>) or 
  exists (select A.ID from <subquery 2>)

Unfortunately, this does not seem to be supported. I have also tried replacing the EXISTS clause with an IN clause:

select <...>    
  from A, B, C
where
  A.FK_1 = B.PK and
  A.FK_2 = C.PK and
  A.ID in (select ID from ...) or
  A.ID in (select ID from ...)

Unfortunately, also the IN clause seems to be unsupported.

Any ideas of how I can write a SQL query that achieves the desired result? I could model in principle the WHERE clause as another JOIN and the second OR clause as an UNION but it seems super clumsy..

EDIT: Listing a number of possible solutions.

Solution 1

select <...>    
  from A, B, C
       (select ID from ...) as exist_clause_1,
       (select ID from ...) as exist_clause_2,
where
  A.FK_1 = B.PK and
  A.FK_2 = C.PK and
  A.ID = exist_clause_1.ID or
  A.ID = exist_clause_2.ID

Solution 2

select <...>    
  from A, B, C
       ( (select ID from ...) UNION
         (select ID from ...)
        ) as exist_clause,
where
  A.FK_1 = B.PK and
  A.FK_2 = C.PK and
  A.ID = exist_clause.ID

Shouldn't your very first query template have EXISTS calls of the form `exists (select E.ID from E where E.ID = A.ID))`? — philipxy, Jan 20 '16 at 02:17

score 18 · Accepted Answer · edited May 23 '17 at 10:31

SparkSQL doesn't currently have EXISTS & IN. "(Latest) Spark SQL / DataFrames and Datasets Guide / Supported Hive Features"

EXISTS & IN can always be rewritten using JOIN or LEFT SEMI JOIN. "Although Apache Spark SQL currently does not support IN or EXISTS subqueries, you can efficiently implement the semantics by rewriting queries to use LEFT SEMI JOIN." OR can always be rewritten using UNION. AND NOT can be rewritten using EXCEPT.

A table holds the rows that make some predicate (statement parameterized by column names) true:

The DBA gives the predicates for each base table T with columns T.C,...: T(T.C,...)
A JOIN holds the rows that make the AND of its arguments' predicates true; for a UNION, the OR; for an EXCEPT, the AND NOT.
SELECT DISTINCTkept columnsFROMT holds the rows where EXISTS dropped columns [predicate of T].
TLEFT SEMI JOINU holds the rows where EXISTS U-only columns [predicate of T AND predicate of U].
TWHEREcondition holds the rows where predicate of T AND condition.

(Re querying generally see this answer.)

So by keeping in mind predicate expressions corresponding to SQL you can use straightforward logic rewrite rules to compose and/or reorganize queries. Eg using UNION here need not be "clumsy" either in terms of readability or execution.

Your original question indicated that you understood that you could use UNION and you have edited variants into your question that excise EXISTS and IN from your original queries. Here is another variant also excising OR.

    select <...>    
    from A, B, C, (select ID from ...) as e
    where
      A.FK_1 = B.PK and
      A.FK_2 = C.PK and
      A.ID = e.id
union
    select <...>    
    from A, B, C, (select ID from ...) as e
    where
      A.FK_1 = B.PK and
      A.FK_2 = C.PK and
      A.ID = e.ID

Your Solution 1 does not do what you think it does. If just one of the exists_clause tables are empty, ie even if there are ID matches available in the other, the FROM cross product of tables is empty and no rows are returned. ("An Unintuitive Consequence of SQL Semantics": Chapter 6 The Database Language SQL sidebar page 264 of Database Systems: The Complete Book 2nd Edition.) A FROM is not just introducing names for rows of tables, it is CROSS JOINing and/or OUTER JOINing them after which ON (for INNER JOINs) and WHERE filter some out.

Performance is typically different for different expressions returning the same rows. This depends on DBMS optimization. Many details, which the DBMS and/or programmer may be able to know and if so may or may not know and may or may not best balance, affect the best way to evaluate a query and the best way to write it. But executing two ORed subselects per row in a WHERE (as in your original queries but also your late Solution 2) is not necessarily better than running one UNION of two SELECTs (as in my query).

Thank you for the reply! I ended up using subqueryies for each of the select statements and performing one gigantic join between all the base relations and the relations computed by the subqueries. I think your solution is somewhat different although I dont fully understand it. Could you sketch a query template to be more precise? (I'm editing the question to add my current solution) — Radu, Jan 19 '16 at 12:38
Also, you mention that the performance will typically be different. Could you give a hint why would that be the case? — Radu, Jan 19 '16 at 12:47
I have updated my answer to address your comments. Re performance, you should read about relational query optimization, which really just means relational query implementation. There are numerous online general and product-specific books, google 'sql performance' etc. — philipxy, Jan 20 '16 at 05:49

score 0 · Answer 2 · answered Mar 18 '23 at 15:39

Spark now supports exists. Here's a link to the relevant section in the Spark 3.3.2 docs. https://spark.apache.org/docs/3.3.2/sql-ref-null-semantics.html#existsnot-exists-subquery-

-- Even if subquery produces rows with `NULL` values, the `EXISTS` expression
-- evaluates to `TRUE` as the subquery produces 1 row.
SELECT * FROM person WHERE EXISTS (SELECT null);
+--------+----+
|    name| age|
+--------+----+
|  Albert|null|
|Michelle|  30|
|    Fred|  50|
|    Mike|  18|
|     Dan|  50|
|   Marry|null|
|     Joe|  30|
+--------+----+

-- `NOT EXISTS` expression returns `FALSE`. It returns `TRUE` only when
-- subquery produces no rows. In this case, it returns 1 row.
SELECT * FROM person WHERE NOT EXISTS (SELECT null);
+----+---+
|name|age|
+----+---+
+----+---+

-- `NOT EXISTS` expression returns `TRUE`.
SELECT * FROM person WHERE NOT EXISTS (SELECT 1 WHERE 1 = 0);
+--------+----+
|    name| age|
+--------+----+
|  Albert|null|
|Michelle|  30|
|    Fred|  50|
|    Mike|  18|
|     Dan|  50|
|   Marry|null|
|     Joe|  30|
+--------+----+

Spark replacement for EXISTS and IN

2 Answers2

Linked