use corelated subquery in pyspark sql

Asked Sep 04 '19 at 17:56

Active Sep 04 '19 at 18:41

Viewed 199 times

Tab1 Columns [F,S,E]

F1 S1 R
F1 S2 R2
F1 S3 R1
F2 S1 R2
F2 S4 R4
F1 S4 R

Tab2 Columns [F,S]

F1 S1
F1 S3
F2 S1
F2 S4

TAKE ROWS FROM TAB1 FOR ONLY IF F->S RELATION IS PRESENT IN Tab2

RESULT Columns [F,S,E]

F1 S1 R
F1 S3 R
F2 S4 R4

I have the query now, but am not able to get results with pyspark.I am able to run on MySql db.

I tried to use corelated subquery in spark 2.4.3, but this returns 0 rows.

Tab1.createOrReplaceTempView("Tab1")

Tab2.createOrReplaceTempView("Tab2")

joined_df = spark.sql(
    """SELECT F, S, E FROM Tab1 
       WHERE EXISTS (SELECT * FROM Tab2 WHERE Tab1.F=Tab2.F AND Tab1.S=Tab2.S)"""
)

joined_df.show(10)

edited Sep 04 '19 at 18:41

pault

41,343
15
107
149

asked Sep 04 '19 at 17:56

saahil shah

2

Possible duplicate of [Spark replacement for EXISTS and IN](https://stackoverflow.com/questions/34861516/spark-replacement-for-exists-and-in) – pault Sep 04 '19 at 18:42
why not F1 S3 R1. The second row in your output. How come it's F1 S3 R. ? – vikrant rana Sep 05 '19 at 07:45

use corelated subquery in pyspark sql

0 Answers0