
I have 2 different datasets that I would like to join, but there is no easy way to do it because they don't have a common column, and crossJoin is not a good solution for big data. I already asked the question on Stack Overflow, but I really couldn't find an optimized solution for joining them. My question on Stack Overflow is: looking if String contain a sub-string in differents Dataframes

I saw the solutions below, but I didn't find a good fit for my case: Efficient string suffix detection, Efficient string suffix detection, Efficient string matching in Apache Spark.

Today I found a funny solution :) I'm not sure whether it will work, but let's try.

I add a new column to df_1 containing the line number.

Example df_1:

name    | id
----------------
abc     | 1232
----------------
azerty  | 87564
----------------
google  | 374856
----------------

new df_1:

name    | id       | new_id
----------------------------
abc     | 1232     |  1
----------------------------
azerty  | 87564    |  2
----------------------------
google  | 374856   |  3
----------------------------
explorer| 84763    |  4
----------------------------

The same for df_2:

Example df_2:

adress    |
-----------
UK        |
-----------
USA       |
-----------
EUROPE    |
-----------

new df_2:

adress    | new_id
-------------------
UK        |   1
-------------------
USA       |   2
-------------------
EUROPE    |   3
-------------------

Now that I have a common column between the two dataframes, I can do a left join using new_id as the key. My questions: is this solution efficient? And how can I add a new_id column with the line number to each dataframe?
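
To make the intent concrete, here is a minimal sketch (Scala) of the join I have in mind once both dataframes carry a new_id column; the tiny inline dataframes are just stand-ins for the real data:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("join-on-new-id").getOrCreate()
import spark.implicits._

// Tiny stand-ins for the real dataframes, already carrying the new_id column.
val df_1 = Seq(
  ("abc", 1232L, 1), ("azerty", 87564L, 2), ("google", 374856L, 3), ("explorer", 84763L, 4)
).toDF("name", "id", "new_id")

val df_2 = Seq(("UK", 1), ("USA", 2), ("EUROPE", 3)).toDF("adress", "new_id")

// Left join on the synthetic key: explorer (new_id = 4) has no match in df_2, so adress is null.
val joined = df_1.join(df_2, Seq("new_id"), "left")
joined.show()
```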

verojoucla
  • For the second question, I guess you are looking for `monotonically_increasing_id()`? – chlebek Dec 06 '19 at 10:59
  • @chlebek Yes, as shown in my example: if I add a new row, new_id should increase by one. – verojoucla Dec 06 '19 at 11:51
  • @chlebek do you have an idea how to do it, please? – verojoucla Dec 06 '19 at 12:02
  • `import org.apache.spark.sql.functions._ val new_df_2 = df_2.withColumn("new_id", monotonically_increasing_id())` – chlebek Dec 06 '19 at 12:43
  • I used your solution and it seems to work well, but I didn't understand why the idx value is different: in df_1, for example, it starts from 0, which is what I need, but in df_2 it starts from 8589934592. – verojoucla Dec 06 '19 at 13:13
  • I forgot that it isn't consecutive; you can try the `row_number` window function: `df_2.withColumn("new_id",row_number().over(Window.partitionBy(lit(1)).orderBy(lit(1))))` – chlebek Dec 06 '19 at 13:37
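
Putting the comment thread together, here is a minimal sketch of the row_number approach chlebek suggests above, plus a zipWithIndex alternative; names like df1WithId and withRowIndex are just illustrative. Both produce consecutive ids, but the pairing between rows of df_1 and df_2 is only meaningful if each dataframe has some stable order (the orderBy(lit(1)) is just there to satisfy the window API):

```scala
import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{lit, row_number}
import org.apache.spark.sql.types.{LongType, StructField, StructType}

// row_number over a constant window yields consecutive ids 1, 2, 3, ...
// Caveat: partitionBy(lit(1)) pulls every row into a single partition,
// so this only suits dataframes that fit comfortably on one executor.
val w = Window.partitionBy(lit(1)).orderBy(lit(1))
val df1WithId = df_1.withColumn("new_id", row_number().over(w))
val df2WithId = df_2.withColumn("new_id", row_number().over(w))
val joined = df1WithId.join(df2WithId, Seq("new_id"), "left")

// A distributed alternative: zip the underlying RDD with its index
// (consecutive, starting from 0, shifted to 1 here) and rebuild the dataframe.
def withRowIndex(df: DataFrame): DataFrame = {
  val schema = StructType(df.schema.fields :+ StructField("new_id", LongType, nullable = false))
  val rows = df.rdd.zipWithIndex.map { case (row, idx) => Row.fromSeq(row.toSeq :+ (idx + 1)) }
  df.sparkSession.createDataFrame(rows, schema)
}
```

The zipWithIndex variant avoids shuffling everything into one partition, which is why it is usually preferable to the single-partition window for larger data.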

1 Answer


Since Spark uses lazy evaluation, execution will not start until an action is triggered. So what you can do is simply call the Spark context's createDataframe function and pass it a list of selected columns from df1 and df2. It will create a new dataframe as you need.

e.g. df3 = spark.createDataframe([df1.select(''),df2.select('')])

Upvote if it works.

Ajinkya Bhore
  • I got this error: AttributeError: 'SparkSession' object has no attribute 'createDataframe' – verojoucla Dec 06 '19 at 11:56
  • The variable spark here is a Spark context object, while in your code it is a Spark session object. You need to use 'spark context object'.createDataframe([df1.select(''),df2.select('')]) – Ajinkya Bhore Dec 06 '19 at 12:20