
I have 2 different datasets that I would like to join, but there is no easy way to do it because they don't have a common column, and crossJoin is not a good solution for big data. I already asked the question on Stack Overflow, but I really couldn't find an optimized solution for joining them. My question on Stack Overflow is: looking if String contain a sub-string in differents Dataframes

I saw the solutions below, but I didn't find a good fit for my case: Efficient string suffix detection, Efficient string suffix detection, Efficient string matching in Apache Spark.

Today I found a funny solution :) I'm not sure whether it will work, but let's try.

I add a new column to df_1 containing the line number.

Example df_1:

name    | id
----------------
abc     | 1232
----------------
azerty  | 87564
----------------
google  | 374856
----------------

new df_1:

name    | id       | new_id
----------------------------
abc     | 1232     |  1
----------------------------
azerty  | 87564    |  2
----------------------------
google  | 374856   |  3
----------------------------
explorer| 84763    |  4
----------------------------

The same for df_2:

Example df_2:

adress    |
-----------
UK        |
-----------
USA       |
-----------
EUROPE    |
-----------

new df_2:

adress    | new_id
-------------------
UK        |   1
-------------------
USA       |   2
-------------------
EUROPE    |   3
-------------------

Now that I have a common column between the two dataframes, I can do a left join using new_id as the key. My questions: is this solution efficient? And how can I add a new_id column with the line number to each dataframe?
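
To make the intent concrete, here is a minimal sketch (Scala) of the join I have in mind once both dataframes carry a new_id column; the tiny inline dataframes are just stand-ins for the real data:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("join-on-new-id").getOrCreate()
import spark.implicits._

// Tiny stand-ins for the real dataframes, already carrying the new_id column.
val df_1 = Seq(
  ("abc", 1232L, 1), ("azerty", 87564L, 2), ("google", 374856L, 3), ("explorer", 84763L, 4)
).toDF("name", "id", "new_id")

val df_2 = Seq(("UK", 1), ("USA", 2), ("EUROPE", 3)).toDF("adress", "new_id")

// Left join on the synthetic key: explorer (new_id = 4) has no match in df_2, so adress is null.
val joined = df_1.join(df_2, Seq("new_id"), "left")
joined.show()
```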

verojoucla
  • For the second question, I guess you are looking for `monotonically_increasing_id()`? – chlebek Dec 06 '19 at 10:59
  • @chlebek Yes, as shown in my example: if I add a new row, new_id should increase by one. – verojoucla Dec 06 '19 at 11:51
  • @chlebek do you have an idea how to do it, please? – verojoucla Dec 06 '19 at 12:02
  • `import org.apache.spark.sql.functions._ val new_df_2 = df_2.withColumn("new_id", monotonically_increasing_id())` – chlebek Dec 06 '19 at 12:43
  • I used your solution and it seems to work well, but I didn't understand why the idx value is different: in df_1, for example, it starts from 0, which is what I need, but in df_2 it starts from 8589934592. – verojoucla Dec 06 '19 at 13:13
  • I forgot that it isn't consecutive; you can try the `row_number` window function: `df_2.withColumn("new_id",row_number().over(Window.partitionBy(lit(1)).orderBy(lit(1))))` – chlebek Dec 06 '19 at 13:37
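
Putting the comment thread together, here is a minimal sketch of the row_number approach chlebek suggests above, plus a zipWithIndex alternative; names like df1WithId and withRowIndex are just illustrative. Both produce consecutive ids, but the pairing between rows of df_1 and df_2 is only meaningful if each dataframe has some stable order (the orderBy(lit(1)) is just there to satisfy the window API):

```scala
import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{lit, row_number}
import org.apache.spark.sql.types.{LongType, StructField, StructType}

// row_number over a constant window yields consecutive ids 1, 2, 3, ...
// Caveat: partitionBy(lit(1)) pulls every row into a single partition,
// so this only suits dataframes that fit comfortably on one executor.
val w = Window.partitionBy(lit(1)).orderBy(lit(1))
val df1WithId = df_1.withColumn("new_id", row_number().over(w))
val df2WithId = df_2.withColumn("new_id", row_number().over(w))
val joined = df1WithId.join(df2WithId, Seq("new_id"), "left")

// A distributed alternative: zip the underlying RDD with its index
// (consecutive, starting from 0, shifted to 1 here) and rebuild the dataframe.
def withRowIndex(df: DataFrame): DataFrame = {
  val schema = StructType(df.schema.fields :+ StructField("new_id", LongType, nullable = false))
  val rows = df.rdd.zipWithIndex.map { case (row, idx) => Row.fromSeq(row.toSeq :+ (idx + 1)) }
  df.sparkSession.createDataFrame(rows, schema)
}
```

The zipWithIndex variant avoids shuffling everything into one partition, which is why it is usually preferable to the single-partition window for larger data.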

1 Answer


Since Spark uses lazy evaluation, execution will not start until an action is triggered. So what you can do is simply call the Spark context's createDataframe function and pass it a list of selected columns from df1 and df2. It will create a new dataframe as you need.

e.g. df3 = spark.createDataframe([df1.select(''),df2.select('')])

Upvote if it works.

Ajinkya Bhore
  • I got this error: AttributeError: 'SparkSession' object has no attribute 'createDataframe' – verojoucla Dec 06 '19 at 11:56
  • The variable spark here is a Spark context object, while in your code it is a Spark session object. You need to use 'spark context object'.createDataframe([df1.select(''),df2.select('')]) – Ajinkya Bhore Dec 06 '19 at 12:20