
I have two data frames and I am doing an outer join on them. The outer join leaves me with many duplicate columns with the same header.

Here is my df1:

+----------+------+---------+--------+------+
|     OrgId|ItemId|segmentId|Sequence|Action|
+----------+------+---------+--------+------+
|4295877341|   136|        4|       1|  I|!||
|4295877346|   136|        4|       1|  I|!||
|4295877341|   138|        2|       1|  I|!||
|4295877341|   141|        4|       1|  I|!||
|4295877341|   143|        2|       1|  I|!||
|4295877341|   145|       14|       1|  I|!||
| 123456789|   145|       14|       1|  I|!||
+----------+------+---------+--------+------+

Here is my df2:

+----------+------+---------+--------+------+
|     OrgId|ItemId|segmentId|Sequence|Action|
+----------+------+---------+--------+------+
|4295877343|   149|       15|       2|  I|!||
|4295877341|   136|     null|    null|  I|!||
| 123456789|   145|       14|       1|  D|!||
|4295877341|   138|       11|      22|  I|!||
|4295877341|   141|       10|       1|  I|!||
|4295877341|   143|        1|       1|  I|!||
+----------+------+---------+--------+------+

Now I am joining both data frames in order to get the expected output below.

+----------+------+---------+--------+------+
|     OrgId|ItemId|segmentId|Sequence|Action|
+----------+------+---------+--------+------+
|4295877346|   136|        4|       1|  I|!||
|4295877341|   145|       14|       1|  I|!||
|4295877343|   149|       15|       2|  I|!||
|4295877341|   136|     null|    null|  I|!||
|4295877341|   138|       11|      22|  I|!||
|4295877341|   141|       10|       1|  I|!||
|4295877341|   143|        1|       1|  I|!||
+----------+------+---------+--------+------+

Please note that in the output I update data frame 1 with data frame 2's records for the matching keys; new records in df2 also appear in the output; and any records in df1 that do not match df2 are kept in the output data frame as well.
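The merge rule above can be illustrated in plain Scala (this is only a sketch of the semantics, not Spark code), with rows keyed by (OrgId, ItemId) and df2 winning on matching keys:

```scala
// Illustrative only: the merge rule, with rows keyed by (OrgId, ItemId).
// df2 values win on matching keys; unmatched rows from either side survive.
val df1Rows = Map(
  (4295877341L, 136) -> ("4", "1", "I|!|"),
  (4295877346L, 136) -> ("4", "1", "I|!|")
)
val df2Rows = Map(
  (4295877341L, 136) -> ("null", "null", "I|!|"),
  (4295877343L, 149) -> ("15", "2", "I|!|")
)
val merged = df1Rows ++ df2Rows  // right-hand map overrides on key collisions
```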

To do this, here is what I have tried.

First I tried this (here latestForEachKey is df2):

val df3 = df1.join(latestForEachKey, Seq("OrgId", "ItemId"), "outer")

Then I got the output below:

+----------+------+---------+--------+------+---------+--------+------+
|     OrgId|ItemId|segmentId|Sequence|Action|segmentId|Sequence|Action|
+----------+------+---------+--------+------+---------+--------+------+
|4295877346|   136|        4|       1|  I|!||     null|    null|  null|
|4295877341|   145|       14|       1|  I|!||     null|    null|  null|
|4295877343|   149|     null|    null|  null|       15|       2|  I|!||
|4295877341|   136|        4|       1|  I|!||     null|    null|  I|!||
| 123456789|   145|       14|       1|  I|!||       14|       1|  D|!||
|4295877341|   138|        2|       1|  I|!||       11|      22|  I|!||
|4295877341|   141|        4|       1|  I|!||       10|       1|  I|!||
|4295877341|   143|        2|       1|  I|!||        1|       1|  I|!||
+----------+------+---------+--------+------+---------+--------+------+

Then I tried renaming df2's columns with a _1 suffix and executed the code below:

val df3 = df1.join(latestForEachKey, Seq("OrgId", "ItemId"), "outer")
      .select($"OrgId", $"ItemId",
        when($"segmentId_1".isNotNull, $"segmentId_1").otherwise($"segmentId").as("segmentId"),
        when($"Sequence_1".isNotNull, $"Sequence_1").otherwise($"Sequence").as("Sequence"),
        when($"Action_1".isNotNull, $"Action_1").otherwise($"Action").as("Action")).filter(!$"Action".contains("D"))

This has given me the correct output, but I have to write the when condition so many times.

My original data set has 60 columns, so this becomes a problem.

Please suggest some way out.
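One way to avoid hand-writing a `when` condition per column (a hedged sketch, assuming df2's non-key columns carry a `_1` suffix as in the code above) is to generate the select expressions from the column list. The expression-building part is plain Scala:

```scala
// Sketch: build "coalesce(c_1, c) AS c" select expressions for every
// non-key column, instead of writing when/otherwise per column by hand.
// coalesce(c_1, c) matches the when($"c_1".isNotNull, ...) pattern above.
def mergeExprs(allCols: Seq[String], keyCols: Seq[String]): Seq[String] = {
  val dataCols = allCols.filterNot(keyCols.contains)
  keyCols ++ dataCols.map(c => s"coalesce(${c}_1, $c) AS $c")
}

// With Spark this could then be used as (untested sketch):
//   df1.join(latestForEachKey, Seq("OrgId", "ItemId"), "outer")
//      .selectExpr(mergeExprs(df1.columns, Seq("OrgId", "ItemId")): _*)
//      .filter("Action NOT LIKE '%D%'")
```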

Zissouu
  • One alternative is to generate an SQL statement out of the column names of the dataframe and finally execute that. – Avishek Bhattacharya Oct 02 '17 at 13:20
  • @AvishekBhattacharya then it might slow down the process... –  Oct 02 '17 at 13:48
  • Why will it slow down the process? Spark sql and dataframe operations don't have any difference in performance. If you concerned about sql generation, that is extremely fast process as there are only 60 columns – Avishek Bhattacharya Oct 02 '17 at 13:50
  • https://stackoverflow.com/questions/35258506/spark-duplicate-columns-in-dataframe-after-join : I think this is the expected behaviour. Since you are doing an outer join, it means that column names from both the dataframes have to be present. – user238607 Oct 02 '17 at 13:54
  • @user238607 but in my case i have similar columns in both data frame –  Oct 02 '17 at 13:57
  • @user7547751, are you still facing this problem ?? you can try an udf .. – Sarath Chandra Vema Oct 14 '19 at 10:24
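
The SQL-generation idea mentioned in the comments could be sketched like this (a plain-Scala sketch; the view names `df1`/`df2` and the aliases `a`/`b` are assumptions):

```scala
// Sketch of the SQL-generation idea from the comments: build one SELECT
// over two registered views instead of writing 60 when(...) conditions.
// View names "df1"/"df2" and the aliases are assumptions for illustration.
def buildMergeSql(allCols: Seq[String], keyCols: Seq[String]): String = {
  val dataCols = allCols.filterNot(keyCols.contains)
  val keySel   = keyCols.map(k => s"coalesce(b.$k, a.$k) AS $k").mkString(", ")
  val dataSel  = dataCols.map(c => s"coalesce(b.$c, a.$c) AS $c").mkString(", ")
  val on       = keyCols.map(k => s"a.$k = b.$k").mkString(" AND ")
  s"SELECT $keySel, $dataSel FROM df1 a FULL OUTER JOIN df2 b ON $on"
}

// The resulting string could then be run with spark.sql(...) after
// df1.createOrReplaceTempView("df1") and df2.createOrReplaceTempView("df2").
```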

0 Answers