How to group unioned dataframe to combine same rows

Question

I have just unioned two dataframes in pyspark and instead of it combining the rows with the same dates, it stacked them on top of each other like so:

df1:

+----------+------------+--------------+
|      date| bounceCount|  captureCount|
+----------+------------+--------------+ 
|  20190518|           2|          null|
|  20190521|           1|          null|
|  20190519|           1|          null|
|  20190522|           1|          null|
+----------+------------+--------------+

df2:

+----------+-------------+-------------+
|      date| captureCount|  bounceCount|
+----------+-------------+-------------+ 
|  20190516|         null|            3|
|  20190518|         null|            2|
|  20190519|         null|            1|
|  20190524|         null|            5|
+----------+-------------+-------------+

union:

+----------+------------+--------------+
|      date| bounceCount|  captureCount|
+----------+------------+--------------+ 
|  20190518|           2|          null|
|  20190521|           1|          null|
|  20190519|           1|          null|
|  20190522|           1|          null|
|  20190516|        null|             3|
|  20190518|        null|             2|
|  20190519|        null|             1|
|  20190524|        null|             5|
+----------+------------+--------------+

I would like it to group it so that the rows with the same dates get combined with the correct bounceCount and captureCount:

+----------+------------+--------------+
|      date| bounceCount|  captureCount|
+----------+------------+--------------+ 
|  20190518|           2|             2|
|  20190521|           1|          null|
|  20190519|           1|             1|
|  20190522|           1|          null|
|  20190516|        null|             3|
|  20190524|        null|             5|
+----------+------------+--------------+

I have tried putting them together in different ways, and grouping the dataframe in different ways, but I cannot figure it. I will also be attaching this dataframe with several other columns, so I would like to know the best way to do this. Anyone know a simple way of doing this?

See [What are the various join types in Spark?](https://stackoverflow.com/questions/45990633/what-are-the-various-join-types-in-spark/54479858) and [update a dataframe column with new values](https://stackoverflow.com/questions/49442572/update-a-dataframe-column-with-new-values). — pault, Jun 04 '19 at 20:56
OR you can probably do a `groupBy` after the `union`, but the `join` is probably more efficient. — pault, Jun 04 '19 at 20:57

score 1 · Accepted Answer · answered Jun 05 '19 at 04:13

1

You can achieve this by outer join.

df = (
    df1.select('date', 'bounceCount')
    .join(
        df2.select('date', 'captureCount'),
        on='data', how='outer'
    )
)

answered Jun 05 '19 at 04:13

Louis Yang

3,511
1
25
24

score 1 · Answer 2 · answered Jun 05 '19 at 07:53

Try this -

Join(full) both dataframes and use coalesce function.

from pyspark.sql.functions import coalesce

joining_condition = [df1.date == df2.date]

df1\
    .join(df2,joining_condition,'full')\
    .select(coalesce(df1.date,df2.date).alias('date')
            ,df1.bounceCount
            ,df2.bounceCount.alias('captureCount'))\
    .show()

#+--------+-----------+------------+
#|    date|bounceCount|captureCount|
#+--------+-----------+------------+
#|20190518|          2|           2|
#|20190519|          1|           1|
#|20190521|          1|        null|
#|20190524|       null|           5|
#|20190522|          1|        null|
#|20190516|       null|           3|
#+--------+-----------+------------+

I think columns of df2 dataframe got interchanged. Please check. if that's the case change the column names in solution.

How to group unioned dataframe to combine same rows

df1:

df2:

union:

2 Answers2