I have just unioned two DataFrames in PySpark, and instead of combining the rows that share a date, it stacked them on top of each other, like so:
df1:
+----------+------------+--------------+
| date| bounceCount| captureCount|
+----------+------------+--------------+
| 20190518| 2| null|
| 20190521| 1| null|
| 20190519| 1| null|
| 20190522| 1| null|
+----------+------------+--------------+
df2:
+----------+-------------+-------------+
| date| captureCount| bounceCount|
+----------+-------------+-------------+
| 20190516| null| 3|
| 20190518| null| 2|
| 20190519| null| 1|
| 20190524| null| 5|
+----------+-------------+-------------+
union:
+----------+------------+--------------+
| date| bounceCount| captureCount|
+----------+------------+--------------+
| 20190518| 2| null|
| 20190521| 1| null|
| 20190519| 1| null|
| 20190522| 1| null|
| 20190516| null| 3|
| 20190518| null| 2|
| 20190519| null| 1|
| 20190524| null| 5|
+----------+------------+--------------+
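For reference, this is roughly how I built and unioned the two frames (a minimal sketch reconstructed from the tables above; the explicit schemas are just so the all-null columns have a type):

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.getOrCreate()

# Schemas reconstructed from the tables above; note the two frames
# list their count columns in a different order
schema1 = StructType([
    StructField("date", StringType()),
    StructField("bounceCount", IntegerType()),
    StructField("captureCount", IntegerType()),
])
schema2 = StructType([
    StructField("date", StringType()),
    StructField("captureCount", IntegerType()),
    StructField("bounceCount", IntegerType()),
])

df1 = spark.createDataFrame(
    [("20190518", 2, None), ("20190521", 1, None),
     ("20190519", 1, None), ("20190522", 1, None)],
    schema1,
)
df2 = spark.createDataFrame(
    [("20190516", None, 3), ("20190518", None, 2),
     ("20190519", None, 1), ("20190524", None, 5)],
    schema2,
)

# union() matches columns by position, not by name,
# so the rows simply get stacked
unioned = df1.union(df2)
unioned.show()
```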
I would like to group it so that rows with the same date are combined, with the correct bounceCount and captureCount:
+----------+------------+--------------+
| date| bounceCount| captureCount|
+----------+------------+--------------+
| 20190518| 2| 2|
| 20190521| 1| null|
| 20190519| 1| 1|
| 20190522| 1| null|
| 20190516| null| 3|
| 20190524| null| 5|
+----------+------------+--------------+
I have tried putting them together and grouping the DataFrame in different ways, but I cannot figure it out. I will also be joining several other columns onto this DataFrame later, so I would like to know the best way to do this. Does anyone know a simple way of doing it?
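For example, one of the groupings I tried looks roughly like this (a sketch; using first with ignorenulls=True to pick the non-null value per date is my own guess, and I am not sure it holds up once more columns are involved):

```python
from pyspark.sql import functions as F

# Collapse the stacked rows: for each date, keep the first
# non-null bounceCount and captureCount
combined = unioned.groupBy("date").agg(
    F.first("bounceCount", ignorenulls=True).alias("bounceCount"),
    F.first("captureCount", ignorenulls=True).alias("captureCount"),
)
combined.show()
```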