I know a similar question has been asked here, but nobody has raised the case where one or more of the DataFrames in the reduce are empty. With an inner join, the final result in that case will always be empty.
If I have:
listDataFrames = List(
+----+---+  +----+------+  +----+-------+  +----+---+
|  ID|  A|  |  ID|     B|  |  ID|      C|  |  ID|  D|
+----+---+  +----+------+  +----+-------+  +----+---+
|9574|  F|  |9574|005912|  |9574|2016022|  |9574| VD|
|9576|  F|  |9576|005912|  |9576|2016022|  |9576| VD|
|9578|  F|  |9578|005912|  |9578|2016022|  |9578| VD|
|9580|  F|  |9580|005912|  |9580|2016022|  |9580| VD|
|9582|  F|  |9582|005912|  |9582|2016022|  |9582| VD|
+----+---+, +----+------+, +----+-------+, +----+---+
)
and I want to reduce this list into a single dataframe, I could easily do:
listDataFrames.reduce(
(df1, df2) =>
df1.join(df2, Seq("ID")).localCheckpoint(true)
)
Whatever the join type (inner, left, or right), if one of the first two DataFrames is empty, the final result will be empty.
One workaround is to filter out the empty DataFrames first:
listDataFrames.filter(!_.rdd.isEmpty())
but this takes a long time, so it performs poorly. Do you have any suggestions?
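For reference, here is a minimal sketch of the filter-then-reduce idea with a different emptiness check. It assumes that fetching at most one row with head(1) is an acceptable test, and I have not measured whether it is actually faster than rdd.isEmpty() on my data, since Spark still has to evaluate enough of each DataFrame's lineage to produce that row. The names JoinNonEmpty and joinNonEmpty are hypothetical, and reduceOption is used so that a list where every DataFrame is empty gives None instead of throwing.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object JoinNonEmpty {
  // Hypothetical helper: keeps only DataFrames that have at least one row,
  // then inner-joins the survivors on "ID".
  // Returns None when nothing is left to join.
  def joinNonEmpty(dfs: Seq[DataFrame]): Option[DataFrame] =
    dfs
      .filter(df => df.head(1).nonEmpty) // asks Spark for at most one row
      .reduceOption((left, right) => left.join(right, Seq("ID")))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("join-non-empty")
      .getOrCreate()
    import spark.implicits._

    val a = Seq((9574, "F"), (9576, "F")).toDF("ID", "A")
    val b = Seq.empty[(Int, String)].toDF("ID", "B") // empty on purpose
    val d = Seq((9574, "VD"), (9576, "VD")).toDF("ID", "D")

    // b is skipped, so the result is a joined with d on ID.
    joinNonEmpty(Seq(a, b, d)).foreach(_.show())

    spark.stop()
  }
}
```

The localCheckpoint(true) call from my reduce is left out here only to keep the sketch short; it could be added back inside the reduceOption lambda.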