I have multiple rows (with the same id) of data in a Spark Scala DataFrame. How do I combine the data from all of these rows into a single row?
The screenshot below shows the input data and the expected output.
You could do it with a Window function and then aggregate with pyspark.sql.functions.first, ignoring the nulls with ignorenulls=True instead of the default ignorenulls=False. Finally, take a .distinct() to get rid of the duplicate rows (3 per id in this case), since the window aggregation happens for every row.
from pyspark.sql import functions as F, Window

# Partition by the id column so values from all rows with the same eid are combined
window_spec = Window.partitionBy("eid")
cols = df.columns

# Take the first non-null value of each column within the eid partition,
# then drop the duplicate rows produced by the per-row window aggregation
df = (df.select(*[F.first(col, ignorenulls=True).over(window_spec).alias(col) for col in cols])
    .distinct()
)
df.show()
Output:
+---+-----+----+-------+-----+-----------+--------+
|eid|ename|esal| eaddr|edept|designation|isactive|
+---+-----+----+-------+-----+-----------+--------+
|123| abc|1000|newyork| IT| manager| y|
|456| def|2000|chicago| mech| lead| n|
+---+-----+----+-------+-----+-----------+--------+
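
Since the question is about a Scala DataFrame, here is a rough Scala equivalent of the same approach (a minimal sketch, assuming the same column names and that df already holds the input data):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.first

// Same idea: take the first non-null value per column within each eid partition,
// then call distinct to collapse the duplicated rows left by the window aggregation
val windowSpec = Window.partitionBy("eid")

val combined = df
  .select(df.columns.map(c => first(c, ignoreNulls = true).over(windowSpec).alias(c)): _*)
  .distinct()

combined.show()

Here first(colName, ignoreNulls = true) is the Scala counterpart of F.first(col, ignorenulls=True) used above, so the result should match the table shown.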