
I have multiple rows (with the same id) of data in a Spark Scala DataFrame. How can I combine the data from all of those rows into a single row?

The screenshot below shows the input data and the expected output.

[Screenshot: Dataframe with multiple rows]

bigdata techie
Does [this](https://stackoverflow.com/questions/62362983/merge-rows-in-apache-spark-by-eliminating-null-values) answer your question? – leleogere Aug 31 '22 at 06:41

1 Answer


You could do it with a Window function, aggregating with pyspark.sql.functions.first but ignoring nulls by passing ignorenulls=True instead of the default ignorenulls=False. Finally, take a .distinct() to get rid of the duplicate rows (3 per id in this case), since the aggregation happens for every row.
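
The snippet below assumes a DataFrame df already exists. For a self-contained run, an input matching the screenshot can be reconstructed from the expected output further down; the exact placement of the nulls across the three rows per id is my assumption:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical input: each id's values are scattered across three rows,
# with nulls everywhere else (layout assumed from the expected output)
df = spark.createDataFrame(
    [
        (123, "abc", None, None, None, None, None),
        (123, None, 1000, "newyork", None, None, None),
        (123, None, None, None, "IT", "manager", "y"),
        (456, "def", None, None, None, None, None),
        (456, None, 2000, "chicago", None, None, None),
        (456, None, None, None, "mech", "lead", "n"),
    ],
    ["eid", "ename", "esal", "eaddr", "edept", "designation", "isactive"],
)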

from pyspark.sql import functions as F, Window

# Aggregate within each id's partition
window_spec = Window.partitionBy("eid")
cols = df.columns

# For every column, take the first non-null value over the window,
# then drop the duplicate rows left by the per-row aggregation
df = (df.select(*[F.first(col, ignorenulls=True).over(window_spec).alias(col) for col in cols])
      .distinct())

df.show()

Output:

+---+-----+----+-------+-----+-----------+--------+
|eid|ename|esal|  eaddr|edept|designation|isactive|
+---+-----+----+-------+-----+-----------+--------+
|123|  abc|1000|newyork|   IT|    manager|       y|
|456|  def|2000|chicago| mech|       lead|       n|
+---+-----+----+-------+-----+-----------+--------+
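
As a side note (a sketch of an alternative, not part of the original answer): a plain groupBy aggregation collapses each id's rows in one pass and needs neither the window nor the trailing .distinct(). Since the question asks about Scala, the counterpart of the key call there is functions.first(col, ignoreNulls = true).

# Alternative: group by the id and take the first non-null value per column
merged = df.groupBy("eid").agg(
    *[F.first(c, ignorenulls=True).alias(c) for c in df.columns if c != "eid"]
)
merged.show()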
viggnah