
I have multiple rows (with the same id) of data in a Spark Scala DataFrame. How can I combine the data from all of those rows into a single row?

The screenshot below shows the input data and the expected output.

[Screenshot: Dataframe with multiple rows]

bigdata techie
Does [this](https://stackoverflow.com/questions/62362983/merge-rows-in-apache-spark-by-eliminating-null-values) answer your question? – leleogere Aug 31 '22 at 06:41

1 Answer


You could do it with a Window function, aggregating with pyspark.sql.functions.first but ignoring nulls by passing ignorenulls=True instead of the default ignorenulls=False. Finally, take a .distinct() to get rid of the duplicate rows (3 per id in this case), since the aggregation happens for every row.
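
The snippet below assumes a DataFrame df already exists. For a self-contained run, an input matching the screenshot can be reconstructed from the expected output further down; the exact placement of the nulls across the three rows per id is my assumption:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical input: each id's values are scattered across three rows,
# with nulls everywhere else (layout assumed from the expected output)
df = spark.createDataFrame(
    [
        (123, "abc", None, None, None, None, None),
        (123, None, 1000, "newyork", None, None, None),
        (123, None, None, None, "IT", "manager", "y"),
        (456, "def", None, None, None, None, None),
        (456, None, 2000, "chicago", None, None, None),
        (456, None, None, None, "mech", "lead", "n"),
    ],
    ["eid", "ename", "esal", "eaddr", "edept", "designation", "isactive"],
)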

from pyspark.sql import functions as F, Window

# Aggregate within each id's partition
window_spec = Window.partitionBy("eid")
cols = df.columns

# For every column, take the first non-null value over the window,
# then drop the duplicate rows left by the per-row aggregation
df = (df.select(*[F.first(col, ignorenulls=True).over(window_spec).alias(col) for col in cols])
      .distinct())

df.show()

Output:

+---+-----+----+-------+-----+-----------+--------+
|eid|ename|esal|  eaddr|edept|designation|isactive|
+---+-----+----+-------+-----+-----------+--------+
|123|  abc|1000|newyork|   IT|    manager|       y|
|456|  def|2000|chicago| mech|       lead|       n|
+---+-----+----+-------+-----+-----------+--------+
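
As a side note (a sketch of an alternative, not part of the original answer): a plain groupBy aggregation collapses each id's rows in one pass and needs neither the window nor the trailing .distinct(). Since the question asks about Scala, the counterpart of the key call there is functions.first(col, ignoreNulls = true).

# Alternative: group by the id and take the first non-null value per column
merged = df.groupBy("eid").agg(
    *[F.first(c, ignorenulls=True).alias(c) for c in df.columns if c != "eid"]
)
merged.show()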
viggnah