
I have a Spark DataFrame like the one below:

+---+----+----+----+----+----+----+
| id|   1|   2|   3|sf_1|sf_2|sf_3|
+---+----+----+----+----+----+----+
|  2|null|null|null| 102| 202| 302|
|  4|null|null|null| 104| 204| 304|
|  1|null|null|null| 101| 201| 301|
|  3|null|null|null| 103| 203| 303|
|  1|  11|  21|  31|null|null|null|
|  2|  12|  22|  32|null|null|null|
|  4|  14|  24|  34|null|null|null|
|  3|  13|  23|  33|null|null|null|
+---+----+----+----+----+----+----+

I want to transform it into the data frame below by merging the rows that contain nulls:

+---+----+----+----+----+----+----+
| id|   1|   2|   3|sf_1|sf_2|sf_3|
+---+----+----+----+----+----+----+
|  1|  11|  21|  31| 101| 201| 301|
|  2|  12|  22|  32| 102| 202| 302|
|  4|  14|  24|  34| 104| 204| 304|
|  3|  13|  23|  33| 103| 203| 303|
+---+----+----+----+----+----+----+

Preferably in Scala.
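
For reference, here is a minimal Scala sketch that builds a frame of this shape (only two of the rows are shown, and the integer types are assumptions; my real data may differ):

import spark.implicits._

// None becomes null in the resulting DataFrame
val df = Seq(
  (2, None,     None,     None,     Some(102), Some(202), Some(302)),
  (1, Some(11), Some(21), Some(31), None,      None,      None)
).toDF("id", "1", "2", "3", "sf_1", "sf_2", "sf_3")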

Learnis

2 Answers


You can group on id and aggregate the other columns using first with ignorenulls=True; since each id has exactly one non-null value per column across its two rows, first picks up exactly that value:

import pyspark.sql.functions as F

# take the first non-null value of every non-id column within each id group
(df.groupBy('id')
   .agg(*[F.first(x, ignorenulls=True) for x in df.columns if x != 'id'])
   .show())

+---+----+----+----+-----+-----+-----+
| id|   1|   2|   3| sf_1| sf_2| sf_3|
+---+----+----+----+-----+-----+-----+
|  1|11.0|21.0|31.0|101.0|201.0|301.0|
|  3|13.0|23.0|33.0|103.0|203.0|303.0|
|  2|12.0|22.0|32.0|102.0|202.0|302.0|
|  4|14.0|24.0|34.0|104.0|204.0|304.0|
+---+----+----+----+-----+-----+-----+
anky
  • Can we do this in Scala? What is F? – Learnis Jun 13 '20 at 17:44
  • @Learnis I don't have Scala installed, but it should be quite similar since the functions are similar in Scala. F is the name under which I imported the module. – anky Jun 13 '20 at 18:08
  • OK, np @anky, thanks. I'm adding it as another answer and accepting yours. – Learnis Jun 13 '20 at 18:13
  • Is there any better approach than this? groupBy is a very heavy operation and I have around 7k columns. – Learnis Jul 10 '20 at 13:43
  • @Learnis I don't think we can avoid the groupBy, but there may be a different approach. It's better to post a new question for visibility as this one was closed. – anky Jul 10 '20 at 14:07
  • I understood, thanks. I have created another question, please have a look: https://stackoverflow.com/questions/62841301/spark-combine-multiple-rows-to-single-row-base-on-specific-column-with-out-group Thanks – Learnis Jul 10 '20 at 19:45

The Scala way of doing it:

import org.apache.spark.sql.functions.first

// keep every column except the grouping column, then take the first non-null value per group
val inputColumns = inputLoadDF.columns.filter(_ != "id")
val exprs = inputColumns.map(c => first(c, ignoreNulls = true))
inputLoadDF.groupBy("id").agg(exprs.head, exprs.tail: _*).show()
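
The aggregated columns come back with generated names such as first(1, true) (the exact form depends on the Spark version). If you want to keep the original column names, you can alias each aggregate, e.g. with the same inputLoadDF as above:

// alias each aggregate back to its source column name
val exprs = inputColumns.map(c => first(c, ignoreNulls = true).as(c))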
Learnis