
I have the following problem:

  • A DataFrame with a column col1 containing the strings A, B, or C.
  • A second column col2 containing an integer.
  • Three further columns col3, col4 and col5 (also referred to as A, B and C in the tables below).

Thus,

 col1 - col2 - A (col3) - B (col4) - C (col5)
|--------------------------------------------
   A      6
   B      5
   C      6

should become

 col1 - col2 - A (col3) - B (col4) - C (col5)
|--------------------------------------------
   A      6       6
   B      5                  5
   C      6                              6

Now I would like to go through each row and assign the integer from col2 to column A, B or C, depending on the entry in col1.

How do I achieve this?

I cannot get df.withColumn() to work for this (or at least I do not see how), and the same holds for val df2 = df.map(x => x).

Looking forward to your help and thanks in advance!

Best, Ken


1 Answer


Create a mapping between the keys and the target columns:

val mapping = Seq(("A", "col3"), ("B", "col4"), ("C", "col5"))

Use it to generate a sequence of columns (when without an otherwise clause yields null for non-matching rows):

import org.apache.spark.sql.functions.when
import spark.implicits._  // for the $"..." column syntax (assumes a SparkSession named spark, as in spark-shell)

val exprs = mapping.map { case (key, target) =>
  when($"col1" === key, $"col2").alias(target)
}

Prepend a star (all existing columns) and select:

val df = Seq(("A", 6), ("B", 5), ("C", 6)).toDF("col1", "col2")
df.select($"*" +: exprs: _*).show()

The result is:

+----+----+----+----+----+
|col1|col2|col3|col4|col5|
+----+----+----+----+----+
|   A|   6|   6|null|null|
|   B|   5|null|   5|null|
|   C|   6|null|null|   6|
+----+----+----+----+----+
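
Since the question mentions withColumn, the same result can also be reached by folding the mapping over the DataFrame with withColumn. This is a sketch of an alternative, not taken from the answer above; it assumes mapping, df and the imports defined earlier are in scope:

// Alternative: one withColumn per target column via foldLeft.
// Non-matching rows end up as null, exactly as in the select-based version.
val df2 = mapping.foldLeft(df) { case (acc, (key, target)) =>
  acc.withColumn(target, when($"col1" === key, $"col2"))
}
df2.show()

The single select builds all three columns in one projection, whereas the fold adds one withColumn step per target column, so the select-based version is the more idiomatic choice here.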