
Using Python's Pandas, one can do bulk operations on multiple columns in one pass like this:

# assuming we have a DataFrame with, among others, the following columns
cols = ['col1', 'col2', 'col3', 'col4', 'col5', 'col6', 'col7', 'col8']
df[cols] = df[cols] / df['another_column']

Is there a similar functionality using Spark in Scala?

Currently I end up doing:

val df2 = df.withColumn("col1", $"col1" / $"another_column")
            .withColumn("col2", $"col2" / $"another_column")
            .withColumn("col3", $"col3" / $"another_column")
            .withColumn("col4", $"col4" / $"another_column")
            .withColumn("col5", $"col5" / $"another_column")
            .withColumn("col6", $"col6" / $"another_column")
            .withColumn("col7", $"col7" / $"another_column")
            .withColumn("col8", $"col8" / $"another_column")
Jivan

3 Answers


You can use foldLeft to process the column list as below:

val df = Seq(
  (1, 20, 30, 4),
  (2, 30, 40, 5),
  (3, 10, 30, 2)
).toDF("id", "col1", "col2", "another_column")

val cols = Array("col1", "col2")

val df2 = cols.foldLeft(df)((acc, c) =>
  acc.withColumn(c, acc(c) / acc("another_column"))
)

df2.show
+---+----+----+--------------+
| id|col1|col2|another_column|
+---+----+----+--------------+
|  1| 5.0| 7.5|             4|
|  2| 6.0| 8.0|             5|
|  3| 5.0|15.0|             2|
+---+----+----+--------------+
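As an aside, the foldLeft mechanics can be seen with plain Scala collections, no Spark required. Below is a minimal sketch using a Map of toy values as a stand-in for the DataFrame (the names and numbers are illustrative, not from the answer above):

```scala
// foldLeft starts from an initial accumulator (the Map below, standing in
// for the DataFrame) and applies the update function once per column name.
val another = 4
val init = Map("col1" -> 20, "col2" -> 30)

val result = List("col1", "col2").foldLeft(init)((acc, c) =>
  acc.updated(c, acc(c) / another) // analogous to acc.withColumn(c, ...)
)
// Integer division here, purely for illustration.
```

Each iteration receives the accumulator produced by the previous one, which is why the Spark version threads `acc` (not the original `df`) through the chain of `withColumn` calls.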
Leo C

For completeness: a slightly different version from @Leo C's answer, using a single select expression instead of foldLeft:

import org.apache.spark.sql.functions._
import spark.implicits._

val toDivide = List("col1", "col2")
val newColumns = toDivide.map(name => col(name) / col("another_column") as name)

val df2 = df.select(($"id" :: newColumns) :+ $"another_column": _*)

Produces the same output.

Tzach Zohar

You can use a plain select with the transformed columns. The result is very similar to the Pandas solution.

// Define the DataFrame df1
case class ARow(col1: Int, col2: Int, anotherCol: Int)
val df1 = spark.createDataset(Seq(
  ARow(1, 2, 3), 
  ARow(4, 5, 6), 
  ARow(7, 8, 9))).toDF

// Perform the operation using a map
val cols = Array("col1", "col2")
val opCols = cols.map(c => df1(c)/df1("anotherCol"))

// Select the transformed columns
val df2 = df1.select(opCols: _*)

Calling .show on df2 gives:

df2.show()
+-------------------+-------------------+
|(col1 / anotherCol)|(col2 / anotherCol)|
+-------------------+-------------------+
| 0.3333333333333333| 0.6666666666666666|
| 0.6666666666666666| 0.8333333333333334|
| 0.7777777777777778| 0.8888888888888888|
+-------------------+-------------------+
elghoto