I have a function that adds 2 columns:

def sum_num (num1: Int, num2: Int): Int = {
    return num1 + num2
}

I have a DataFrame df with the following values:

+----+----+----+
|col1|col2|col3|
+----+----+----+
|1   |2   |5   |
|7   |4   |4   |
+----+----+----+

I want to add a column by passing column names to the function, but the code below does not work. It fails with a type mismatch error: found Column, required Int.

val newdf = df.withColumn("sum_of_cols1", sum_num($col1, $ col2))
              .withColumn("sum_of_cols2", sum_num($col1, $ col3))
  • Does [this](https://stackoverflow.com/a/37227528/2501279) help? – Guru Stron Mar 09 '21 at 21:03
  • @GuruStron I had seen this, but I am not sure how to create the UDF using multiple columns. I had also read that a UDF might have performance implications, and since I am doing the calculation on a billion records I am trying to find other solutions! – justanothertekguy Mar 10 '21 at 06:51

1 Answer

Change your code to:

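import org.apache.spark.sql.Column
import spark.implicits._  // enables the $"colName" syntax

// Takes and returns Column expressions, so it can be used inside withColumn
def sum_num(num1: Column, num2: Column): Column = {
  num1 + num2
}

val newdf = df.withColumn("sum_of_cols1", sum_num($"col1", $"col2"))
  .withColumn("sum_of_cols2", sum_num($"col1", $"col3"))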

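You must operate on Spark SQL columns rather than plain Scala values: $"col1" is a Column expression, not an Int, which is why the original signature does not compile. Columns support arithmetic operations such as +, so take a look at the operators that Column provides.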
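A minimal, self-contained sketch of the same idea follows. The local SparkSession, the app name, the `SumColumnsExample` object, the `sumNum` name, and the extra `doubled_sum` column are assumptions added for illustration and are not part of the question. It shows that the result of the function is itself a Column, so it can feed further calculations without a UDF:

```scala
import org.apache.spark.sql.{Column, SparkSession}

object SumColumnsExample {
  // Builds a Column expression; nothing is computed until an action runs.
  def sumNum(num1: Column, num2: Column): Column = num1 + num2

  def main(args: Array[String]): Unit = {
    // Assumption: a local session, for illustration only.
    val spark = SparkSession.builder()
      .appName("sum-columns-example")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Same data as the table in the question.
    val df = Seq((1, 2, 5), (7, 4, 4)).toDF("col1", "col2", "col3")

    val newdf = df
      .withColumn("sum_of_cols1", sumNum($"col1", $"col2"))
      .withColumn("sum_of_cols2", sumNum($"col1", $"col3"))
      // The returned Column can be combined with further operators.
      .withColumn("doubled_sum", sumNum($"col1", $"col2") * 2)

    newdf.show()
    spark.stop()
  }
}
```

For the two rows in the question, sum_of_cols1 comes out as 3 and 11, and sum_of_cols2 as 6 and 11.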

Emiliano Martinez
  • Thanks for this. For the function sum_num, does the output need to be a Column, or can I output an Int there, since I need that value for some other calculation? – justanothertekguy Mar 10 '21 at 13:49
  • You can operate on columns directly, using the available operators. Don't worry about that; if you need other specific functions, you could add a UDF that takes columns as input too. – Emiliano Martinez Mar 10 '21 at 13:54
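
For the UDF route mentioned in the comments, a hedged sketch of wrapping the original Int-based function so it accepts two columns might look like this. The `SumUdfExample` object, the `sumNumUdf` name, and the local session are illustrative assumptions; for simple arithmetic the built-in Column operators are usually preferable, since Spark cannot optimize inside a UDF:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

object SumUdfExample {
  // Plain Scala function on Ints, as in the question.
  def sumNum(num1: Int, num2: Int): Int = num1 + num2

  // Wrapping it with udf yields a function that takes Columns and returns a Column.
  val sumNumUdf = udf((a: Int, b: Int) => sumNum(a, b))

  def main(args: Array[String]): Unit = {
    // Assumption: a local session, for illustration only.
    val spark = SparkSession.builder()
      .appName("sum-udf-example")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    val df = Seq((1, 2, 5), (7, 4, 4)).toDF("col1", "col2", "col3")

    // The UDF is applied to columns just like a built-in function.
    val newdf = df.withColumn("sum_of_cols1", sumNumUdf($"col1", $"col2"))

    newdf.show()
    spark.stop()
  }
}
```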