Spark SQL: Select with arithmetic on column values and type casting?

Question

I'm using Spark SQL with DataFrames. Is there a way to do a select statement with some arithmetic, just as you can in SQL?

For example, I have the following table:

var data = Array((1, "foo", 30, 5), (2, "bar", 35, 3), (3, "foo", 25, 4))
var dataDf = sc.parallelize(data).toDF("id", "name", "value", "years")

dataDf.printSchema
// root
//  |-- id: integer (nullable = false)
//  |-- name: string (nullable = true)
//  |-- value: integer (nullable = false)
//  |-- years: integer (nullable = false)

dataDf.show()
// +---+----+-----+-----+
// | id|name|value|years|
// +---+----+-----+-----+
// |  1| foo|   30|    5|
// |  2| bar|   35|    3|
// |  3| foo|   25|    4|
//+---+----+-----+-----+

Now, I would like to do a SELECT statement that creates a new column with some arithmetic performed on the existing columns. For example, I would like to compute the ratio value/years. I need to convert value (or years) to a double first. I tried this statement, but it wouldn't parse:

dataDf.
    select(dataDf("name"), (dataDf("value").toDouble/dataDf("years")).as("ratio")).
    show()

<console>:35: error: value toDouble is not a member of org.apache.spark.sql.Column
              select(dataDf("name"), (dataDf("value").toDouble/dataDf("years")).as("ratio")).

I saw a similar question in "How to change column types in Spark SQL's DataFrame?", but that's not quite what I want.

score 4 · Accepted Answer · edited May 15 '18 at 12:30

4

A proper way to change type of a Column is to use cast method. It can either take a description string:

dataDf("value").cast("double") / dataDf("years")

or a DataType:

import org.apache.spark.sql.types.DoubleType

dataDf("value").cast(DoubleType) / dataDf("years")

edited May 15 '18 at 12:30

Community

1
1

answered Mar 01 '16 at 00:13

zero323

322,348
103
959
935

Thanks. This method for casting works. `dataDf.select(dataDf("name"), (dataDf("value").cast("double")/dataDf("years")).as("ratio")).show` – stackoverflowuser2010 Mar 02 '16 at 00:17

Alberto Bonsanto · Answer 2 · 2016-03-01T00:30:24.220

Well if it's not a requirement to use a select method, you can just use withColumn.

val resDF = dataDf.withColumn("result", col("value").cast("double") / col("years"))
resDF.show

//+---+----+-----+-----+------------------+
//| id|name|value|years|            result|
//+---+----+-----+-----+------------------+
//|  1| foo|   30|    5|               6.0|
//|  2| bar|   35|    3|11.666666666666666|
//|  3| foo|   25|    4|              6.25|
//+---+----+-----+-----+------------------+

If it's a requirement to use a select, one option could be:

val exprs = dataDf.columns.map(col(_)) ++ List((col("value").cast("double") / col("years")).as("result"))
dataDf.select(exprs: _*).show

//+---+----+-----+-----+------------------+
//| id|name|value|years|            result|
//+---+----+-----+-----+------------------+
//|  1| foo|   30|    5|               6.0|
//|  2| bar|   35|    3|11.666666666666666|
//|  3| foo|   25|    4|              6.25|
//+---+----+-----+-----+------------------+

Spark SQL: Select with arithmetic on column values and type casting?

2 Answers2