
Let's assume that we have a Spark DataFrame

df.getClass
Class[_ <: org.apache.spark.sql.DataFrame] = class org.apache.spark.sql.DataFrame

with the following schema

df.printSchema
root
|-- rawFV: string (nullable = true)
|-- tk: array (nullable = true)
|    |-- element: string (containsNull = true)

Given that each row of the tk column is an array of strings, how can I write a Scala function that will return the number of elements in each row?
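
For reference, a small DataFrame with this schema can be built as follows. This is only an illustrative sketch with made-up sample values; it assumes a Spark 1.x spark-shell where `sc` is the SparkContext:

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

// rawFV holds a raw string, tk holds its tokens (sample values, made up)
val df = Seq(
  ("a b c", Seq("a", "b", "c")),
  ("d e",   Seq("d", "e"))
).toDF("rawFV", "tk")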

ranlot

2 Answers


You don't have to write a custom function because there is a built-in one:

import org.apache.spark.sql.functions.size

df.select(size($"tk"))

If you really want to, you can write a UDF:

import org.apache.spark.sql.functions.udf

val size_ = udf((xs: Seq[String]) => xs.size)

or even create a custom expression, but there is really no point in that.
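
For completeness, the UDF above is applied the same way as the built-in size. A minimal sketch (countTokens below is just an illustrative name, not something defined in the answer):

// apply the UDF column-wise, just like the built-in size
df.select(size_($"tk"))

// a named def works too: pass it through the udf wrapper
def countTokens(xs: Seq[String]): Int = xs.size
val countTokensUdf = udf(countTokens _)
df.select(countTokensUdf($"tk"))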

zero323
  • Perfect! For generality, I would like to know how to apply a UDF to a dataframe. Could you point me to a simple example? – ranlot Jan 05 '16 at 15:16
  • There are dozens of examples on SO ([a couple of examples](https://stackoverflow.com/search?q=user%3A1560062+import+org.apache.spark.sql.functions.udf+[apache-spark])) and, as always, the source (especially the tests) is a good place to start. – zero323 Jan 05 '16 at 15:21
  • How would you use this size_ function? – ranlot Jan 05 '16 at 15:22
  • The same way as the built-in `size` (`size_($"tk")`). – zero323 Jan 05 '16 at 15:23
  • What if I want to define size_ with a def? I understand it may look like complete overkill, but this way it would be very easy to change it to something else. – ranlot Jan 05 '16 at 15:41
  • Just pass it to the udf wrapper. – zero323 Jan 05 '16 at 16:07
  • In Spark 1.6.1, I get the following error when using this approach: error: type mismatch; found : org.apache.spark.sql.Column required: org.apache.spark.sql.TypedColumn[TitleRecs,?] titleRecs.select(size($"sims")) – Alexander Tronchin-James Dec 02 '16 at 23:22

One way is to access the array elements using SQL, as shown below.

df.registerTempTable("tab1")
val df2 = sqlContext.sql("select tk[0], tk[1] from tab1")

df2.show()

To get the size of the array column:

val df3 = sqlContext.sql("select size(tk) from tab1")
df3.show()

If your Spark version is older, you can use HiveContext instead of SQLContext.
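
A minimal sketch of that variant, assuming a spark-shell style environment where `sc` is the SparkContext (the sample data is re-created through the HiveContext so the temp table is registered there):

import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)
import hiveContext.implicits._

// sample values, made up for illustration
val df = Seq(
  ("a b c", Seq("a", "b", "c")),
  ("d e",   Seq("d", "e"))
).toDF("rawFV", "tk")

df.registerTempTable("tab1")

// Hive's size() works the same way here
hiveContext.sql("select size(tk) from tab1").show()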

I would also try an approach that traverses the array elements directly if you need more than just the size.

Srini