
Let's assume that we have a Spark DataFrame

df.getClass
Class[_ <: org.apache.spark.sql.DataFrame] = class org.apache.spark.sql.DataFrame

with the following schema

df.printSchema
root
|-- rawFV: string (nullable = true)
|-- tk: array (nullable = true)
|    |-- element: string (containsNull = true)

Given that each row of the tk column is an array of strings, how can I write a Scala function that will return the number of elements in each row?
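
For reference, a small DataFrame with this schema can be built as follows. This is only an illustrative sketch with made-up sample values; it assumes a Spark 1.x spark-shell where `sc` is the SparkContext:

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

// rawFV holds a raw string, tk holds its tokens (sample values, made up)
val df = Seq(
  ("a b c", Seq("a", "b", "c")),
  ("d e",   Seq("d", "e"))
).toDF("rawFV", "tk")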

ranlot

2 Answers


You don't have to write a custom function because there is a built-in one:

import org.apache.spark.sql.functions.size

df.select(size($"tk"))

If you really want to, you can write a UDF:

import org.apache.spark.sql.functions.udf

val size_ = udf((xs: Seq[String]) => xs.size)

or even create a custom expression, but there is really no point in that.
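
For completeness, the UDF above is applied the same way as the built-in size. A minimal sketch (countTokens below is just an illustrative name, not something defined in the answer):

// apply the UDF column-wise, just like the built-in size
df.select(size_($"tk"))

// a named def works too: pass it through the udf wrapper
def countTokens(xs: Seq[String]): Int = xs.size
val countTokensUdf = udf(countTokens _)
df.select(countTokensUdf($"tk"))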

zero323
  • Perfect! For generality, I would like to know how to apply a UDF to a dataframe. Could you point me to a simple example? – ranlot Jan 05 '16 at 15:16
  • There are dozens of examples on SO ([a couple of examples](https://stackoverflow.com/search?q=user%3A1560062+import+org.apache.spark.sql.functions.udf+[apache-spark])) and, as always, the source (especially the tests) is a good place to start. – zero323 Jan 05 '16 at 15:21
  • How would you use this size_ function? – ranlot Jan 05 '16 at 15:22
  • The same way as the built-in `size` (`size_($"tk")`). – zero323 Jan 05 '16 at 15:23
  • What if I want to define size_ with a def? I understand it may look like complete overkill, but this way it would be very easy to change it to something else. – ranlot Jan 05 '16 at 15:41
  • Just pass it to the udf wrapper. – zero323 Jan 05 '16 at 16:07
  • In Spark 1.6.1, I get the following error when using this approach: error: type mismatch; found : org.apache.spark.sql.Column required: org.apache.spark.sql.TypedColumn[TitleRecs,?] titleRecs.select(size($"sims")) – Alexander Tronchin-James Dec 02 '16 at 23:22

One way is to access the array elements using SQL, as shown below.

df.registerTempTable("tab1")
val df2 = sqlContext.sql("select tk[0], tk[1] from tab1")

df2.show()

To get the size of the array column:

val df3 = sqlContext.sql("select size(tk) from tab1")
df3.show()

If your Spark version is older, you can use HiveContext instead of SQLContext.
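
A minimal sketch of that variant, assuming a spark-shell style environment where `sc` is the SparkContext (the sample data is re-created through the HiveContext so the temp table is registered there):

import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)
import hiveContext.implicits._

// sample values, made up for illustration
val df = Seq(
  ("a b c", Seq("a", "b", "c")),
  ("d e",   Seq("d", "e"))
).toDF("rawFV", "tk")

df.registerTempTable("tab1")

// Hive's size() works the same way here
hiveContext.sql("select size(tk) from tab1").show()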

I would also try an approach that traverses the array elements directly if you need more than just the size.

Srini