2

I have a Dataframe like this:

val df = Seq(
  Seq(("a","b","c"))
  )
.toDF("arr")
.select($"arr".cast("array<struct<c1:string,c2:string,c3:string>>"))

df.printSchema

root
 |-- arr: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- c1: string (nullable = true)
 |    |    |-- c2: string (nullable = true)
 |    |    |-- c3: string (nullable = true)

df.show()
+---------+
|      arr|
+---------+
|[[a,b,c]]|
+---------+

I want to select only c1 and c3, such that:

df.printSchema

root
 |-- arr: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- c1: string (nullable = true)
 |    |    |-- c3: string (nullable = true)

df.show()

+---------+
|      arr|
+---------+
|[[a,c]]  |
+---------+

Can this be done without UDF?

I can do it with an UDF, but I'd like a solution without it, something like

df
.select($"arr.c1".as("arr"))

root
 |-- arr: array (nullable = true)
 |    |-- element: string (containsNull = true)

But this only works to select 1 struct element, I've also tried :

df
.select(array(struct($"arr.c1",$"arr.c3")).as("arr"))

but this gives

root
 |-- arr: array (nullable = false)
 |    |-- element: struct (containsNull = false)
 |    |    |-- c1: array (nullable = true)
 |    |    |    |-- element: string (containsNull = true)
 |    |    |-- c3: array (nullable = true)
 |    |    |    |-- element: string (containsNull = true)
Raphael Roth
  • 26,751
  • 15
  • 88
  • 145
  • related: https://stackoverflow.com/questions/32727279/dropping-a-nested-column-from-spark-dataframe – mtoto Feb 19 '18 at 11:29

1 Answers1

0

I can only give an answer for the Python API but I am sure the Scala API has something very similar.

The key is the function arrays_zip, which, according to the documentation, "[r]eturns a merged array of structs in which the N-th struct contains all N-th values of input arrays."

Example (still from the documentation):

from pyspark.sql.functions import arrays_zip
df = spark.createDataFrame([(([1, 2, 3], [2, 3, 4]))], ['vals1', 'vals2'])
df.select(arrays_zip(df.vals1, df.vals2).alias('zipped')).collect()
# Prints: [Row(zipped=[Row(vals1=1, vals2=2), Row(vals1=2, vals2=3), Row(vals1=3, vals2=4)])]
ingomueller.net
  • 4,097
  • 2
  • 36
  • 33