
I am using Spark 1.6. I have a dataframe generated from a parquet file with 6 columns. I am trying to group (partitionBy) and order (orderBy) the rows in the dataframe, to later collect those columns into an Array.
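To make the goal concrete, here is a minimal sketch of the transformation I am after, with toy data standing in for my parquet files (the column names and values below are made up for illustration; this is in spark-shell, so sc is already defined):

import org.apache.spark.sql.hive.HiveContext

val sqlContext = new HiveContext(sc)
import sqlContext.implicits._

// Toy stand-in for the real dataframe (two payload columns instead of six)
val dfToy = Seq(
  ("A", "2017-01-02", "a2", "b2"),
  ("A", "2017-01-01", "a1", "b1"),
  ("B", "2017-01-01", "a3", "b3")
).toDF("code", "date", "col1", "col2")

// Desired result: one collected array per code, ordered by date descending
// A -> [ [a2, b2], [a1, b1] ]
// B -> [ [a3, b3] ]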

I wasn't sure whether these operations were possible in Spark 1.6, but the following answers show how it can be done:

Based on those answers I wrote the following code:

import org.apache.spark.sql.{DataFrame, SQLContext}
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, collect_list, struct}

val sqlContext: SQLContext = new HiveContext(sc)
val conf = sc.hadoopConfiguration
val dataPath = "/user/today/*/*"

val dfSource: DataFrame = sqlContext.read
  .format("parquet")
  .option("dateFormat", "DDMONYY")
  .option("timeFormat", "HH24:MI:SS")
  .load(dataPath)

// Window over each code, newest date first
val w = Window.partitionBy("code").orderBy(col("date").desc)

val dfCollec = dfSource.withColumn("collected",
  collect_list(struct("col1", "col2", "col3", "col4", "col5", "col6")).over(w))

So I followed the pattern from Ramesh's answer, and I created the sqlContext from a HiveContext as Zero recommended. But I am still getting the following error:

java.lang.UnsupportedOperationException: 'collect_list(struct('col1,'col2,'col3,'col4,'col5,'col6)) is not supported in a window operation.
  at org.apache.spark.sql.expressions.WindowSpec.withAggregate(WindowSpec.scala:191)
  at org.apache.spark.sql.Column.over(Column.scala:1052)
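In case it is relevant: I assume the problem is not the struct itself. Since the stack trace points at WindowSpec.withAggregate, I would expect a stripped-down version collecting a single plain column over the same window to fail the same way (I have not confirmed this, it is just my reading of the error):

// Hypothetical reduced version: same window, a single column instead of the struct
val dfSingle = dfSource.withColumn("collected", collect_list(col("col1")).over(w))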

What am I still missing?
