
I am grouping my Spark DataFrame by a field and trying to collect all the elements associated with each group/key into an array. I am using collect_list() inside .agg(). I am using Scala, like:

val ndf = grp.agg(collect_list(col("site")))

Here grp is the grouped data I get after calling groupBy, and "site" is the column whose entries I am collecting.

This works if I run it in the spark-shell, but not when I run my entire code with spark-submit. I am importing:

import org.apache.spark.sql.functions._

which is where the collect_list function is defined.

Both Spark versions are the same. The only difference is that spark-shell initializes a HiveContext by default, while my flow does not. But from what I know, this has nothing to do with the Hive context.
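For reference, a spark-submit application has to create the HiveContext itself, which spark-shell otherwise does automatically. A minimal sketch for Spark 1.6 (the app name, sample data, and column names are my own for illustration, and it needs the spark-hive artifact on the classpath, as discussed in the comments below):

```scala
import org.apache.spark.{SparkConf, SparkContext}
// Requires the spark-hive artifact as a project dependency.
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.functions._

object CollectListApp {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("collect-list-app"))
    // spark-shell creates this as `sqlContext` automatically;
    // a spark-submit application must construct it explicitly.
    val sqlContext = new HiveContext(sc)
    import sqlContext.implicits._

    val df = Seq(("u1", "a.com"), ("u1", "b.com"), ("u2", "c.com"))
      .toDF("id", "site")

    // Same pattern as in the question: group, then collect the column into an array.
    val grp = df.groupBy(col("id"))
    val ndf = grp.agg(collect_list(col("site")).as("sites"))
    ndf.show()
  }
}
```

The resulting `sites` column has type `array<string>`, one array per group.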

What's the issue here? Someone else has the same issue here:

http://apache-spark-user-list.1001560.n3.nabble.com/Use-collect-list-and-collect-set-in-Spark-SQL-td26280.html

https://spark.apache.org/docs/1.6.1/api/java/org/apache/spark/sql/functions.html

cryp
  • I looked at it. Doesn't work. I tried importing `import org.apache.spark.sql.hive.HiveContext` and got `object hive is not a member of package org.apache.spark.sql`. – cryp May 25 '16 at 18:26
  • Did you add a dependency on [spark sql](http://mvnrepository.com/artifact/org.apache.spark/spark-sql_2.10)? – Yuval Itzchakov May 25 '16 at 18:29
  • 3
    @cryptX If `collect_list` is only available in the hive context, you have to import `org.apache.spark.sql.hive.HiveContext` **and** the [spark-hive](http://mvnrepository.com/artifact/org.apache.spark/spark-hive_2.10/1.6.1) package must be set as a dependency of your project – Daniel de Paula May 25 '16 at 21:50
  • So I was able to fix it: I had to add a dependency for Hive in my POM for this to work. Thanks all for the help!! – cryp May 26 '16 at 13:54
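For completeness, the missing POM dependency described in the comments would look something like this (version 1.6.1 and the 2.10 Scala suffix match the links above; adjust both to your Spark and Scala versions):

```xml
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-hive_2.10</artifactId>
    <version>1.6.1</version>
</dependency>
```

With this on the classpath, `import org.apache.spark.sql.hive.HiveContext` resolves and `collect_list` becomes available through the Hive context.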

0 Answers