SparkR window function : Error "Task not serializable"

Question

I am trying to test window functions through the Spark SQL module from SparkR. I use Spark 1.6, and I am trying to replicate the example provided by zero323 in two different deploy modes (local and yarn-client).

set.seed(1)

hc <- sparkRHive.init(sc)
sdf <- createDataFrame(hc, data.frame(x=1:12, y=1:3, z=rnorm(12)))
registerTempTable(sdf, "sdf")

query <- sql(hc, "SELECT x, y, z, LAG(z) OVER (PARTITION BY y ORDER BY x) FROM sdf") 
head(query)

##    x y          z        _c3
## 1  1 1 -0.6264538         NA
## 2  4 1  1.5952808 -0.6264538
## 3  7 1  0.4874291  1.5952808
## 4 10 1 -0.3053884  0.4874291
## 5  2 2  0.1836433         NA
## 6  5 2  0.3295078  0.1836433

But in both deployment modes, I get the same error when I execute the Spark action head(query):

16/01/21 18:03:17 ERROR r.RBackendHandler: dfToCols on org.apache.spark.sql.api.r.SQLUtils failed
Error in invokeJava(isStatic = TRUE, className, methodName, ...) :
org.apache.spark.SparkException: Task not serializable
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:304)
at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:294)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:122)
at org.apache.spark.SparkContext.clean(SparkContext.scala:2055)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1.apply(RDD.scala:707)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1.apply(RDD.scala:706)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
at org.apache.spark.rdd.RDD.mapPartitions(RDD.scala:706)
at org.apache.spark.sql.execution.Window.doExecute(Window.scala:245)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:132)
at org.

I ran this HQL query directly in Hive and it works properly. "Normal" queries such as classical_query <- sql(hc, "SELECT * FROM sdf") followed by head(classical_query) also work fine.

Thx

score 0 · Answer 1 · answered Jan 22 '16 at 18:11

I solved my problem. It was just a Spark configuration problem.

I just removed the /usr/hdp/current/hive-client/lib/hive-exec.jar JAR from the spark.driver.extraClassPath variable in the spark-defaults.conf configuration file.
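
As a rough sketch of what that change looks like, the fragment below shows the relevant line before and after. Only the hive-exec.jar path is taken from the setup described above. The /path/to/other.jar entry is a hypothetical placeholder for whatever else may be listed on the classpath, so your own file will look different.

```
# Before (hypothetical: /path/to/other.jar stands in for any other entries)
spark.driver.extraClassPath /path/to/other.jar:/usr/hdp/current/hive-client/lib/hive-exec.jar

# After: hive-exec.jar removed from the driver classpath
spark.driver.extraClassPath /path/to/other.jar
```

The driver classpath is read when the driver JVM starts, so the SparkR session has to be restarted for the change to take effect.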

answered Jan 22 '16 at 18:11

Villo
