I'm using Pyspark in Zeppelin 0.7.3 with Spark 2.2.0.
In one Zeppelin %pyspark
paragraph I create an RDD using various methods (mapPartitions
and flatMap
) and finally call collect()
to look at the results as a list. I can run this paragraph repeatedly and it always works.
But when I then try to create a Spark DataFrame from that RDD in a second Zeppelin paragraph by using spark.createDataFrame()
and call show()
on that dataframe it works as expected only the first time I execute it! If I run this second paragraph a 2nd, 3rd time, etc I get a long stacktrace ending in this:
File "/usr/lib/python2.7/pickle.py", line 306, in save
rv = reduce(self.proto)
File "/home/lucas/spark/spark-2.2.0-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
answer, self.gateway_client, self.target_id, self.name)
File "/home/lucas/spark/spark-2.2.0-bin-hadoop2.7/python/pyspark/sql/utils.py", line 63, in deco
return f(*a, **kw)
File "/home/lucas/spark/spark-2.2.0-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 323, in get_return_value
format(target_id, ".", name, value))
Py4JError: An error occurred while calling o1571.__getnewargs__. Trace:
py4j.Py4JException: Method __getnewargs__([]) does not exist
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318)
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:326)
at py4j.Gateway.invoke(Gateway.java:272)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
If I re-run the first paragraph that creates the RDD then the first time I run the second paragraph it works; the next time it fails with the same error.
Has anyone else seen such odd behaviour as this and can think what might be causing it?!
The answer to PySpark Throwing error Method __getnewargs__([]) does not exist says not to nest RDDs or pass the SparkContext to the executor nodes. I'm not doing that as far as I can tell. I've also checked that all the args to the function I'm calling in mapPartitions()
are picklable and can be passed in a dummy function.
I haven't tried running this outside of Zeppelin yet in case it's a Zeppelin bug but I'm out of other ideas...