Using addPyFile() seems to not be adding the desired files to Spark job nodes (new to Spark, so I may be missing some basic usage knowledge here)
I'm attempting to run a script using PySpark and am seeing errors that certain modules cannot be found on import. I have never used Spark before, but other posts (one from the package in question, https://github.com/cerndb/dist-keras/issues/36#issuecomment-378918484, and https://stackoverflow.com/a/39779271/8236733) recommended zipping the module and adding it to the Spark job via sparkContext.addPyFile("mymodulefiles.zip"), yet I'm still getting the error. The relevant code snippets are:
from distkeras.trainers import *
from distkeras.predictors import *
from distkeras.transformers import *
from distkeras.evaluators import *
from distkeras.utils import *
(where the package I'm importing here can be found at https://github.com/cerndb/dist-keras),
conf = SparkConf()
conf.set("spark.app.name", application_name)
conf.set("spark.master", master) #master='yarn-client'
conf.set("spark.executor.cores", `num_cores`)
conf.set("spark.executor.instances", `num_executors`)
conf.set("spark.locality.wait", "0")
conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer");
if using_spark_2:
    from pyspark.sql import SparkSession
    sc = SparkSession.builder.config(conf=conf) \
        .appName(application_name) \
        .getOrCreate()
    sc.sparkContext.addPyFile("/home/me/Downloads/distkeras.zip") # see https://github.com/cerndb/dist-keras/issues/36#issuecomment-378918484 and https://forums.databricks.com/answers/10207/view.html
    print sc.version
(distkeras.zip being a zipped file of this dir.: https://github.com/cerndb/dist-keras/tree/master/distkeras), and
transformer = OneHotTransformer(output_dim=nb_classes, input_col="label_index", output_col="label")
dataset = transformer.transform(dataset)
"""throwing error...
.....
File "/opt/mapr/spark/spark-2.1.0/python/pyspark/serializers.py", line 458, in loads
return pickle.loads(obj)
ImportError: No module named distkeras.utils
at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:193)
.....
"""
From the docs and examples I could find (http://spark.apache.org/docs/2.1.0/api/python/pyspark.html#pyspark.SparkContext.addPyFile and https://forums.databricks.com/questions/10193/the-proper-way-to-add-in-dependency-py-files.html), the code above seems like it should work (again, I have never used Spark before). Does anyone have any idea what I'm doing wrong here? Is there any more info I could post that would be useful for debugging?
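In case it helps with debugging, one thing that can be checked locally is how the archive is laid out. As far as I understand Python's zip imports, for `import distkeras.utils` to resolve from a zip, the `distkeras/` directory itself has to be a top-level entry in the archive (so the archive contains `distkeras/utils.py`, not a bare `utils.py`). Below is a minimal sketch of building and inspecting such a zip. The helper names `zip_package` and `top_level_entries` are my own and are not part of Spark or dist-keras, and I have not confirmed that the archive layout is the cause of the error above.

```python
import os
import zipfile


def zip_package(package_dir, zip_path):
    """Zip a package so its directory name is the top-level entry in the archive."""
    package_dir = os.path.abspath(package_dir)
    parent = os.path.dirname(package_dir)
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for root, _dirs, files in os.walk(package_dir):
            for name in files:
                full_path = os.path.join(root, name)
                # Store paths relative to the parent directory,
                # e.g. "distkeras/utils.py" rather than "utils.py".
                zf.write(full_path, os.path.relpath(full_path, parent))
    return zip_path


def top_level_entries(zip_path):
    """Return the sorted first path components of every entry in the archive."""
    with zipfile.ZipFile(zip_path) as zf:
        return sorted({name.split("/")[0] for name in zf.namelist()})
```

Calling `top_level_entries("/home/me/Downloads/distkeras.zip")` on the archive from the question should list `distkeras` if the package directory was zipped as a whole. If it lists module files such as `utils.py` directly, the zip was probably built from inside the directory, and rebuilding it with something like `zip_package` on the `distkeras` directory would be the first thing I'd try.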