I am running into problem running my script with spark-submit. The main script won't even run because import pymongo_spark
returns ImportError: No module named pymongo_spark
I checked this thread and this thread to try to figure out the issue, but so far there's no result.
My setup:
$HADOOP_HOME
is set to /usr/local/cellar/hadoop/2.7.1
where my hadoop files are
$SPARK_HOME
is set to /usr/local/cellar/apache_spark/1.5.2
I also followed those threads and guide online as close as possible to get
export PYTHONPATH=$SPARK_HOME/libexec/python:$SPARK_HOME/libexec/python/build:$PYTHONPATH
export PATH=$PATH:$HADOOP_HOME/bin
PYTHONPATH=$SPARK_HOME/libexec/python/lib/py4j-0.8.2.1-src.zip:$PYTHONPATH
then I used this piece of code to test in the first thread I linked
from pyspark import SparkContext, SparkConf
import pymongo_spark
pymongo_spark.activate()
def main():
conf = SparkConf().setAppName('pyspark test')
sc = SparkContext(conf=conf)
if __name__ == '__main__':
main()
Then in the terminal, I did:
$SPARK_HOME/bin/spark-submit --jars $HADOOP_HOME/libexec/share/hadoop/mapreduce/mongo-hadoop-r1.4.2-1.4.2.jar --driver-class-path $HADOOP_HOME/libexec/share/hadoop/mapreduce/mongo-hadoop-r1.4.2-1.4.2.jar --master local[4] ~/Documents/pysparktest.py
Where mongo-hadoop-r1.4.2-1.4.2.jar
is the jar I built following this guide
I'm definitely missing things, but I'm not sure where/what I'm missing. I'm running everything locally on Mac OSX El Capitan. Almost sure this doesn't matter, but just wanna add it in.
EDIT:
I also used another jar file mongo-hadoop-1.5.0-SNAPSHOT.jar
, the same problem remains
my command:
$SPARK_HOME/bin/spark-submit --jars $HADOOP_HOME/libexec/share/hadoop/mapreduce/mongo-hadoop-1.5.0-SNAPSHOT.jar --driver-class-path $HADOOP_HOME/libexec/share/hadoop/mapreduce/mongo-hadoop-1.5.0-SNAPSHOT.jar --master local[4] ~/Documents/pysparktest.py