
I am running into a problem running my script with spark-submit. The main script won't even run, because import pymongo_spark returns ImportError: No module named pymongo_spark

I checked this thread and this thread to try to figure out the issue, but so far without any result.

My setup:

$HADOOP_HOME is set to /usr/local/cellar/hadoop/2.7.1, where my Hadoop files are

$SPARK_HOME is set to /usr/local/cellar/apache_spark/1.5.2

I also followed those threads and the guides online as closely as possible, ending up with

export PYTHONPATH=$SPARK_HOME/libexec/python:$SPARK_HOME/libexec/python/build:$PYTHONPATH

export PATH=$PATH:$HADOOP_HOME/bin

PYTHONPATH=$SPARK_HOME/libexec/python/lib/py4j-0.8.2.1-src.zip:$PYTHONPATH
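
As a quick sanity check, independent of spark-submit, the import can also be tested directly from the shell; if the package isn't on the PYTHONPATH, this fails with the same ImportError:

python -c 'import pymongo_spark'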

Then I used this piece of code from the first thread I linked as a test:

from pyspark import SparkContext, SparkConf
import pymongo_spark

pymongo_spark.activate()

def main():
    conf = SparkConf().setAppName('pyspark test')
    sc = SparkContext(conf=conf)
if __name__ == '__main__':
    main()

Then in the terminal, I did:

$SPARK_HOME/bin/spark-submit --jars $HADOOP_HOME/libexec/share/hadoop/mapreduce/mongo-hadoop-r1.4.2-1.4.2.jar --driver-class-path $HADOOP_HOME/libexec/share/hadoop/mapreduce/mongo-hadoop-r1.4.2-1.4.2.jar --master local[4] ~/Documents/pysparktest.py

Where mongo-hadoop-r1.4.2-1.4.2.jar is the jar I built by following this guide.

I'm definitely missing something, but I'm not sure where or what. I'm running everything locally on Mac OS X El Capitan; I'm almost sure this doesn't matter, but I want to mention it anyway.

EDIT:

I also tried another jar file, mongo-hadoop-1.5.0-SNAPSHOT.jar, but the same problem remains.

My command:

$SPARK_HOME/bin/spark-submit --jars $HADOOP_HOME/libexec/share/hadoop/mapreduce/mongo-hadoop-1.5.0-SNAPSHOT.jar --driver-class-path $HADOOP_HOME/libexec/share/hadoop/mapreduce/mongo-hadoop-1.5.0-SNAPSHOT.jar --master local[4] ~/Documents/pysparktest.py
JChao
1 Answer


pymongo_spark is available only in mongo-hadoop 1.5, so it won't work with mongo-hadoop 1.4. To make it importable you also have to add the directory containing the Python package to the PYTHONPATH. If you've built the package yourself, it is located in spark/src/main/python/.

export PYTHONPATH=$PYTHONPATH:$MONGO_SPARK_SRC/src/main/python

where MONGO_SPARK_SRC is the directory containing the Spark Connector source.
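
Putting it together, a minimal sketch of the full invocation, assuming the mongo-hadoop repository is cloned at ~/mongo-hadoop (a hypothetical location; point it at your actual checkout) and reusing the jar path from the question:

# hypothetical checkout location; adjust to wherever you cloned mongo-hadoop
export MONGO_SPARK_SRC=~/mongo-hadoop/spark
export PYTHONPATH=$PYTHONPATH:$MONGO_SPARK_SRC/src/main/python

$SPARK_HOME/bin/spark-submit \
  --jars $HADOOP_HOME/libexec/share/hadoop/mapreduce/mongo-hadoop-1.5.0-SNAPSHOT.jar \
  --driver-class-path $HADOOP_HOME/libexec/share/hadoop/mapreduce/mongo-hadoop-1.5.0-SNAPSHOT.jar \
  --master local[4] \
  ~/Documents/pysparktest.py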

See also Getting Spark, Python, and MongoDB to work together

zero323
  • So that's why... I got it to work (at least the import error doesn't show up now). Thanks a lot! – JChao Jan 14 '16 at 02:27