I have Python code with the following third-party dependencies:
import boto3
from warcio.archiveiterator import ArchiveIterator
from warcio.recordloader import ArchiveLoadFailed
import requests
import botocore
from requests_file import FileAdapter
....
I installed the dependencies using pip and confirmed that they were installed correctly by running pip list. However, when I submitted the job to Spark, I received the following error:
ImportError: No module named 'boto3'
at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:193)
at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:234)
at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:152)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:63)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.api.python.PairwiseRDD.compute(PythonRDD.scala:395)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
The "No module named" error occurs not only with 'boto3' but also with the other modules.
I tried the following things:
- Added SparkContext.addPyFile(".zip files")
- Used spark-submit --py-files
- Reinstalled pip
- Made sure the path environment variables were set:
export PYTHONPATH=$SPARK_HOME/python/:$PYTHONPATH
and ran pip install py4j
- Used python instead of spark-submit
Software information:
- Python version: 3.4.3
- Spark version: 2.2.0
- Running on EMR-AWS: Linux version 2017.09
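Since the traceback comes from the executor side (Executor$TaskRunner), one thing that might help narrow this down is checking which Python interpreter is in use and whether it can see the modules. Below is a minimal sketch of such a check, not something I have run on the cluster. The helper name probe_modules and the MODULES list are hypothetical names introduced here; the module names are taken from the imports above. It avoids f-strings so that it should also work on Python 3.4.

```python
import importlib.util
import sys

# Top-level module names taken from the imports at the top of the question.
MODULES = ["boto3", "botocore", "warcio", "requests", "requests_file"]


def probe_modules(names):
    """Report the running interpreter and which of the given modules it can find."""
    found = {}
    for name in names:
        # find_spec returns None when a top-level module is not on sys.path.
        found[name] = importlib.util.find_spec(name) is not None
    return {
        "python": sys.executable,
        "version": tuple(sys.version_info[:3]),
        "found": found,
    }


if __name__ == "__main__":
    # Driver-side check: prints the interpreter path and module availability.
    print(probe_modules(MODULES))
```

Assuming an existing SparkContext named sc, the same function could presumably be run on the executors with something like sc.parallelize(range(4), 4).map(lambda _: probe_modules(MODULES)).collect(). If the "python" path or the "found" values differ from the driver's output, that would suggest pip installed the packages into a different interpreter, or only on the master node, rather than the one the executors use.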