
I have Python code with the following third-party dependencies:

import boto3
from warcio.archiveiterator import ArchiveIterator
from warcio.recordloader import ArchiveLoadFailed
import requests
import botocore
from requests_file import FileAdapter
....

I installed the dependencies with pip and verified the installation with pip list. But when I submit the job to Spark, I receive the following error:

ImportError: No module named 'boto3'

    at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:193)
    at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:234)
    at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:152)
    at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:63)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
    at org.apache.spark.api.python.PairwiseRDD.compute(PythonRDD.scala:395)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
    at org.apache.spark.scheduler.Task.run(Task.scala:108)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

The "No module named" error occurs not only for boto3 but also for the other modules.


I tried the following things:

  1. Added SparkContext.addPyFile(".zip files")
  2. Used spark-submit --py-files (see the sketch after this list)
  3. Reinstalled pip
  4. Made sure the environment variable is set (export PYTHONPATH=$SPARK_HOME/python/:$PYTHONPATH) and installed py4j with pip install py4j
  5. Ran the script with plain python instead of spark-submit
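
For reference, attempts 1 and 2 looked roughly like this (a sketch; deps, deps.zip, and my_job.py are placeholder names):

    # install the packages into a local directory and zip them up
    pip install -t ./deps boto3 warcio requests requests-file
    cd deps && zip -r ../deps.zip . && cd ..

    # attempt 1: ship the zip from inside the driver code
    #   sc.addPyFile("deps.zip")

    # attempt 2: ship the zip at submit time
    spark-submit --py-files deps.zip my_job.py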

Software information:

  • Python version: 3.4.3
  • Spark version: 2.2.0
  • Running on AWS EMR (Amazon Linux 2017.09)

2 Answers


Before running spark-submit, open a Python shell and try importing the modules. Also check which Python interpreter (check the python path) opens by default.
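
For example, something along these lines (assuming python is the command you expect Spark to pick up):

    # which interpreter is on the PATH, and which version is it?
    which python
    python --version

    # do the modules resolve in that interpreter?
    python -c "import boto3; print(boto3.__file__)"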

If you can import these modules successfully in the Python shell (the same Python version you are trying to use with spark-submit), check the following:

In which mode are you submitting the application? Try standalone, or if on YARN, try client mode. Also try adding export PYSPARK_PYTHON=(your python path).
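
For example (a sketch; /usr/bin/python3 and my_job.py are placeholders, so point the variable at whichever interpreter actually has the packages installed):

    # tell Spark which Python interpreter to launch for the Python workers
    export PYSPARK_PYTHON=/usr/bin/python3

    spark-submit --master yarn --deploy-mode client my_job.py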

joshi.n

All of the checks mentioned above were fine, but setting PYSPARK_PYTHON is what solved the issue for me.
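
If you don't want to export it in every shell, I believe the same interpreter can also be set at submit time through Spark configuration (available since Spark 2.1, so it applies to 2.2.0; the path and script name are illustrative):

    spark-submit \
      --conf spark.pyspark.python=/usr/bin/python3 \
      --master yarn --deploy-mode client \
      my_job.py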