External dependency for spark job

Question

I am new to big data technologies.I have to run a spark job in cluster mode on EMR. The job is written in python and it has dependencies on several libraries and some other tools. I have already written the script and run it in local client mode.But it arising some dependency issue when I am trying to run it using yarn.How do I manage these dependencies?

Log :

"/mnt/yarn/usercache/hadoop/appcache/application_1511680510570_0144/container_1511680510570_0144_01_000002/pyspark.zip/pyspark/cloudpickle.py", line 711, in subimport
    __import__(name)
ImportError: ('No module named boto3', <function subimport at 0x7f8c3c4f9c80>, ('boto3',))

        at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:193)
        at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:234)
        at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:152)
        at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:63)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
        at org.apache.spark.scheduler.Task.run(Task.scala:108)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)

Could you provide the errors messages here? – Subash Dec 12 '17 at 07:24 — Subash, Dec 12 '17 at 07:24
I have added the error message – Md khirul ashik Dec 12 '17 at 08:43 — Md khirul ashik, Dec 12 '17 at 08:43

Subash · Accepted Answer · 2017-12-14T05:09:33.397

1

It seems you have not installed Boto 3 library. Download the compatible one and install it using below

$ pip install boto3

or python -m pip install --user boto3

Hope this helps.You can refer the link-https://github.com/boto/boto3

Then it seems you have not installed the boot 3 on all executors(nodes). Since, you are running spark, python code is running partly on driver and executors.You need to install the library in all nodes if its yarn.

To install the same.Please refer-How to bootstrap installation of Python modules on Amazon EMR?

edited Dec 14 '17 at 05:09

answered Dec 12 '17 at 09:16

Subash

887
1
8
19

I have installed the boto3 and run the job in local client mode using. It was working fine. I am facing this problem while using it with yarn. – Md khirul ashik Dec 12 '17 at 09:26
Did boto3 installation on all nodes worked for you? – Subash Dec 15 '17 at 07:24
Thanks for your help. I have solved the issue. Is there any way to add bootstrap action on a running cluster using cli ? – Md khirul ashik Dec 17 '17 at 11:02

Subash · Answer 2 · 2017-12-18T14:39:20.193

0

Yes you can-

aws emr create-cluster --bootstrap-actions Path=<>,Name=BootstrapAction1,Args=[arg1,arg2].. --auto-terminate.Please refer below-http://docs.aws.amazon.com/ElasticMapReduce/latest/ManagementGuide/emr-plan-bootstrap.html#bootstrapUses

edited Dec 18 '17 at 14:39

answered Dec 18 '17 at 11:33

Subash

887
1
8
19

External dependency for spark job

2 Answers2