
I am trying to run the example pi.py using spark-submit, but I am getting the following error:

Python 3.6.5
[GCC 4.3.4 [gcc-4_3-branch revision 152973]] on linux
Type "help", "copyright", "credits" or "license" for more information.
Traceback (most recent call last):
  File "/var/lib/spark/python/pyspark/shell.py", line 31, in <module>
    from pyspark import SparkConf
  File "/var/lib/spark/python/pyspark/__init__.py", line 110, in <module>
    from pyspark.sql import SQLContext, HiveContext, Row
  File "/var/lib/spark/python/pyspark/sql/__init__.py", line 45, in <module>
    from pyspark.sql.types import Row
  File "/var/lib/spark/python/pyspark/sql/types.py", line 27, in <module>
    import ctypes
  File "Python-3.6.5_suse/lib/python3.6/ctypes/__init__.py", line 7, in <module>
    from _ctypes import Union, Structure, Array
ImportError: libffi.so.4: cannot open shared object file: No such file or directory

I am new to Python and Spark, but when I point PYSPARK_PYTHON in spark-env.sh at an older version of Python such as 3.3.x, it works perfectly fine.
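
For reference, the relevant line in conf/spark-env.sh looks roughly like this (the paths below are only examples, not my real ones):

export PYSPARK_PYTHON=/opt/python-3.3.6/bin/python3         # this works
#export PYSPARK_PYTHON=/opt/Python-3.6.5_suse/bin/python3   # this fails with the libffi.so.4 error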

Am I setting something wrong, or do I need some other library? This looks like a library issue.

Thanks!

  • Have you tried using a virtual environment and then installing the library through `pip` within it? – Ninja Warrior 11 May 26 '18 at 19:07
  • What version of Spark are you using and what kind of cluster do you have? – eliasah May 26 '18 at 20:20
  • Sorry for not being clear earlier. I am using Spark 2.3 on a YARN cluster - Hadoop 2.6 – nullptr May 27 '18 at 18:03
  • @Ninja Warrior 11, I have not tried installing a library using pip. Which library should I add? – nullptr May 27 '18 at 18:07
  • Possible duplicate of [How do I install pyspark for use in standalone scripts?](https://stackoverflow.com/questions/25205264/how-do-i-install-pyspark-for-use-in-standalone-scripts) – Ninja Warrior 11 May 27 '18 at 19:50
  • It's advisable to install modules and run your application through a virtual environment rather than installing library packages into the main Python interpreter. Your problem already has an answer in the link I flagged as a duplicate: "Spark-2.2.0 onwards use `pip install pyspark` to install pyspark in your machine." – Ninja Warrior 11 May 27 '18 at 19:54
  • Okay, thanks @ninja-warrior-11. I will try installing pyspark using pip, but I still don't understand why Python 3.3 would work on Spark 2.3 and not Python 3.6. – nullptr May 27 '18 at 21:54

1 Answer


I found what the problem was! My small YARN cluster has hosts running different OSes (some SUSE, some CentOS), and the PYSPARK_PYTHON I set in spark-env.sh pointed to a single central Python path. On hosts where that build's shared libraries didn't match the OS, it threw the libffi.so error. Checking each host's OS against the Python library path was what helped. Once I set the correct path and ran

./bin/spark-submit --deploy-mode client examples/src/main/python/pi.py

I could verify that the local libraries were picked up properly. I didn't need to install any additional Python libraries such as pyspark or py4j, as suggested in the comments or other answers.
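
For anyone else hitting this, here is a rough sketch of what I ended up with in conf/spark-env.sh on each node (the OS check and the paths are illustrative for my hosts; adjust them to yours):

# conf/spark-env.sh -- sketch only; pick a Python build whose shared
# libraries (e.g. libffi) actually exist on this host's OS
if [ -f /etc/SuSE-release ]; then
    export PYSPARK_PYTHON=/opt/Python-3.6.5_suse/bin/python3
else
    export PYSPARK_PYTHON=/opt/Python-3.6.5_centos/bin/python3
fi
export PYSPARK_DRIVER_PYTHON="$PYSPARK_PYTHON"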
