
I am trying to install PySpark on Colab.

!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q https://www-us.apache.org/dist/spark/spark-2.4.1/spark-2.4.1-bin-hadoop2.7.tgz
!tar xf spark-2.4.1-bin-hadoop2.7.tgz
!pip install -q findspark

After installing the above, I set the environment variables as follows:

import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-2.2.1-bin-hadoop2.7"

After that, I tried to initialize PySpark as follows and ended up with an error.

import findspark
findspark.init()

Error:

IndexError                                Traceback (most recent call last)

<ipython-input-24-4e91d34768ac> in <module>()
      1 import findspark
----> 2 findspark.init()

/usr/local/lib/python3.6/dist-packages/findspark.py in init(spark_home, python_path, edit_rc, edit_profile)
    133     # add pyspark to sys.path
    134     spark_python = os.path.join(spark_home, 'python')
--> 135     py4j = glob(os.path.join(spark_python, 'lib', 'py4j-*.zip'))[0]
    136     sys.path[:0] = [spark_python, py4j]
    137 

IndexError: list index out of range
Munna
  • Possible duplicate of [findspark.init() IndexError: list index out of range error](https://stackoverflow.com/questions/42223498/findspark-init-indexerror-list-index-out-of-range-error) – pault Apr 18 '19 at 15:04
  • @pault Yes, it may be, but I saw that one too and it didn't solve the problem. Moreover, I set this up on Google Colab and set the environment properly, I guess. – Munna Apr 18 '19 at 15:12
  • I'm having this trouble in Google Colab; the solution in the possible duplicate does not work. – Rishiraj Purohit May 31 '19 at 09:16

2 Answers


Can you try setting

os.environ["SPARK_HOME"] = "/content/spark-2.2.1-bin-hadoop2.7"

to the same Spark version as the one you installed above? In your case it would be 2.4.1, not 2.2.1.

os.environ["SPARK_HOME"] = "/content/spark-2.4.1-bin-hadoop2.7"
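For completeness, a minimal sketch of the full sequence after the fix (assuming, as in the question, that Java 8 is at the usual Colab location and Spark 2.4.1 was extracted to /content):

import os
import findspark

# Point at the Java 8 and Spark 2.4.1 installs from the question
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-2.4.1-bin-hadoop2.7"

# findspark adds SPARK_HOME/python and the bundled py4j zip to sys.path
findspark.init()

# Sanity check: start a local SparkSession and print its version
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[*]").getOrCreate()
print(spark.version)  # should print 2.4.1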
firtree

Make sure that your Java and Spark paths (including version) are correct:

os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-2.4.4-bin-hadoop2.7"

Then, as a quick sanity check, list the contents of Colab's sample_data directory to confirm your working directory is /content:

print(os.listdir('./sample_data'))

If you get a list of the sample files, your working directory is /content; with the paths above matching your actual installs, findspark.init() should run without the 'index out of range' error.
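Since the traceback shows findspark.init() failing on its py4j glob (line 135 of findspark.py), a more direct check is to run the same glob yourself; a small sketch, assuming the environment variables set above:

import os
from glob import glob

spark_home = os.environ["SPARK_HOME"]
print(os.path.isdir(spark_home))  # False means SPARK_HOME points at a non-existent directory

# This is the same pattern findspark globs internally; an empty
# list here is exactly what triggers the IndexError
print(glob(os.path.join(spark_home, "python", "lib", "py4j-*.zip")))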

Neville Lusimba