
I'm trying to learn Spark by following some hello-world-level examples like the one below, using pyspark. I get a "Method isBarrier([]) does not exist" error; the full error is included below the code.

from pyspark import SparkContext

if __name__ == '__main__':
    sc = SparkContext('local[6]', 'pySpark_pyCharm')
    rdd = sc.parallelize([1, 2, 3, 4, 5, 6, 7, 8])
    rdd.collect()
    rdd.count()

[Screenshot of the full error traceback: "Method isBarrier([]) does not exist"]

However, when I start a pyspark session directly from the command line and type in the same code, it works fine:

[Screenshot of the same code running successfully in a command-line pyspark session]

My setup:

  • Windows 10 Pro x64
  • Python 3.7.2
  • Spark 2.3.3 with Hadoop 2.7
  • PySpark 2.4.0
Indominus

3 Answers


The problem is an incompatibility between the versions of the Spark JVM libraries and PySpark. In general, the PySpark version has to exactly match the version of your Spark installation (while in theory matching major and minor versions should be enough, incompatibilities have been introduced in maintenance releases in the past).

In other words, Spark 2.3.3 is not compatible with PySpark 2.4.0, and you have to either upgrade Spark to 2.4.0 or downgrade PySpark to 2.3.3.
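For example, you can confirm the mismatch by comparing the version of the pip-installed package against the one reported by your Spark installation (a quick check; spark-submit here is assumed to be the one shipped with your Spark 2.3.3 install):

import pyspark

# Version of the pip-installed PySpark package (2.4.0 in this setup)
print(pyspark.__version__)
# Compare it with the version printed by `spark-submit --version` (2.3.3 here);
# the two must match.

If they differ, either downgrade the package with pip install pyspark==2.3.3 or upgrade Spark itself to 2.4.0.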

Overall, PySpark is not designed to be used as a standalone library. While the PyPI package is a handy development tool (it is often easier to just install a package than to manually extend the PYTHONPATH), for actual deployments it is better to stick with the PySpark bundled with the actual Spark deployment.
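If you want to use the PySpark bundled with the Spark installation instead of the PyPI package, the manual setup roughly amounts to putting Spark's python directory and the bundled py4j zip on sys.path before importing pyspark. A minimal sketch, where the spark_home path is only an example and must be adjusted to your installation:

import glob
import os
import sys

# Example path only -- point this at your actual Spark installation
spark_home = r'C:\spark-2.3.3-bin-hadoop2.7'
os.environ['SPARK_HOME'] = spark_home

# Make the bundled PySpark and py4j importable
sys.path.insert(0, os.path.join(spark_home, 'python'))
# The py4j zip name varies between Spark versions, so look it up
sys.path.insert(0, glob.glob(os.path.join(spark_home, 'python', 'lib', 'py4j-*-src.zip'))[0])

from pyspark import SparkContext  # now resolves to the bundled PySpark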

10465355
  • It worked, thanks! Although that's kind of crazy: how can a package depend on the exact version of another? That's like zero backward compatibility. Do you think it's Spark's problem or PySpark's? By default I just assume it's PySpark's issue, per the dependency direction. – Indominus Mar 04 '19 at 19:05
  • I wouldn't say so. PySpark and Spark as such are not considered separate projects, and their release and development cycles are tightly connected. If there is a problem, it is more in the confusing status of the PySpark package (or other guest-language libraries). These are provided to make development easier (as manual configuration can be tedious, see for example [in Python](https://stackoverflow.com/q/34685905/10465355) or [in SparkR](https://stackoverflow.com/q/31184918/10465355)), but were never intended for production. – 10465355 Mar 04 '19 at 19:15
  • I see, "not considered separate projects", that makes sense then. – Indominus Mar 04 '19 at 19:33

Try starting your Python script/session with

import findspark
findspark.init()

This updates sys.path based on the Spark installation directory. It worked for me.
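If findspark cannot locate Spark on its own (for example, because SPARK_HOME is not set), you can pass the installation directory explicitly; the path below is only an example:

import findspark

# Example path -- replace with your actual Spark installation directory
findspark.init(r'C:\spark-2.3.3-bin-hadoop2.7')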

iggy

Try using Java 8 (instead of newer versions) and also install findspark using

pip install findspark

Then import it at the beginning of your Python script/session:

import findspark
findspark.init()
from pyspark import SparkContext

This worked for me!
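For reference, combining this with the code from the question gives a script roughly like the following (print calls added only so the results are visible when run as a script):

import findspark
findspark.init()  # must run before importing pyspark

from pyspark import SparkContext

if __name__ == '__main__':
    sc = SparkContext('local[6]', 'pySpark_pyCharm')
    rdd = sc.parallelize([1, 2, 3, 4, 5, 6, 7, 8])
    print(rdd.collect())
    print(rdd.count())
    sc.stop()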