
On a Mac (macOS 10.14.5), I am trying to run PySpark programs in PyCharm (Professional Edition, 2019.2).

I know my simple PySpark program is fine, because when I run it with spark-submit from the terminal, outside PyCharm, using the Spark I installed via brew, it works as expected. I have tried linking PyCharm to this installation of Spark, but I am getting other issues.
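
For reference, here is a minimal sketch of the kind of script involved (`run.py`); only the `SparkContext("local", "SimpleApp")` call is taken from the traceback further down, the rest of the body is purely illustrative:

```python
# Minimal sketch of run.py; only the SparkContext("local", "SimpleApp") call
# is taken from the traceback below -- the rest of the body is illustrative.
from pyspark import SparkContext

sc = SparkContext("local", "SimpleApp")
rdd = sc.parallelize(range(10))            # tiny in-memory dataset
print(rdd.map(lambda x: x * x).sum())      # forces a job to actually run
sc.stop()
```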

I followed multiple sets of instructions online for installing `pyspark` within PyCharm (Preferences -> Project Interpreter) and setting the SPARK_HOME environment variable to the appropriate venv directory (Run -> Edit Configurations -> Environment Variables), for example this Stack Overflow thread: https://stackoverflow.com/questions/34685905/how-to-link-pycharm-with-pyspark. But I get an error message when I run the program:

Failed to find Spark jars directory (/Users/rahul/PycharmProjects/spark-demoII/venv/assembly/target/scala-2.12/jars).
You need to build Spark with the target "package" before running this program.
Traceback (most recent call last):
  File "/Users/rahul/PycharmProjects/spark-demoII/run.py", line 6, in <module>
    sc = SparkContext("local", "SimpleApp")
  File "/Users/rahul/virtualenvs/pyspark/lib/python3.7/site-packages/pyspark/context.py", line 133, in __init__
    SparkContext._ensure_initialized(self, gateway=gateway, conf=conf)
  File "/Users/rahul/virtualenvs/pyspark/lib/python3.7/site-packages/pyspark/context.py", line 316, in _ensure_initialized
    SparkContext._gateway = gateway or launch_gateway(conf)
  File "/Users/rahul/virtualenvs/pyspark/lib/python3.7/site-packages/pyspark/java_gateway.py", line 46, in launch_gateway
    return _launch_gateway(conf)
  File "/Users/rahul/virtualenvs/pyspark/lib/python3.7/site-packages/pyspark/java_gateway.py", line 108, in _launch_gateway
    raise Exception("Java gateway process exited before sending its port number")
Exception: Java gateway process exited before sending its port number

Process finished with exit code 1
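
The first line of that output shows PySpark resolving SPARK_HOME to the venv root and then looking for jars under assembly/target/scala-2.12/, a layout used when Spark is built from source; a pip-installed pyspark keeps its bundled jars under site-packages/pyspark/jars instead (visible in the second traceback below). A quick sanity check, run from the same PyCharm run configuration, shows which directory is actually being picked up (a minimal sketch, not part of the original program):

```python
# Sketch of a sanity check for the "Failed to find Spark jars directory" error:
# print what SPARK_HOME resolves to and whether it actually contains a jars/
# directory. Not part of the original program.
import os

spark_home = os.environ.get("SPARK_HOME", "(not set)")
print("SPARK_HOME:", spark_home)
print("jars/ exists:", os.path.isdir(os.path.join(spark_home, "jars")))
```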

Does anyone know how to get PyCharm to run PySpark programs on a similar machine?

In response to @pissall's suggestion: I had tried that previously, but that version of Spark does not work. I tried it again anyway: after switching to a virtual environment, I did a `pip install pyspark`. To ensure that this version of Spark works, I ran `spark-submit run.py` (outside of PyCharm), and here is the error message:

WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform (file:/Users/rahul/.virtualenvs/test1/lib/python3.7/site-packages/pyspark/jars/spark-unsafe_2.11-2.4.4.jar) to method java.nio.Bits.unaligned()
WARNING: Please consider reporting this to the maintainers of org.apache.spark.unsafe.Platform
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
Exception in thread "main" java.lang.ExceptionInInitializerError
    at org.apache.hadoop.util.StringUtils.<clinit>(StringUtils.java:80)
    at org.apache.hadoop.security.SecurityUtil.getAuthenticationMethod(SecurityUtil.java:611)
    at org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:273)
    at org.apache.hadoop.security.UserGroupInformation.ensureInitialized(UserGroupInformation.java:261)
    at org.apache.hadoop.security.UserGroupInformation.loginUserFromSubject(UserGroupInformation.java:791)
    at org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:761)
    at org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGroupInformation.java:634)
    at org.apache.spark.util.Utils$$anonfun$getCurrentUserName$1.apply(Utils.scala:2422)
    at org.apache.spark.util.Utils$$anonfun$getCurrentUserName$1.apply(Utils.scala:2422)
    at scala.Option.getOrElse(Option.scala:121)
    at org.apache.spark.util.Utils$.getCurrentUserName(Utils.scala:2422)
    at org.apache.spark.SecurityManager.<init>(SecurityManager.scala:79)
    at org.apache.spark.deploy.SparkSubmit.secMgr$lzycompute$1(SparkSubmit.scala:348)
    at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$secMgr$1(SparkSubmit.scala:348)
    at org.apache.spark.deploy.SparkSubmit$$anonfun$prepareSubmitEnvironment$7.apply(SparkSubmit.scala:356)
    at org.apache.spark.deploy.SparkSubmit$$anonfun$prepareSubmitEnvironment$7.apply(SparkSubmit.scala:356)
    at scala.Option.map(Option.scala:146)
    at org.apache.spark.deploy.SparkSubmit.prepareSubmitEnvironment(SparkSubmit.scala:355)
    at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:774)
    at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161)
    at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184)
    at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
    at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:920)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:929)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.StringIndexOutOfBoundsException: begin 0, end 3, length 2
    at java.base/java.lang.String.checkBoundsBeginEnd(String.java:3720)
    at java.base/java.lang.String.substring(String.java:1909)
    at org.apache.hadoop.util.Shell.<clinit>(Shell.java:52)
    ... 25 more
  • try this : https://stackoverflow.com/questions/34685905/how-to-link-pycharm-with-pyspark – Subbu VidyaSekar Oct 17 '19 at 09:57
  • Right, that was one of the links whose directions I followed, but it did not work on my machine. Thanks anyway – I've updated the question to show this link. – Rahul Shetty Oct 17 '19 at 10:19
  • I am using PyCharm too, but I don't need to set an environment variable for this. I set the `Project Interpreter` to my virtualenv Python, and just do a `pip install pyspark`. I have my `SPARK_HOME` environment variable in my zshrc pointing to a manual installation I did from a tarball – pissall Oct 17 '19 at 10:22
  • I tried that previously but that version of Spark does not work. I tried it again anyway: after switching to a virtual environment, I did a `pip install pyspark`. To ensure that this version of Spark works, I ran a `spark-submit run.py` (outside of **PyCharm**), and the error message is shown in the main question. – Rahul Shetty Oct 17 '19 at 11:45

1 Answer


So the reason this was happening was that PySpark had not been updated to work with the latest version of Java. After removing Java version 13, I made sure my Homebrew installation of Spark uses Java 1.8. Then I added the following to the Environment Variables in Run -> Edit Configurations in PyCharm:

SPARK_HOME=/usr/local/Cellar/apache-spark/2.4.4/libexec

With these settings I can run PySpark jobs in PyCharm.
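
As an alternative to the run-configuration variables, the same environment can be set at the top of the script itself before the SparkContext is created. This is a minimal sketch, not part of the original answer; the JAVA_HOME path below is an assumption and should be replaced with whatever `/usr/libexec/java_home -v 1.8` prints on your machine:

```python
# Sketch: set the environment in the script instead of the PyCharm run
# configuration. The JAVA_HOME path below is an assumption -- substitute the
# output of `/usr/libexec/java_home -v 1.8` on your machine.
import os

os.environ["SPARK_HOME"] = "/usr/local/Cellar/apache-spark/2.4.4/libexec"
os.environ["JAVA_HOME"] = "/Library/Java/JavaVirtualMachines/adoptopenjdk-8.jdk/Contents/Home"

from pyspark import SparkContext  # import after the environment is set

sc = SparkContext("local", "SimpleApp")
print(sc.version)   # prints the Spark version the gateway actually started
sc.stop()
```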
