
I am trying to run a PySpark unit test in Visual Studio Code on my local Windows machine. When I debug the test, it gets stuck at the line where I am creating a SparkSession. It doesn't show any error or failure; the status bar just shows "Running Tests". Once it works, I can refactor my test to create the SparkSession as part of a test fixture, but at present my test gets stuck at SparkSession creation.

Do I have to install or configure anything on my local machine for SparkSession to work?

I tried a simple test with assert 'a' == 'b' and I can debug and run it successfully, so I assume my pytest configuration is correct. The issue I am facing is specifically with creating the SparkSession.

# test code

from pyspark.sql import SparkSession, Row, DataFrame

import pytest

def test_poc():
    spark_session = SparkSession.builder.master('local[2]').getOrCreate()  # this line never returns when debugging the test
    spark_session.createDataFrame(data, schema)  # data and schema not shown here

Thanks

  • Looks like the issue is not with Visual Studio Code but with PySpark, as I am having the same issue when running pytest from the command line. I created a separate question for that: https://stackoverflow.com/questions/69215648/pytest-for-creating-sparksession-on-local-machine – user9297554 Sep 16 '21 at 22:00
  • From stackoverflow.com/questions/53217767/…: I checked my machine and saw there was an old installation of Spark (a SPARK_HOME environment variable was present), but JAVA_HOME was pointing to the wrong directory. I removed all environment variables related to Spark/Python, and it is working after that. So I assume a Spark installation is not required on the local machine to create a SparkSession; just installing the pyspark package is sufficient. – user9297554 Sep 17 '21 at 21:22
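A quick way to replicate that check and see which Spark-related variables are set before removing them (a minimal sketch; the list of names checked is illustrative, not exhaustive):

import os

# print any environment variables that might shadow the pip-installed pyspark
for var in ("SPARK_HOME", "JAVA_HOME", "PYSPARK_PYTHON", "PYTHONPATH", "HADOOP_HOME"):
    print(f"{var} = {os.environ.get(var, '<not set>')}")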

1 Answer


What I did to make it work:

  1. Create a .env file in the root of the project

  2. Add the following content to the created file:

SPARK_LOCAL_IP=127.0.0.1
JAVA_HOME=<java_path>/jdk/zulu@1.8.192/Contents/Home
SPARK_HOME=<spark_path>/spark-3.0.1-bin-hadoop2.7
PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/python/lib/py4j-0.10.9-src.zip:$PYTHONPATH
  3. Go to the .vscode folder in the project root and open settings.json. Add the following line (replace <workspace_path> with your actual workspace path):
"python.envFile": "<workspace_path>/.env"

After refreshing the Testing section in Visual Studio Code, the setup should succeed.
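Once the session starts correctly, the fixture-based refactor mentioned in the question could look roughly like this (a sketch; the fixture and app names are illustrative):

# conftest.py
import pytest
from pyspark.sql import SparkSession

@pytest.fixture(scope="session")
def spark_session():
    # one SparkSession shared across the whole test run
    session = (
        SparkSession.builder
        .master("local[2]")
        .appName("local-tests")
        .getOrCreate()
    )
    yield session
    session.stop()

Tests then accept spark_session as a parameter instead of building the session inline.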

Note: I use pyenv to set up my Python version, so I had to make sure that VS Code was using the correct Python version with all the expected dependencies installed.

Solution inspired by "py4j.protocol.Py4JError: org.apache.spark.api.python.PythonUtils.getEncryptionEnabled does not exist in the JVM" and https://github.com/microsoft/vscode-python/issues/6594

  • I was not hopeful that this would resolve some of the issues I was having with running PySpark code in the VS Code Testing bar, but it did the trick. Thanks! – dzubke Aug 31 '22 at 14:18