
I am trying to run a test for my PySpark code on a local Windows machine. Pytest gets stuck at the line where I create the SparkSession in my test code. Do I have to install/configure Spark on my local machine for pytest to work? The test will eventually run as part of CI/CD; do I have to configure Spark on the build machines as well? I have a related question, but it looks like the issue is not with Visual Studio Code but with pytest (I have the same issue when I run pytest from the command line).

Below is my test code:

# test code

from pyspark.sql import SparkSession, Row, DataFrame

import pytest

def test_poc():
    spark_session = SparkSession.builder.master('local[2]').getOrCreate()  # this line never returns when debugging the test
    spark_session.createDataFrame(data, schema)  # data and schema not shown here

1 Answer


Can you add the terminal output of your PySpark script? It would help us understand where to begin and might give us a clue about the problem in your setup.

At least to see whether you have installed PySpark correctly (you might still need additional checks to be fully sure), you can run a script like the one below, saved in a Python file sample_test.py:

from pyspark import sql


spark = sql.SparkSession.builder \
        .appName("local-spark-session") \
        .getOrCreate()

Running it should print something like the output below:

C:\Users\user\Desktop>python sample_test.py
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).

C:\Users\user\Desktop>SUCCESS: The process with PID 16368 (child process of PID 12664) has been terminated.
SUCCESS: The process with PID 12664 (child process of PID 11736) has been terminated.
SUCCESS: The process with PID 11736 (child process of PID 6800) has been terminated.

And below is a sample pytest test for PySpark, also saved in a file called sample_test.py:

from pyspark import sql


spark = sql.SparkSession.builder \
        .appName("local-spark-session") \
        .getOrCreate()


def test_create_session():
    assert isinstance(spark, sql.SparkSession)
    assert spark.sparkContext.appName == 'local-spark-session'
    assert spark.version == '3.1.2'  # adjust to the pyspark version you have installed
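Note that the hard-coded `spark.version == '3.1.2'` assertion will fail as soon as you upgrade pyspark. If you only care about a minimum version, one stdlib-only alternative is to compare the major/minor components (a sketch; `version_tuple` is a hypothetical helper, not part of pyspark):

```python
def version_tuple(version):
    """Parse a dotted version string like '3.1.2' into a (major, minor) tuple of ints."""
    return tuple(int(part) for part in version.split(".")[:2])


# In the test you could then assert a minimum version instead of an exact one:
# assert version_tuple(spark.version) >= (3, 0)
print(version_tuple("3.1.2"))  # -> (3, 1)
```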

You can run it as below:

C:\Users\user\Desktop>pytest -v sample_test.py
============================================= test session starts =============================================
platform win32 -- Python 3.6.7, pytest-6.2.5, py-1.10.0, pluggy-1.0.0 -- c:\users\user\appdata\local\programs\python\python36\python.exe
cachedir: .pytest_cache
rootdir: C:\Users\user\Desktop
collected 1 item

sample_test.py::test_create_session PASSED                                                               [100%]

============================================== 1 passed in 4.51s ==============================================

C:\Users\user\Desktop>SUCCESS: The process with PID 4752 (child process of PID 9780) has been terminated.
SUCCESS: The process with PID 9780 (child process of PID 8988) has been terminated.
SUCCESS: The process with PID 8988 (child process of PID 20176) has been terminated.

The example above is for Windows. My account is new, so I couldn't reply to your comments. Can you update your question to share any messages/errors from the terminal? And by the way, what OS are you using?

  • From https://stackoverflow.com/questions/53217767/py4j-protocol-py4jerror-org-apache-spark-api-python-pythonutils-getencryptionen. I checked my machine and found an old installation of Spark (the SPARK_HOME environment variable was present), but JAVA_HOME was pointing to the wrong directory. I removed all environment variables related to Spark/Python, and it worked after that. So I assume a Spark installation is not required on the local machine to create a SparkSession; installing the pyspark package is sufficient. – user9297554 Sep 17 '21 at 21:22
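Since stale SPARK_HOME/JAVA_HOME values were the culprit here, a quick way to audit them before launching Spark is a small stdlib-only script (a sketch; the variable list and the `spark_env_report` helper name are illustrative, not part of pyspark):

```python
import os


def spark_env_report(names=("JAVA_HOME", "SPARK_HOME", "HADOOP_HOME", "PYSPARK_PYTHON")):
    """Return the current value (or None) of each Spark-related environment variable."""
    return {name: os.environ.get(name) for name in names}


if __name__ == "__main__":
    for name, value in spark_env_report().items():
        print(f"{name} = {value if value is not None else '<not set>'}")
```

If SPARK_HOME points at a leftover installation, or JAVA_HOME at a missing JDK, clearing or correcting those variables (as the commenter did) is often enough to unblock SparkSession creation.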