I have tried to import another Python file into my current PySpark program using SparkContext, but it gave an error that multiple Spark contexts cannot run at once. Hence I am using SparkSession to import my Python file. My code is:

from pyspark.sql import SparkSession
import os

spark = SparkSession.builder.appName('Recommendation_system').getOrCreate()
txt = spark.addFile('engine.py')
dataset_path = os.path.join('Musical_Instruments_5.json')
app = create_app(txt, dataset_path)

I am getting the following error:

AttributeError: 'SparkSession' object has no attribute 'addFile'

What is the correct way of importing a Python file using SparkSession?

Neha patel

3 Answers


You should use the addFile method of the class:

  pyspark.SparkContext

API reference
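
A minimal sketch of that, reusing the file name from the question: the SparkContext is reachable from the existing SparkSession, so no second context gets created.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('Recommendation_system').getOrCreate()

# addFile/addPyFile are methods of SparkContext, not SparkSession,
# so reach them through the session's underlying context.
spark.sparkContext.addPyFile('engine.py')  # ships engine.py to the executors

For a .py dependency, addPyFile is usually the one you want; plain addFile works for arbitrary files.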

computatma

The answer to this question might depend on whether Spark is running in client or cluster mode, as mentioned in this SO answer. For PySpark, an optimal solution might be to add the --py-files flag to the PYSPARK_SUBMIT_ARGS environment variable, which works in either case. This can be done by pointing to your file as follows:

os.environ['PYSPARK_SUBMIT_ARGS'] = '--py-files "/path/to/file/engine.py" pyspark-shell'

You can even specify the path to a .zip containing multiple files as mentioned in the official Spark documentation here.
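
For example, the same line pointing at a hypothetical deps.zip that bundles several modules would be:

os.environ['PYSPARK_SUBMIT_ARGS'] = '--py-files "/path/to/file/deps.zip" pyspark-shell'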

That works when using PySpark from a notebook environment, for example. A more general solution might be to add the file via the spark.submit.pyFiles option in the Spark config file spark-defaults.conf. This will even work when running your job with spark-submit from the command line. Check the Spark configuration options here for more information.
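
As a rough sketch (the path below is just a placeholder), the corresponding entry in spark-defaults.conf would be:

spark.submit.pyFiles  /path/to/file/engine.py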

josescuderoh

I had the same issue yesterday. I wanted to write Spark jobs in other files and run them using a singleton Spark session, so I did this:

main.py

from pyspark.sql import SparkSession
from job1 import spark_job
import os


if __name__=="__main__":
    print("trying to start spark")

    spark = SparkSession.builder.master('local[*]').appName("PythonPi").getOrCreate()
    sc = spark.sparkContext
    # to be able to use paths.py in job's files
    sc.addPyFile(os.getcwd()+'/spark_jobs/paths.py')
    spark_job(spark, sc)
    spark.stop()

paths.py

import os


def get_path(file=__file__):
    # Resolve the given file name against the current working directory
    # (os.path.join returns absolute paths unchanged)
    return os.path.join(os.getcwd(), file)

job1.py

from paths import get_path

data = [('James','','Smith','1991-04-01','M',3000),
  ('Michael','Rose','','2000-05-19','M',4000),
  ('Robert','','Williams','1978-09-05','M',4000),
  ('Maria','Anne','Jones','1967-12-01','F',4000),
  ('Jen','Mary','Brown','1980-02-17','F',-1)
]

columns = ["firstname","middlename","lastname","dob","gender","salary"]

def spark_job(spark,sc):
    sc.addPyFile(get_path(__file__))
    df = spark.createDataFrame(data=data, schema = columns)
    df.createOrReplaceTempView("PERSON_DATA")
    groupDF = spark.sql("SELECT gender, count(*) from PERSON_DATA group by gender")
    groupDF.show()

This way I was able to call Spark jobs from different files or add .py files to the SparkContext.
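
For reference, since the master is hard-coded to local[*] in the builder, this can be launched as a plain Python script or via spark-submit (assuming main.py sits next to job1.py and a spark_jobs/ folder containing paths.py, as the snippets imply):

python main.py
# or
spark-submit main.py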

Hope it helps someone.