
I am using Spark 2.4.0 on an AWS cluster. The purpose is ETL, and it is heavily based on Spark SQL using pyspark. I have a number of Python scripts that are invoked in sequence, with data dependencies between them: a main.py invokes the other scripts, such as process1.py, process2.py, etc.

The invocation is done using:

import subprocess

# invoking process 1
command1 = "/anaconda3/bin/python3 process1.py"
p1 = subprocess.Popen(command1.split(), stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
p1.wait()

# invoking process 2
command2 = "/anaconda3/bin/python3 process2.py"
p2 = subprocess.Popen(command2.split(), stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
p2.wait()

Each of these processes (process1.py, process2.py, etc.) does dataframe transformations using SQL-based syntax like:

df_1.createGlobalTempView('table_1')
# global temp views are registered under the global_temp database
result_1 = spark.sql('select * from global_temp.table_1 where <some conditions>')

The challenge is that I want dataframes (like df_1 or result_1) and/or tables (like table_1) to be accessible across the processing sequence. So, for example, if the code above is in process1.py, I want the generated df_1 or table_1 to be accessible in process2.py, as in the sketch below.
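
For illustration, this is roughly what I would like to be able to write in process2.py. It is only a sketch of the desired behaviour (it does not work for me today), reusing the table_1 view registered by process1.py above; df_from_p1 is just an illustrative name:

# process2.py -- desired behaviour (sketch), launched by main.py after process1.py
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("example-spark").getOrCreate()

# ideally this would still resolve the global temp view created in process1.py
df_from_p1 = spark.sql('select * from global_temp.table_1')
df_from_p1.show()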

main.py, process1.py and process2.py each get the Spark session using:

spark = SparkSession.builder.appName("example-spark").config("spark.sql.crossJoin.enabled", "true").getOrCreate()

I know there is the option to use Hive for storing table_1 (roughly the approach sketched below), but I am trying to avoid that scenario if possible.
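
For clarity, the Hive-backed route I am trying to avoid would look roughly like this (a sketch only, reusing df_1 from above; the warehouse/metastore locations would be whatever the cluster is configured with):

# sketch of the Hive-backed approach I would like to avoid
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("example-spark") \
    .enableHiveSupport() \
    .getOrCreate()

# process1.py would persist the dataframe as a managed table in the metastore ...
df_1.write.mode("overwrite").saveAsTable("table_1")

# ... and process2.py would read it back by name
result_1 = spark.table("table_1")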

Thanks a lot for the help!

  • Possible duplicate of [How to cache a Spark data frame and reference it in another script](https://stackoverflow.com/questions/35583493/how-to-cache-a-spark-data-frame-and-reference-it-in-another-script) – 10465355 Jan 18 '19 at 18:33
  • Not really, because @dagspark is already using global views – Andronicus Jan 18 '19 at 19:50
  • Do both Spark jobs have access to the same metastore? If you aren't using Hive as a centralized metastore, you have to make sure that the local metastore is accessible to both jobs, and also correctly synced. I would recommend debugging along that avenue. – Rick Moritz Jan 19 '19 at 17:01
  • Rick - yes I checked (with spark.catalog.listDatabases()) and both of them are using the same metastore: [Database(name='default', description='default database', locationUri='file:/Users/me/Desktop/Spark-stuff/spark-warehouse')]. Everything seems to be in the right place but still no resolution. Thanks for the help. – dagspark Jan 21 '19 at 15:58
  • Actually [this](https://stackoverflow.com/questions/41491972/how-can-i-tear-down-a-sparksession-and-create-a-new-one-within-one-application) provides an implicit answer ... – dagspark Jan 23 '19 at 20:32
