I am running a PySpark job that reads data from a file if it exists; if not, it creates an empty DataFrame, which is then written out as a file for the next time the job runs.
I have had the same code working in other jobs, but not in this one. Whenever I run it a second time, even though the file is present, it throws an error saying there is no file present and then even deletes the file.
Any information will be helpful. Thanks.
import glob
import os

def load_master_logs(spark, master_path):
    # Verify the master file exists; if not, create an empty DataFrame with the master schema
    file_mask = "part*.csv"
    matches = glob.glob(os.path.join(master_path, file_mask))
    if matches:
        master_file = matches[0]
        master_df = spark.read.csv(master_file, header=True, schema=MASTER_SCHEMA)
    else:
        log_and_send_to_slack("No existing master file found, creating new one")
        master_df = spark.createDataFrame([], schema=MASTER_SCHEMA)
    master_df.cache()
    return master_df
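For completeness, the write-back step at the end of the job is essentially the following (sketched from memory; the function name save_master_logs, the coalesce(1), and mode="overwrite" are the usual pattern rather than a verbatim copy of my code):

def save_master_logs(master_df, master_path):
    # Write the master DataFrame back out so the next run can pick it up.
    # Note: mode="overwrite" clears the target directory before writing;
    # this is my assumption of how the job persists the file, not exact code.
    master_df.coalesce(1).write.csv(master_path, mode="overwrite", header=True)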