
I am running a PySpark job that reads data from a file if it exists; if not, it creates an empty DataFrame, which is then written out as a file for the next time the job runs.

I have had the same code working in other jobs, but not this one. Whenever I run it a second time, even though the file is present, it throws an error saying there is no file, and then even deletes it.

Any information will be helpful. Thanks.

import glob
import os

def load_master_logs(spark, master_path):
    # Verify a master file exists; if not, create an empty DataFrame with the schema.
    file_mask = "part*.csv"
    matches = glob.glob(os.path.join(master_path, file_mask))
    if matches:
        master_df = spark.read.csv(matches[0], header=True, schema=MASTER_SCHEMA)
    else:
        log_and_send_to_slack("No existing master file found, creating a new one")
        master_df = spark.createDataFrame([], schema=MASTER_SCHEMA)
    master_df.cache()
    return master_df
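
For context, the write at the end of the job goes back to the same master_path the data was read from, roughly like this (a simplified sketch, not the exact code):

# Sketch of the write step (names assumed): the job overwrites
# master_path, i.e. the same directory it was just read from.
master_df = load_master_logs(spark, master_path)
# ... append the latest log rows to master_df ...
master_df.coalesce(1).write.csv(master_path, header=True, mode="overwrite")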

1 Answer


So I got it working in the end. It seems to be the same issue as Spark SQL SaveMode.Overwrite, getting java.io.FileNotFoundException and requiring 'REFRESH TABLE tableName'. I was able to solve it by writing to a temporary directory, then deleting all files in the desired directory, and then copying across from temp. No idea why that works, though, while deleting and recreating the folder doesn't, so I would be keen to hear the logic from anyone who understands the underlying code well enough. I also still don't know why the original code works fine in a different job.
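
For reference, this is roughly the shape of the workaround on a local filesystem (the function name, temp_path, and coalesce(1) are illustrative, not the exact code; adjust the file handling for HDFS or S3):

import os
import shutil

def replace_master_logs(master_df, master_path, temp_path):
    # Materialise the DataFrame into temp_path first, so nothing under
    # master_path is touched while Spark may still need to read from it.
    master_df.coalesce(1).write.csv(temp_path, header=True, mode="overwrite")

    # Delete the old part files in the target directory.
    os.makedirs(master_path, exist_ok=True)
    for name in os.listdir(master_path):
        os.remove(os.path.join(master_path, name))

    # Copy the freshly written files across from temp.
    for name in os.listdir(temp_path):
        shutil.copy(os.path.join(temp_path, name), os.path.join(master_path, name))

Whatever the underlying reason, the key difference from the original flow is that nothing under master_path is deleted until the new output already exists on disk.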
