
When I use PySpark to load multiple JSON files from S3, the Spark job fails with an error if any of the files is missing.

Caused by: org.apache.hadoop.mapred.InvalidInputException: Input Pattern s3n://example/example/2017-02-18/*.json matches 0 files

This is how I add the paths for the last 5 days to my job with PySpark.

from datetime import date, timedelta

days = 5
x = 0
files = []

# build one glob path per day, going back 5 days from today
while x < days:
    filedate = (date.today() - timedelta(x)).isoformat()
    path = "s3n://example/example/" + filedate + "/*.json"
    files.append(path)
    x += 1

rdd = sc.textFile(",".join(files))
df = sql_context.read.json(rdd, schema)

How can I get PySpark to ignore the missing files and continue with the job?

1 Answer

Use a function that tries to load the path; if the path matches no files, the load fails and the function returns False.

from py4j.protocol import Py4JJavaError

def path_exist(sc, path):
    """Return True if the path (or glob) matches at least one readable file."""
    try:
        rdd = sc.textFile(path)
        rdd.take(1)  # forces Spark to resolve the path; raises if it matches 0 files
        return True
    except Py4JJavaError:
        return False

This lets you check whether each path matches any files before adding it to your list, without having to shell out to the AWS CLI or other S3 tooling.

days = 5
x = 0
files = []

while x < days:
    filedate = (date.today() - timedelta(x)).isoformat()
    path = "s3n://example/example/" + filedate + "/*.json"
    # only keep paths that actually match at least one file
    if path_exist(sc, path):
        files.append(path)
    else:
        print('Path does not exist, skipping: ' + path)
    x += 1

rdd = sc.textFile(",".join(files))
df = sql_context.read.json(rdd, schema)

I found this solution at http://www.learn4master.com/big-data/pyspark/pyspark-check-if-file-exists
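Note that path_exist starts a small Spark job (take(1)) for every candidate path, so it reads a little data just to test existence. If that overhead is a concern, a lighter-weight option is to ask the Hadoop filesystem layer whether the glob matches anything before reading at all. The sketch below is not from the original answer; it assumes the Hadoop classes reachable through sc._jvm, an S3 filesystem already configured for the job, and the helper name glob_exists is made up for illustration.

def glob_exists(sc, pattern):
    # Hypothetical metadata-only check: list matching entries via Hadoop's
    # FileSystem.globStatus instead of reading any file contents.
    hadoop_path = sc._jvm.org.apache.hadoop.fs.Path(pattern)
    fs = hadoop_path.getFileSystem(sc._jsc.hadoopConfiguration())
    statuses = fs.globStatus(hadoop_path)
    # globStatus returns None or an empty array when nothing matches
    return statuses is not None and len(statuses) > 0

Because this only lists object metadata, it avoids launching a Spark job per path.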

    This is a rather bad answer. It only works for non-production systems that read a small amount of data. Imagine doing this on a 50 GB folder: reading entire files just to return true or false? – Alexis Nov 26 '20 at 07:50