When using PySpark to load multiple JSON files from S3, the whole Spark job fails with the error below if any of the daily files are missing.
Caused by: org.apache.hadoop.mapred.InvalidInputException: Input Pattern s3n://example/example/2017-02-18/*.json matches 0 files
This is how I build the paths for the last 5 days in my PySpark job:
from datetime import date, timedelta

# Build one wildcard path per day for the last 5 days
days = 5
files = []
for x in range(days):
    filedate = (date.today() - timedelta(days=x)).isoformat()
    files.append("s3n://example/example/" + filedate + "/*.json")

rdd = sc.textFile(",".join(files))
df = sql_context.read.json(rdd, schema)
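A workaround I have considered is checking each day's prefix in S3 before adding its path, for example with boto3. This is only a sketch; the bucket name and key prefix here are assumptions derived from the s3n:// URLs above:

import boto3

s3 = boto3.client("s3")
bucket = "example"  # assumed bucket name, taken from the s3n:// URLs above

def prefix_exists(prefix):
    # True if at least one object exists under this key prefix
    resp = s3.list_objects_v2(Bucket=bucket, Prefix=prefix, MaxKeys=1)
    return resp.get("KeyCount", 0) > 0

files = []
for x in range(days):
    filedate = (date.today() - timedelta(days=x)).isoformat()
    if prefix_exists("example/" + filedate + "/"):
        files.append("s3n://" + bucket + "/example/" + filedate + "/*.json")

if files:  # avoid the same error when every day is missing
    rdd = sc.textFile(",".join(files))
    df = sql_context.read.json(rdd, schema)

That means extra listing calls outside of Spark, though, which I'd rather avoid.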
How can I get PySpark to ignore the missing files and continue with the job?