I have multiple files in S3.
import boto3

s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket='cw-milenko-tests', Prefix='Json_gzips'):
    contents = page["Contents"]
    for c in contents:
        if c['Key'].startswith('Json_gzips/tick_calculated_3'):
            print(c['Key'])
Output
Json_gzips/tick_calculated_3_2020-05-27T00-05-51.json.gz
Json_gzips/tick_calculated_3_2020-05-27T00-13-23.json.gz
Json_gzips/tick_calculated_3_2020-05-27T00-17-36.json.gz
Json_gzips/tick_calculated_3_2020-05-27T00-28-10.json.gz
Json_gzips/tick_calculated_3_2020-05-27T00-30-43.json.gz
Json_gzips/tick_calculated_3_2020-05-27T00-34-56.json.gz
Json_gzips/tick_calculated_3_2020-05-27T00-38-29.json.gz
I want to read all these files into a Spark DataFrame, then perform a union and save the result as a single parquet file. I already asked how to define a schema for PySpark and got a useful answer. How should I edit my code?
data = spark.read.json(c['Key'])
Each data frame should be appended to a list, big_data = [data1, data2, ...].
That would let me perform the union:
bigdf = reduce(DataFrame.unionAll, big_data)
How can I fix this?
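This is a minimal sketch of what I think should work, assuming the cluster can reach the bucket via the s3a:// scheme, that spark is an existing SparkSession, and that the output path is just a placeholder:

import boto3
from functools import reduce
from pyspark.sql import DataFrame

bucket = "cw-milenko-tests"
prefix = "Json_gzips"

# Collect full s3a:// paths instead of bare keys so Spark can locate the objects.
s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")
paths = []
for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
    for c in page.get("Contents", []):
        if c["Key"].startswith("Json_gzips/tick_calculated_3"):
            paths.append(f"s3a://{bucket}/{c['Key']}")

# spark.read.json decompresses .gz files transparently; read each path into its
# own DataFrame and union them all.
big_data = [spark.read.json(p) for p in paths]
bigdf = reduce(DataFrame.unionAll, big_data)

# Write the combined data out as a single parquet file (placeholder output path).
bigdf.coalesce(1).write.parquet("s3a://cw-milenko-tests/output/tick_calculated_3.parquet")

Note that spark.read.json also accepts a list of paths directly, so spark.read.json(paths) might avoid the explicit union altogether, but I am not sure which is preferable here.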