I have 999 .gz files located in an S3 bucket. I wanted to read them all at once and convert the resulting PySpark dataframe into a pandas dataframe, but that was impossible because the files are too large. I am now trying a different approach: read each single .gz file, convert it to a pandas df, reduce the number of columns, and then concatenate everything into one big pandas df.
spark_df = spark.read.json(f"s3a://my_bucket/part-00000.gz")
part-00000.gz is zipped JSON; 00000 is the first one and 00999 is the last one to read. Could you please help me unpack them all and then concatenate the pandas dataframes?
Logic:
Read all json files:
spark_df = spark.read.json(f"s3a://my_bucket/part-00{}.gz")
convert to pandas
pandas_df = spark_df.toPandas()
reduce columns (only a few columns are needed)
pandas_df = pandas_df[["col1","col2","col3"]]
merge all 999 pandas dfs into one: full_df = pd.concat(...) (with a for loop to go through all the pandas dataframes; see the sketch below)
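A minimal sketch of that loop, assuming spark is your existing SparkSession, the parts are numbered consecutively from part-00000.gz upward (adjust the range end to match the actual last part), and col1/col2/col3 stand in for whatever columns you actually need:

import pandas as pd

pandas_dfs = []
for i in range(999):  # adjust the end if the last part is 00999
    # zero-pad the index so part-00001, part-00100, ... resolve correctly
    path = f"s3a://my_bucket/part-{i:05d}.gz"
    spark_df = spark.read.json(path)
    # reduce the columns on the Spark side, before toPandas(), so the driver
    # only has to collect the few columns that are actually needed
    pandas_dfs.append(spark_df.select("col1", "col2", "col3").toPandas())

full_df = pd.concat(pandas_dfs, ignore_index=True)

Selecting the columns before toPandas() (as your EDIT code already does) keeps the amount of data pulled to the driver small, which is the whole point of going file by file.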
This is the logic in my head, but I am having difficulty coding it.
EDIT: I started writing the code, but it does not show me pandas_df:
from pyspark.sql.functions import col, lower, length

for i in range(10, 11):
    df_to_predict = spark.read.json(f"s3a://my_bucket/company_v20_dl/part-000{i}.gz")
    df_to_predict = df_to_predict.select('id', 'summary', 'website')
    df_to_predict = df_to_predict.withColumn('text', lower(col('summary')))
    df_to_predict = df_to_predict.select('id', 'text', 'website')
    df_to_predict = df_to_predict.withColumn("text_length", length("text"))
    df_to_predict.show()
    pandas_df = df_to_predict.toPandas()
    pandas_df.head()
Also, I've noticed this solution will be faulty for part-00001 / part-00100 etc. <- range does not "fill up" the number with leading zeros.
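For the zero-padding, Python's f-string format spec (or str.zfill) handles it, assuming the part numbers are always five digits wide as in the file names above:

i = 7
print(f"part-{i:05d}.gz")             # part-00007.gz
print(f"part-{str(i).zfill(5)}.gz")   # same result with str.zfill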