Big-data newbie here, though with many years of software engineering experience.
I have several TB of data in gzip-compressed JSON files, from which I want to extract a subset of relevant data and store it as Parquet files in S3 for further analysis and possible transformation.
The files vary in (compressed) size from a few MB to some tens of GB each.
For production purposes I plan on doing the ETL with PySpark in AWS Glue; for exploratory purposes I am playing around in Google Colab.
My first thought was to simply put the gzipped JSON files into a folder, read them into a Spark DataFrame, and perform whatever transformations I needed:
from pyspark.sql.functions import explode

# "multiline" because each file holds one JSON document, not newline-delimited JSON
df_test = spark.read.option("multiline", "true").json('/content/sample_data/test_files/*')
df_test.printSchema()

# keep only the elements of the in_scope array, one row per element
df_test = df_test.select(explode("in_scope").alias("in_scope"))
df_test.count()
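For context, the eventual Glue job would write the extracted rows back out as Parquet in S3, roughly along these lines (the bucket and prefix here are just placeholders):

df_test.write.mode("overwrite").parquet("s3://my-bucket/extracted/in_scope/")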
To my surprise, even a single relatively small file (16 MB compressed) resulted in a memory footprint of nearly 10 GB (according to the RAM tooltip in the Colab notebook), which prompted me to search around for answers and options. However, the information on SO, Medium, and other sites only made things more confusing (possibly because it was written at different points in time).
Questions
- What might be the cause of the high memory usage for such a small file?
- Would it be more efficient to unzip the files using plain old Python or even a Linux shell script, and then process the unzipped JSON files with PySpark?
- Would it be more efficient still to unzip the files in Python, rewrite the desired JSON objects from the in_scope array as JSONL (newline-delimited JSON) files, and then process the unzipped JSONL files with PySpark? (A sketch of the kind of pre-processing I mean follows below.)
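To make the last two questions concrete, this is roughly the pre-processing I have in mind. The input path and the in_scope key are from my snippet above; the output location and the *.gz glob are just illustrative assumptions about how my data is laid out:

import gzip
import json
from pathlib import Path

src_dir = Path("/content/sample_data/test_files")    # gzipped JSON input
dst_dir = Path("/content/sample_data/jsonl_files")   # JSONL output (placeholder path)
dst_dir.mkdir(parents=True, exist_ok=True)

for src in src_dir.glob("*.gz"):
    with gzip.open(src, "rt", encoding="utf-8") as fin, \
         open(dst_dir / (src.stem + ".jsonl"), "w", encoding="utf-8") as fout:
        doc = json.load(fin)                          # parses the whole document in memory
        for obj in doc.get("in_scope", []):           # keep only the relevant array
            fout.write(json.dumps(obj) + "\n")        # one JSON object per line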