
Big-data newb here, though I have many years of software engineering experience.

I have several TB of data in gzip compressed JSON files, from which I want to extract some subset of relevant data and store as parquet files within S3 for further analysis and possible transformation.

The files vary in (compressed) size from a few MB to some tens of GB each.

For production purposes I plan on doing the ETL with PySpark in AWS Glue; for exploratory purposes I am playing around in Google Colab.

My first thought was to just put the gzipped JSON files into a folder, read them into a Spark dataframe, and perform whatever transformations I needed.

from pyspark.sql.functions import explode

# Read all gzipped JSON files in the folder; "multiline" because each file is a single JSON document
df_test = spark.read.option("multiline", "true").json('/content/sample_data/test_files/*')
df_test.printSchema()
# Explode the top-level "in_scope" array into one row per element
df_test = df_test.select(explode("in_scope").alias("in_scope"))
df_test.count()

To my surprise, even a single relatively small file (16MB compressed) resulted in a memory footprint of nearly 10GB (according to the RAM tooltip in the Colab notebook), which sent me searching for answers and options. However, the information on SO, Medium, and other sites only made things more confusing (possibly because the articles were written at different points in time).

Questions

  1. What might be the cause for the high memory usage for such a small file?
  2. Would it be more efficient to unzip the files using plain old Python or even a linux script, and then process the unzipped JSON files with PySpark?
  3. Would it be still more efficient to unzip the files in Python and rewrite the desired JSON objects from the in_scope array as JSONL (newline-delimited JSON) files and process the unzipped JSONL files with PySpark?
teejay

2 Answers

  1. How large are the unzipped files? Gzip does a great job of compressing JSON and text. When you load the gzip files, Spark must uncompress them and keep the results in memory.
  2. Either your process or Spark must pay the price of unzipping the file. And unfortunately you cannot filter out the relevant data until after unzipping, which leads us to:
  3. What would be most efficient is to partition the input data and filter on read, as posted here: Using predicates to filter rows from pyarrow.parquet.ParquetDataset (see the sketch after this list).
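
A minimal sketch of that filter-on-read idea, assuming the relevant subset has already been landed as a partitioned parquet dataset; the path and the column names ("event_date", "country", "payload") are placeholders, not taken from the question:

import pyarrow.parquet as pq

# Placeholder dataset path and columns -- adjust to the real layout
dataset = pq.ParquetDataset(
    "s3://my-bucket/curated/",
    filters=[("event_date", ">=", "2023-01-01"), ("country", "=", "US")],
)
# Only matching row groups / partitions are read, and only the listed columns
table = dataset.read(columns=["event_date", "country", "payload"])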
Papara
  • 1. A 16MB gz file -> 674MB uncompressed, while 33MB gz -> 1.46 GB uncompressed. If these are representative of the whole, that's roughly 42:1 compression. 2. "Either your process or Spark must pay the price of unzipping" - yes, understood. My question is whether anyone has recommendations on which would be more efficient. 3. Sorry, I don't completely understand your suggestion to partition the input data. My input data (which I don't control) is gzipped JSON. I'll partition when I save as parquet, but my question is really about the most efficient way to ingest gzipped JSON. – teejay Jun 07 '23 at 02:29
  • gzip is a bad format for processing because you have to unzip the whole file to seek around in it. Try recompressing with snappy before trying to work on the data. Also, JSON is very inefficient; ideally make step 1 "convert to a better format", maybe using snappy again – stevel Jun 07 '23 at 14:43

For the curious, coming back to this a month later to share what I ended up doing...

@stevel's comment on @Papara's answer pointed me in the right direction. I ended up using a SAX-style streaming JSON parser (I used jsonslicer, but there are others) to split each file into individual JSON objects, with smart-open abstracting away the handling of compression, cloud vs. local file storage, etc. jsonslicer yields individual JSON objects, which I accumulate in a deque. When the deque reaches a certain threshold, I use pyarrow.RecordBatch.from_pylist to create a record batch from it and then pyarrow.parquet.ParquetWriter.write_batch to write the batch to a parquet file. I keep iterating and writing batches until the file is completely processed (rough sketch below).
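
A minimal sketch of that loop, not exact production code; the "in_scope" array path, the URIs, and the batch size are placeholders, and the parquet schema is taken from the first batch, which assumes all objects share the same structure:

from collections import deque

import pyarrow as pa
import pyarrow.parquet as pq
from jsonslicer import JsonSlicer
from smart_open import open as sopen


def gzjson_to_parquet(src_uri, dst_path, batch_size=1000):
    buffer = deque()
    writer = None

    def flush():
        nonlocal writer
        batch = pa.RecordBatch.from_pylist(list(buffer))
        if writer is None:
            # Take the schema from the first batch
            writer = pq.ParquetWriter(dst_path, batch.schema)
        writer.write_batch(batch)
        buffer.clear()

    # smart-open transparently handles gzip and local/S3 URIs
    with sopen(src_uri, "rb") as fh:
        # Yield each element of the top-level "in_scope" array
        for obj in JsonSlicer(fh, ("in_scope", None)):
            buffer.append(obj)
            if len(buffer) >= batch_size:
                flush()
        if buffer:
            flush()  # write the final partial batch
    if writer is not None:
        writer.close()


gzjson_to_parquet("s3://my-bucket/raw/data.json.gz", "out.parquet", batch_size=500)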

The batch size needs to be tuned to the size of the individual JSON objects to keep the memory footprint where you want it.

For my particular data (very large individual JSON objects) I am ending up with somewhat inefficient parquet files with small row-groups, so I need a downstream step to compact these... but that's another story!

teejay