
I am creating a very big file that cannot fit in memory directly, so I have created a bunch of small files in S3 and am writing a script that reads these files and merges them. I am using AWS Data Wrangler to do this.

My code is as follows:

    import logging

    import awswrangler as wr

    logger = logging.getLogger(__name__)

    try:
        # Read the small Parquet files in chunks so the full dataset never has to fit in memory
        dfs = wr.s3.read_parquet(path=input_folder, path_suffix=['.parquet'], chunked=True, use_threads=True)
        for df in dfs:
            # Append each chunk to the target dataset; every call writes a new file
            path = wr.s3.to_parquet(df=df, dataset=True, path=target_path, mode="append")
            logger.info(path)
    except Exception as e:
        logger.error(e, exc_info=True)

The problem is that wr.s3.to_parquet creates a lot of files instead of writing to one file. I also can't remove chunked=True, because otherwise my program fails with an out-of-memory (OOM) error.

How do I make this write a single file to S3?

John Rotenstein
  • Hi @Nirav Nagda did you solve this issue? – bigdataadd Dec 09 '21 at 12:27
  • You are supposed to be able to use https://aws-sdk-pandas.readthedocs.io/en/stable/stubs/awswrangler.s3.merge_datasets.html for this - but I can't get globbing to work – jtlz2 Mar 02 '23 at 10:11
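For context, a call to the awswrangler.s3.merge_datasets API mentioned in that comment might look roughly like the sketch below (the prefixes are placeholders). Note that it copies the source dataset's files into the target prefix rather than concatenating them, so on its own it does not produce a single Parquet file:

    import awswrangler as wr

    # Placeholder prefixes; merge_datasets copies the source dataset's files
    # into the target prefix rather than combining them into one file.
    new_files = wr.s3.merge_datasets(
        source_path="s3://my-bucket/small-files/",
        target_path="s3://my-bucket/merged-dataset/",
        mode="append",
        use_threads=True,
    )
    print(new_files)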

2 Answers


AWS Data Wrangler is writing multiple files because you have specified dataset=True. Removing this flag or switching it to False should do the trick, as long as you are specifying a full path.
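For reference, a minimal sketch of that approach, with a made-up bucket and key (each DataFrame written this way needs its own full key, since the call writes exactly one object):

    import awswrangler as wr
    import pandas as pd

    df = pd.DataFrame({"id": [1, 2, 3]})  # stand-in for one of the chunks from the question

    # With dataset=False (the default), path must be a full object key,
    # and awswrangler writes exactly that one Parquet file.
    wr.s3.to_parquet(
        df=df,
        path="s3://my-bucket/output/merged.parquet",  # placeholder bucket/key
    )

In the chunked loop from the question, each chunk would still need its own key, so this produces one file per call rather than one file overall.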

Abdel Jaidi

I don't believe this is possible. @Abdel Jaidi's suggestion won't work, as mode="append" requires dataset to be True or it will throw an error. I believe that in this case, append has more to do with "appending" the data in Athena or Glue by adding new files to the same folder.

I also don't think this is even possible for Parquet in general. As per this SO post, it's not possible in a local folder, let alone S3. To add to this, Parquet is compressed, and I don't think it would be easy to append rows to a compressed file without loading it all into memory.

I think the only solution is to get a beefy EC2 instance that can handle this.

I'm facing a similar issue, and I think I'm going to just loop over all the small files and create bigger ones. For example, you could append several dataframes together and then rewrite those, but you won't be able to get back to one Parquet file unless you get a machine with enough RAM.
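A minimal sketch of that batching/compaction idea, assuming placeholder prefixes and a batch size chosen so each batch fits in memory:

    import awswrangler as wr

    input_folder = "s3://my-bucket/small-files/"   # placeholder source prefix
    target_path = "s3://my-bucket/compacted/"      # placeholder target prefix
    batch_size = 50                                # number of small files per output file

    files = wr.s3.list_objects(path=input_folder, suffix=".parquet")

    for i in range(0, len(files), batch_size):
        batch = files[i : i + batch_size]
        # Read only this batch of small files; the concatenated result must fit in memory.
        df = wr.s3.read_parquet(path=batch, use_threads=True)
        # Rewrite the batch as a single, larger Parquet file with its own full key.
        wr.s3.to_parquet(df=df, path=f"{target_path}part-{i // batch_size:05d}.parquet")

This gets you fewer, bigger files, but as noted above it still cannot collapse everything into one file unless a single machine can hold the whole dataset.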

jtlz2
Kai Lukowiak