
I am creating a very big file that cannot fit in memory directly, so I have created a bunch of small files in S3 and am writing a script that reads these files and merges them. I am using AWS Data Wrangler to do this.

My code is as follows:

    import logging

    import awswrangler as wr

    logger = logging.getLogger(__name__)

    try:
        # Read the small Parquet files in chunks so the full dataset never has to fit in memory
        dfs = wr.s3.read_parquet(path=input_folder, path_suffix=['.parquet'], chunked=True, use_threads=True)
        for df in dfs:
            # Append each chunk to the target dataset; every call writes a new file
            path = wr.s3.to_parquet(df=df, dataset=True, path=target_path, mode="append")
            logger.info(path)
    except Exception as e:
        logger.error(e, exc_info=True)

The problem is that wr.s3.to_parquet creates a lot of files instead of writing to one file. I also can't remove chunked=True, because otherwise my program fails with an out-of-memory (OOM) error.

How do I make this write a single file to S3?

John Rotenstein
  • Hi @Nirav Nagda did you solve this issue? – bigdataadd Dec 09 '21 at 12:27
  • You are supposed to be able to use https://aws-sdk-pandas.readthedocs.io/en/stable/stubs/awswrangler.s3.merge_datasets.html for this - but I can't get globbing to work – jtlz2 Mar 02 '23 at 10:11
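For context, a call to the awswrangler.s3.merge_datasets API mentioned in that comment might look roughly like the sketch below (the prefixes are placeholders). Note that it copies the source dataset's files into the target prefix rather than concatenating them, so on its own it does not produce a single Parquet file:

    import awswrangler as wr

    # Placeholder prefixes; merge_datasets copies the source dataset's files
    # into the target prefix rather than combining them into one file.
    new_files = wr.s3.merge_datasets(
        source_path="s3://my-bucket/small-files/",
        target_path="s3://my-bucket/merged-dataset/",
        mode="append",
        use_threads=True,
    )
    print(new_files)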

2 Answers


AWS Data Wrangler is writing multiple files because you have specified dataset=True. Removing this flag or switching it to False should do the trick, as long as you are specifying a full path.
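For reference, a minimal sketch of that approach, with a made-up bucket and key (each DataFrame written this way needs its own full key, since the call writes exactly one object):

    import awswrangler as wr
    import pandas as pd

    df = pd.DataFrame({"id": [1, 2, 3]})  # stand-in for one of the chunks from the question

    # With dataset=False (the default), path must be a full object key,
    # and awswrangler writes exactly that one Parquet file.
    wr.s3.to_parquet(
        df=df,
        path="s3://my-bucket/output/merged.parquet",  # placeholder bucket/key
    )

In the chunked loop from the question, each chunk would still need its own key, so this produces one file per call rather than one file overall.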

Abdel Jaidi

I don't believe this is possible. @Abdel Jaidi's suggestion won't work, as mode="append" requires dataset to be True or it will throw an error. I believe that in this case, append has more to do with "appending" the data in Athena or Glue by adding new files to the same folder.

I also don't think this is even possible for Parquet in general. As per this SO post, it's not possible in a local folder, let alone S3. To add to this, Parquet is compressed, and I don't think it would be easy to append rows to a compressed file without loading it all into memory.

I think the only solution is to get a beefy EC2 instance that can handle this.

I'm facing a similar issue, and I think I'm going to just loop over all the small files and create bigger ones. For example, you could append several dataframes together and then rewrite those, but you won't be able to get back to one Parquet file unless you get a machine with enough RAM.
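A minimal sketch of that batching/compaction idea, assuming placeholder prefixes and a batch size chosen so each batch fits in memory:

    import awswrangler as wr

    input_folder = "s3://my-bucket/small-files/"   # placeholder source prefix
    target_path = "s3://my-bucket/compacted/"      # placeholder target prefix
    batch_size = 50                                # number of small files per output file

    files = wr.s3.list_objects(path=input_folder, suffix=".parquet")

    for i in range(0, len(files), batch_size):
        batch = files[i : i + batch_size]
        # Read only this batch of small files; the concatenated result must fit in memory.
        df = wr.s3.read_parquet(path=batch, use_threads=True)
        # Rewrite the batch as a single, larger Parquet file with its own full key.
        wr.s3.to_parquet(df=df, path=f"{target_path}part-{i // batch_size:05d}.parquet")

This gets you fewer, bigger files, but as noted above it still cannot collapse everything into one file unless a single machine can hold the whole dataset.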

jtlz2
Kai Lukowiak