How do I merge several parquet files into one using awswrangler?

Asked Mar 02 '23 at 10:17

Active Mar 02 '23 at 11:15

Viewed 318 times

I am trying to use awswrangler.s3.merge_datasets() with a glob pattern as the source path, but it isn't working for me.

https://aws-sdk-pandas.readthedocs.io/en/stable/stubs/awswrangler.s3.merge_datasets.html

import glob
import awswrangler as wr
wr.s3.merge_datasets(
    source_path=glob.escape(f"s3://my-bucket/data/*/individual_file.parquet"),
    target_path="s3://my-bucket/data/aggregated_files.parquet",
    mode="append",
    use_threads=True,
)

An empty list is returned.

Why doesn't this work? What am I doing wrong? Is there another way?

Thanks!

PS: In fact, this doesn't work even for a single file, globbing aside!

PPS: This answer https://stackoverflow.com/a/65816617/1021819 doesn't work for me.

PPPS: This might be the problem: https://stackoverflow.com/a/64261481/1021819 - but what then is the solution?
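
A possible explanation for the empty result, offered as a sketch rather than a confirmed diagnosis: glob.escape() rewrites wildcard characters so that they match literally, so the * in the source path no longer acts as a wildcard. Separately, the linked documentation describes source_path and target_path as S3 directories, and as far as I can tell merge_datasets() copies objects from one prefix to another rather than combining their contents into one file. The snippet below prints what glob.escape() produces and defines a hypothetical helper, merge_parquet_files (not part of awswrangler), which assumes that wr.s3.read_parquet() expands Unix-style wildcards in its path and that the combined data fits in memory.

```python
import glob

# Same pattern as in the question (bucket and key names are placeholders).
SOURCE_PATTERN = "s3://my-bucket/data/*/individual_file.parquet"

# glob.escape() wraps each wildcard character in brackets so that it is
# matched literally instead of being expanded.
escaped = glob.escape(SOURCE_PATTERN)
print(escaped)  # s3://my-bucket/data/[*]/individual_file.parquet


def merge_parquet_files(source_pattern: str, target_file: str) -> list:
    """Hypothetical helper: read every matching file into one DataFrame,
    then write that DataFrame back to S3 as a single Parquet file."""
    # Imported lazily so the demonstration above runs without awswrangler.
    import awswrangler as wr

    # Assumption: read_parquet accepts a wildcard path and concatenates
    # all matching files; everything is loaded into memory at once.
    df = wr.s3.read_parquet(path=source_pattern)

    # With the default dataset=False, path is the key of a single object.
    result = wr.s3.to_parquet(df=df, path=target_file)
    return result["paths"]
```

With AWS credentials configured, a call might look like merge_parquet_files(SOURCE_PATTERN, "s3://my-bucket/data/aggregated_files.parquet"). For data too large to hold in memory, this approach would need to be replaced by something chunked or by a query engine.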

edited Mar 02 '23 at 11:15

asked Mar 02 '23 at 10:17

jtlz2

0 Answers