
I've been researching this topic for a few days now and have yet to come up with a working solution. Apologies if this question is a duplicate (I have checked for similar questions and have not quite found the right one).

I have an S3 bucket with about 150 parquet files in it. I have been searching for a dynamic way to bring all of these files into one dataframe (it can be multiple dataframes, if that is more computationally efficient). If all of these parquet files were appended to one dataframe it would be a very large amount of data, so if the answer is simply that I need more computing power, please let me know. I ultimately stumbled across awswrangler and am using the code below, which has been running as expected:

df = wr.s3.read_parquet(path="s3://my-s3-data/folder1/subfolder1/subfolder2/", dataset=True, columns=df_cols, chunked=True)

This code returns a generator object, which I am not sure how to get into a dataframe. I have tried solutions from the linked pages (below) and got various errors, such as an invalid filepath and a length mismatch.

https://newbedev.com/create-a-pandas-dataframe-from-generator

https://aws-data-wrangler.readthedocs.io/en/stable/stubs/awswrangler.s3.read_parquet.html

Create a pandas DataFrame from generator?
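
For reference, the kind of pattern I have been attempting looks roughly like this (a sketch only; I am assuming each chunk yielded by chunked=True is a pandas DataFrame and that pd.concat can consume the generator, which may be exactly where I am going wrong):

import pandas as pd
import awswrangler as wr

# chunked=True makes read_parquet yield DataFrames lazily instead of
# returning one big DataFrame
chunks = wr.s3.read_parquet(
    path="s3://my-s3-data/folder1/subfolder1/subfolder2/",
    dataset=True,
    columns=df_cols,
    chunked=True,
)
df = pd.concat(chunks, ignore_index=True)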

Another solution I tried is from https://www.py4u.net/discuss/140245:

import s3fs
import pyarrow.parquet as pq

fs = s3fs.S3FileSystem()
path = "s3://my-bucket/folder1/subfolder1/subfolder2/"

# Python 3.6 or later
p_dataset = pq.ParquetDataset(
    path,
    filesystem=fs
)
df = p_dataset.read().to_pandas()

This resulted in the error "'AioClientCreator' object has no attribute '_register_lazy_block_unknown_fips_pseudo_regions'".

Lastly, I also tried the multiple-parquet solution from https://newbedev.com/how-to-read-a-list-of-parquet-files-from-s3-as-a-pandas-dataframe-using-pyarrow:

import io

import boto3
import pandas as pd

# Helper from the same source: read a single parquet object into a DataFrame
def pd_read_s3_parquet(key, bucket, s3_client=None, **args):
    if s3_client is None:
        s3_client = boto3.client('s3')
    obj = s3_client.get_object(Bucket=bucket, Key=key)
    return pd.read_parquet(io.BytesIO(obj['Body'].read()), **args)

# Read multiple parquets from a folder on S3 generated by spark
def pd_read_s3_multiple_parquets(filepath, bucket, s3=None,
                                 s3_client=None, verbose=False, **args):
    # filepath is the key prefix inside the bucket, not a full s3:// URI
    if not filepath.endswith('/'):
        filepath = filepath + '/'  # Add '/' to the end
    if s3_client is None:
        s3_client = boto3.client('s3')
    if s3 is None:
        s3 = boto3.resource('s3')
    s3_keys = [item.key for item in s3.Bucket(bucket).objects.filter(Prefix=filepath)
               if item.key.endswith('.parquet')]
    if not s3_keys:
        print('No parquet found in', bucket, filepath)
    elif verbose:
        print('Load parquets:')
        for p in s3_keys:
            print(p)
    dfs = [pd_read_s3_parquet(key, bucket=bucket, s3_client=s3_client, **args)
           for key in s3_keys]
    return pd.concat(dfs, ignore_index=True)

df = pd_read_s3_multiple_parquets('path/to/folder', 'my_bucket')

This one printed "No parquet found" for the path (which I am certain is wrong; the parquet files are all there when I browse the actual S3 bucket), followed by the error "No objects to concatenate".

Any guidance you can provide is greatly appreciated! Again, apologies if this duplicates an existing question. Thank you in advance.

  • [Arrow natively supports S3](https://arrow.apache.org/docs/python/dataset.html#reading-from-cloud-storage) you could try using that instead of s3fs. Also the error that is returned about '_register_lazy_block_unknown_fips_pseudo_regions' sounds like there might be something strange about your environment. – Micah Kornfield Nov 18 '21 at 20:12
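
A minimal sketch of that suggestion, assuming pyarrow's dataset API with Arrow's built-in S3 filesystem instead of s3fs (the path below is a placeholder):

import pyarrow.dataset as ds

# pyarrow.dataset can resolve s3:// URIs with Arrow's native S3
# filesystem, so s3fs is not needed here
dataset = ds.dataset(
    "s3://my-bucket/folder1/subfolder1/subfolder2/",
    format="parquet",
)
df = dataset.to_table().to_pandas()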

1 Answer


AWS Data Wrangler works seamlessly; I have used it.

  • Install via pip or conda.
  • Reading multiple parquet files is a one-liner: see example below.
  • Creds are automatically read from your environment variables.
# this is running on my laptop
import pandas as pd
import awswrangler as wr

# assume multiple parquet files in 's3://mybucket/etc/etc/'
s3_bucket_uri = 's3://mybucket/etc/etc/'

df = wr.s3.read_parquet(path=s3_bucket_uri)

# df is a pandas DataFrame

The AWS docs, with examples that cover your use case, are here: https://aws-data-wrangler.readthedocs.io/en/stable/stubs/awswrangler.s3.read_parquet.html
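
If the combined data turns out to be too large for memory, a rough sketch of the chunked approach you already found: chunked=True yields DataFrames one at a time, so each chunk can be processed and discarded (process_chunk below is a placeholder for whatever you do with each piece):

import awswrangler as wr

s3_bucket_uri = 's3://mybucket/etc/etc/'

# chunked=True returns an iterator of pandas DataFrames instead of one
# big DataFrame, so memory use stays bounded per chunk
for chunk in wr.s3.read_parquet(path=s3_bucket_uri, dataset=True, chunked=True):
    process_chunk(chunk)  # placeholder for your per-chunk logic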