I've been researching this topic for a few days now and have yet to come up with a working solution. Apologies if this question is repetitive (although I have checked for similar questions and have not quite found the right one).
I have an S3 bucket with about 150 parquet files in it. I have been looking for a dynamic way to read all of these files into one DataFrame (or several, if that is more computationally efficient). Appended together, these files would make a very large DataFrame, so if the answer is simply that I need more computing power, please do let me know. I eventually came across awswrangler and am using the code below, which runs without errors:
import awswrangler as wr
# df_cols is a list of column names defined earlier
df = wr.s3.read_parquet(path="s3://my-s3-data/folder1/subfolder1/subfolder2/", dataset=True, columns=df_cols, chunked=True)
This code returns a generator object rather than a DataFrame, and I am not sure how to turn it into one. I have tried the solutions from the pages linked below, and they produced various errors such as an invalid file path and a length mismatch.
https://newbedev.com/create-a-pandas-dataframe-from-generator
https://aws-data-wrangler.readthedocs.io/en/stable/stubs/awswrangler.s3.read_parquet.html
Create a pandas DataFrame from generator?
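For reference, a minimal sketch of the step in question: turning a generator of DataFrames into a single DataFrame with `pd.concat`. It assumes that the generator returned by the chunked read yields one pandas DataFrame per chunk. The names `frames_to_dataframe` and `fake_chunks` are hypothetical, and `fake_chunks` is only a local stand-in so the example runs without S3 access.

```python
import pandas as pd


def frames_to_dataframe(chunks):
    """Concatenate an iterable of DataFrames into one DataFrame."""
    frames = list(chunks)  # consume the generator once
    if not frames:
        # pd.concat raises ValueError on an empty list, so return an empty frame
        return pd.DataFrame()
    return pd.concat(frames, ignore_index=True)


def fake_chunks():
    # Hypothetical stand-in for the generator from the chunked read
    yield pd.DataFrame({"a": [1, 2], "b": ["x", "y"]})
    yield pd.DataFrame({"a": [3], "b": ["z"]})


combined = frames_to_dataframe(fake_chunks())
print(combined.shape)
```

Note that concatenating every chunk holds the full data in memory at once, so for a very large dataset it may be preferable to filter or aggregate each chunk inside a loop and keep only the reduced results.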
Another solution I tried was from https://www.py4u.net/discuss/140245 :
import s3fs
import pyarrow.parquet as pq
fs = s3fs.S3FileSystem()
bucket = "cortex-grm-pdm-auto-design-data"
path = "s3://my-bucket/folder1/subfolder1/subfolder2/"
# Treat every parquet file under the prefix as one dataset
p_dataset = pq.ParquetDataset(
    path,
    filesystem=fs
)
df = p_dataset.read().to_pandas()
which resulted in an error "'AioClientCreator' object has no attribute '_register_lazy_block_unknown_fips_pseudo_regions'"
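An error of this shape may point to mismatched versions of s3fs, aiobotocore and botocore in the environment. That is an assumption, not something confirmed here. A small sketch for listing the installed versions so they can be compared (the helper name `installed_versions` is hypothetical):

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Map each package name to its installed version, or None if absent."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


versions = installed_versions(["s3fs", "aiobotocore", "botocore", "boto3"])
for name, ver in versions.items():
    print(f"{name}: {ver or 'not installed'}")
```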
Lastly, I also tried the multiple-parquet solution from https://newbedev.com/how-to-read-a-list-of-parquet-files-from-s3-as-a-pandas-dataframe-using-pyarrow :
import boto3
import pandas as pd

# Read multiple parquets from a folder on S3 generated by spark
# (pd_read_s3_parquet is the single-file reader, assumed to be defined as in the linked answer)
def pd_read_s3_multiple_parquets(filepath, bucket, s3=None,
                                 s3_client=None, verbose=False, **args):
    if not filepath.endswith('/'):
        filepath = filepath + '/'  # Add '/' to the end
    if s3_client is None:
        s3_client = boto3.client('s3')
    if s3 is None:
        s3 = boto3.resource('s3')
    s3_keys = [item.key for item in s3.Bucket(bucket).objects.filter(Prefix=filepath)
               if item.key.endswith('.parquet')]
    if not s3_keys:
        print('No parquet found in', bucket, filepath)
    elif verbose:
        print('Load parquets:')
        for p in s3_keys:
            print(p)
    dfs = [pd_read_s3_parquet(key, bucket=bucket, s3_client=s3_client, **args)
           for key in s3_keys]
    return pd.concat(dfs, ignore_index=True)

df = pd_read_s3_multiple_parquets('path/to/folder', 'my_bucket')
This one printed that no parquet was found in the path (which I am certain is wrong, since the parquet files are all there when I browse the bucket in S3) and then raised the error "no objects to concatenate".
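One possible cause of an empty listing is the form of the arguments: the function above filters on a key prefix, which has to be relative to the bucket (no `s3://` scheme and no bucket name), and it only keeps keys ending in `.parquet`. The sketch below assumes that is the issue, which may not be the case here. It splits a full S3 URI into the bucket and prefix that the function expects. The helper name `split_s3_uri` is hypothetical.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split 's3://bucket/some/prefix' into (bucket, 'some/prefix/')."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {uri!r}")
    prefix = parsed.path.lstrip("/")  # object keys do not start with '/'
    if prefix and not prefix.endswith("/"):
        prefix += "/"
    return parsed.netloc, prefix


bucket, prefix = split_s3_uri("s3://my-bucket/folder1/subfolder1/subfolder2")
print(bucket, prefix)
```

The resulting values would then be passed as `pd_read_s3_multiple_parquets(prefix, bucket)`. If the objects in the bucket do not carry a `.parquet` extension, the suffix filter would still exclude them.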
Any guidance you can provide is greatly appreciated! Again, apologies for any repetitiveness in my question. Thank you in advance.