
I am trying to use a function I found in this previous question, Reading multiple csv files from S3 bucket with boto3, but I keep getting ValueError: DataFrame constructor not properly called!

This is the code:

import boto3
import pandas as pd

s3 = boto3.resource('s3', aws_access_key_id=ACCESS_KEY, aws_secret_access_key=SECRET_KEY)
bucket = s3.Bucket('test_bucket')
prefix_objs = bucket.objects.filter(Prefix=prefix)
prefix_df = []
for obj in prefix_objs:
    key = obj.key
    body = obj.get()['Body'].read()
    df = pd.DataFrame(body)  # this line raises the ValueError

When I print body, all I get is a bytes string starting with b'.

TH14
2 Answers


I use this and it works well if all your files are under one prefix path. Basically you create the S3 client, iterate over each object under the prefix, read each file into a DataFrame, and append it to a list so pandas can concatenate them.

import boto3
import pandas as pd

s3 = boto3.client("s3",
                  region_name=region_name,
                  aws_access_key_id=aws_access_key_id,
                  aws_secret_access_key=aws_secret_access_key)

response = s3.list_objects(Bucket="my-bucket",
                           Prefix="datasets/")

df_list = []

# read every object under the prefix into a DataFrame and collect them
for file in response["Contents"]:
    obj = s3.get_object(Bucket="my-bucket", Key=file["Key"])
    obj_df = pd.read_csv(obj["Body"])
    df_list.append(obj_df)

# combine all per-file DataFrames into one
df = pd.concat(df_list)
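
Not part of the original answer, but a related note: s3.list_objects returns at most 1,000 keys per call, so for larger prefixes (e.g. the ~10K files mentioned in the comments below) you may need to paginate. A minimal sketch, assuming the same bucket and prefix names as above:

import boto3
import pandas as pd

s3 = boto3.client("s3")  # assumes credentials are configured as in the answer above

# list_objects caps each response at 1000 keys, so a paginator walks the whole prefix
paginator = s3.get_paginator("list_objects_v2")
pages = paginator.paginate(Bucket="my-bucket", Prefix="datasets/")

df_list = []
for page in pages:
    for file in page.get("Contents", []):
        obj = s3.get_object(Bucket="my-bucket", Key=file["Key"])
        df_list.append(pd.read_csv(obj["Body"]))

df = pd.concat(df_list, ignore_index=True)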
thePurplePython
  • Does anyone have experience using this with a very large number of files? This type of logic (append) is painfully slow for my use case of about 10K files. – user1983682 Feb 19 '20 at 23:11
  • Data volume size? What file format? File size? Are sizes skewed? I would use something like ```spark``` for the data engineering if you are working with terabyte-sized datasets. – thePurplePython Feb 20 '20 at 01:30
  • JSON files, all are small but there are many (~10K at ~1.5KB). I started with Spark; however, I need to get specific files that meet last-modified conditions, which doesn't seem to be an option with Spark, at least not that I can find. – user1983682 Feb 20 '20 at 01:36
  • Yeah, JSON doesn't have predicate pushdown, so it scans the whole directory ... I have had issues with this in S3 ... you could try converting the JSON to Parquet via Lambda, Kinesis, or a Spark Streaming app. – thePurplePython Feb 20 '20 at 13:59

If you install s3fs and fsspec, you can read directly from the S3 location with pd.read_csv, which is much faster than using s3.get_object:

import boto3
import pandas as pd

s3 = boto3.client("s3",
                  region_name=region_name,
                  aws_access_key_id=aws_access_key_id,
                  aws_secret_access_key=aws_secret_access_key)

response = s3.list_objects(Bucket="my-bucket", Prefix="datasets/")

# read each object under the prefix straight from its s3:// URL and concatenate
df = pd.concat([pd.read_csv(f"s3://my-bucket/{file['Key']}") for file in response['Contents']])
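
As a side note (not from the original answer): pd.read_csv on an s3:// path goes through s3fs, which by default looks up credentials from the environment rather than from the boto3 client created above. If you need to pass the same explicit keys, pandas (1.2+) accepts a storage_options dict that is forwarded to fsspec/s3fs. A minimal sketch, assuming the same credential variables as above:

import pandas as pd

# "key" and "secret" are the s3fs credential parameters; pandas forwards
# storage_options to fsspec/s3fs on every read
storage_options = {
    "key": aws_access_key_id,
    "secret": aws_secret_access_key,
}

df = pd.concat(
    [pd.read_csv(f"s3://my-bucket/{file['Key']}", storage_options=storage_options)
     for file in response["Contents"]],
    ignore_index=True,
)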
ronkov