
I am trying to use a function I found in this previous question, Reading multiple csv files from S3 bucket with boto3, but I keep getting ValueError: DataFrame constructor not properly called!

This is the code:

import boto3
import pandas as pd

s3 = boto3.resource('s3', aws_access_key_id=ACCESS_KEY, aws_secret_access_key=SECRET_KEY)
bucket = s3.Bucket('test_bucket')
prefix_objs = bucket.objects.filter(Prefix=prefix)
prefix_df = []
for obj in prefix_objs:
    key = obj.key
    body = obj.get()['Body'].read()
    df = pd.DataFrame(body)  # this line raises the ValueError

When I print body, all I get is a bytes string starting with b'.

TH14
2 Answers


I use this and it works well if all your files are under one prefix path. Basically you create the S3 client, iterate over each object under the prefix, read each file into a DataFrame, and append it to a list so pandas can concatenate them.

import boto3
import pandas as pd

s3 = boto3.client("s3",
                  region_name=region_name,
                  aws_access_key_id=aws_access_key_id,
                  aws_secret_access_key=aws_secret_access_key)

response = s3.list_objects(Bucket="my-bucket",
                           Prefix="datasets/")

df_list = []

# read every object under the prefix into a DataFrame and collect them
for file in response["Contents"]:
    obj = s3.get_object(Bucket="my-bucket", Key=file["Key"])
    obj_df = pd.read_csv(obj["Body"])
    df_list.append(obj_df)

# combine all per-file DataFrames into one
df = pd.concat(df_list)
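
Not part of the original answer, but a related note: s3.list_objects returns at most 1,000 keys per call, so for larger prefixes (e.g. the ~10K files mentioned in the comments below) you may need to paginate. A minimal sketch, assuming the same bucket and prefix names as above:

import boto3
import pandas as pd

s3 = boto3.client("s3")  # assumes credentials are configured as in the answer above

# list_objects caps each response at 1000 keys, so a paginator walks the whole prefix
paginator = s3.get_paginator("list_objects_v2")
pages = paginator.paginate(Bucket="my-bucket", Prefix="datasets/")

df_list = []
for page in pages:
    for file in page.get("Contents", []):
        obj = s3.get_object(Bucket="my-bucket", Key=file["Key"])
        df_list.append(pd.read_csv(obj["Body"]))

df = pd.concat(df_list, ignore_index=True)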
thePurplePython
  • Does anyone have experience using this with a very large number of files? This type of logic (append) is painfully slow for my use case of about 10K files. – user1983682 Feb 19 '20 at 23:11
  • Data volume size? What file format? File size? Are sizes skewed? I would use something like ```spark``` for the data engineering if you are working with terabyte-sized datasets. – thePurplePython Feb 20 '20 at 01:30
  • JSON files, all are small but there are many (~10K at ~1.5KB). I started with Spark; however, I need to get specific files that meet last-modified conditions, which doesn't seem to be an option with Spark, at least not that I can find. – user1983682 Feb 20 '20 at 01:36
  • Yeah, JSON doesn't have predicate pushdown, so it scans the whole directory ... I have had issues with this in S3 ... you could try converting the JSON to Parquet via Lambda, Kinesis, or a Spark Streaming app. – thePurplePython Feb 20 '20 at 13:59

If you install s3fs and fsspec, you can read directly from the S3 location with pd.read_csv, which is much faster than using s3.get_object:

import boto3
import pandas as pd

s3 = boto3.client("s3",
                  region_name=region_name,
                  aws_access_key_id=aws_access_key_id,
                  aws_secret_access_key=aws_secret_access_key)

response = s3.list_objects(Bucket="my-bucket", Prefix="datasets/")

# read each object under the prefix straight from its s3:// URL and concatenate
df = pd.concat([pd.read_csv(f"s3://my-bucket/{file['Key']}") for file in response['Contents']])
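
As a side note (not from the original answer): pd.read_csv on an s3:// path goes through s3fs, which by default looks up credentials from the environment rather than from the boto3 client created above. If you need to pass the same explicit keys, pandas (1.2+) accepts a storage_options dict that is forwarded to fsspec/s3fs. A minimal sketch, assuming the same credential variables as above:

import pandas as pd

# "key" and "secret" are the s3fs credential parameters; pandas forwards
# storage_options to fsspec/s3fs on every read
storage_options = {
    "key": aws_access_key_id,
    "secret": aws_secret_access_key,
}

df = pd.concat(
    [pd.read_csv(f"s3://my-bucket/{file['Key']}", storage_options=storage_options)
     for file in response["Contents"]],
    ignore_index=True,
)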
ronkov