
I'm trying to load a large CSV (~5 GB) into pandas from an S3 bucket.

Here is the code I tried for a small CSV of 1.4 kB:

import boto3
import pandas as pd
from io import StringIO

client = boto3.client('s3')
obj = client.get_object(Bucket='grocery', Key='stores.csv')
body = obj['Body']
csv_string = body.read().decode('utf-8')
df = pd.read_csv(StringIO(csv_string))

This works well for a small CSV, but loading a 5 GB CSV into a pandas DataFrame this way is not feasible, probably because reading the whole file into memory as a string via StringIO exhausts the available RAM.

I also tried the code below:

s3 = boto3.client('s3')
obj = s3.get_object(Bucket='bucket', Key='key')
df = pd.read_csv(obj['Body'])

but this gives the error below:

ValueError: Invalid file path or buffer object type: <class 'botocore.response.StreamingBody'>

Any help to resolve this error is much appreciated.

Dileepa Jayakody
  • The recommendation is to use a smaller dataset in your notebook instance and keep the larger datasets for the training jobs. The development cycles in the notebook should be quick, so that your time is spent developing rather than waiting. The notebook instance has only 5 GB of EBS storage, which you can increase if needed, but that is not recommended. – Guy Jan 22 '18 at 06:54
  • Use `df = pd.read_csv(io.BytesIO(obj['Body'].read()))` as mentioned in this answer: https://stackoverflow.com/a/37703861/5238639 – prashanth Jun 25 '18 at 09:23
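A minimal sketch of the workaround from the comment above, with the imports spelled out (it assumes the same bucket and key as in the question; note that `.read()` still pulls the entire object into memory, so it fixes the type error but not the 5 GB memory problem):

import io
import boto3
import pandas as pd

s3 = boto3.client('s3')
obj = s3.get_object(Bucket='grocery', Key='stores.csv')
# Wrap the raw bytes in a file-like BytesIO buffer so pandas can parse it
df = pd.read_csv(io.BytesIO(obj['Body'].read()))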

2 Answers


I know this is quite late but here is an answer:

import boto3
import pandas as pd

bucket = 'sagemaker-dileepa'    # Or whatever you called your bucket
data_key = 'data/stores.csv'    # Where the file is within your bucket
data_location = 's3://{}/{}'.format(bucket, data_key)
df = pd.read_csv(data_location)  # Needs the s3fs library (see the comments below)
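For a ~5 GB file, reading in chunks keeps memory bounded. A minimal sketch along the same lines, assuming the s3fs package is installed (pip install s3fs) and that the per-chunk processing below is just a placeholder for whatever filtering or aggregation actually reduces the data:

import pandas as pd

# Same S3 URL as above
data_location = 's3://sagemaker-dileepa/data/stores.csv'

# chunksize makes read_csv return an iterator of DataFrames instead of one huge frame
parts = []
for chunk in pd.read_csv(data_location, chunksize=100_000):
    # Placeholder: filter or aggregate each chunk here to keep memory usage low
    parts.append(chunk)
df = pd.concat(parts, ignore_index=True)  # Only sensible if the combined result fits in memory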
mish1818
  • Are you sure that another import isn't needed? I tried the above and got the error "ImportError: The s3fs library is required to handle s3 files". – RandomTask Apr 28 '19 at 02:30
  • Did you define the role? See https://stackoverflow.com/questions/48264656/load-s3-data-into-aws-sagemaker-notebook – mish1818 Jul 22 '19 at 13:10

I found that copying the data "locally" to the notebook instance first makes reading the file much faster.
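A minimal sketch of that approach, assuming the bucket and key names from the question and that the notebook's local volume has room for the file:

import boto3
import pandas as pd

s3 = boto3.client('s3')
# Copy the object from S3 to local disk on the notebook instance first
s3.download_file('grocery', 'stores.csv', 'stores.csv')
# Then read it locally (optionally with chunksize for very large files)
df = pd.read_csv('stores.csv')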

Hanan Shteingart