
I've just started to experiment with AWS SageMaker and would like to load data from an S3 bucket into a pandas dataframe in my SageMaker python jupyter notebook for analysis.

I could use boto to grab the data from S3, but I'm wondering whether there is a more elegant method as part of the SageMaker framework to do this in my python code?

A555h55

9 Answers

59
import pandas as pd
from sagemaker import get_execution_role

# the notebook's execution role; reading s3:// paths requires this
# role to have S3 access (the role is not used by read_csv itself)
role = get_execution_role()

bucket = 'my-bucket'
data_key = 'train.csv'
data_location = 's3://{}/{}'.format(bucket, data_key)

df = pd.read_csv(data_location)
Chhoser
48

In the simplest case you don't need boto3, because you are just reading resources; pandas can read an s3:// path directly.
Then it's even simpler:

import pandas as pd

bucket='my-bucket'
data_key = 'train.csv'
data_location = 's3://{}/{}'.format(bucket, data_key)

pd.read_csv(data_location)

But as Prateek stated, make sure to configure your SageMaker notebook instance to have access to S3. This is done at the configuration step, under Permissions > IAM role.

ivankeller
  • With that solution you avoid the credential headache, it's exactly what I was looking for, thank you. – Iakovos Belonias Jan 13 '20 at 18:40
  • I'm getting either a timeout or an Access Denied -- I have a folder between the file and bucket, so added that to the end of bucket or beginning of file -- I'm using root access, and don't think I have any protection on this bucket? Does this (execution role) require an IAM? – Zach Oakes Jun 17 '20 at 14:21
  • Got it -- removing execution_role() fixed it -- great call. I was hoping something like this was available : ) – Zach Oakes Jun 17 '20 at 14:27
12

If you have a look here, it seems you can specify this in the InputDataConfig of a training job. Search for "S3DataSource" (ref) in the document. The first hit is even in Python, on page 25/26.
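For illustration, here is a minimal sketch of what such a configuration might look like when passed to boto3's create_training_job (the channel name, content type, and S3 URI are placeholder assumptions):

import boto3

sm = boto3.client('sagemaker')

# one input channel whose data lives under an S3 prefix
# ('my-bucket/train/' is a placeholder)
input_data_config = [{
    'ChannelName': 'train',
    'DataSource': {
        'S3DataSource': {
            'S3DataType': 'S3Prefix',
            'S3Uri': 's3://my-bucket/train/',
            'S3DataDistributionType': 'FullyReplicated',
        }
    },
    'ContentType': 'text/csv',
}]

# passed along with the other required arguments:
# sm.create_training_job(..., InputDataConfig=input_data_config, ...)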

Jonatan
12

You could also access your bucket as a file system using s3fs:

import s3fs
from IPython.display import display
from PIL import Image

fs = s3fs.S3FileSystem()

# list 5 files in your accessible bucket
fs.ls('s3://bucket-name/data/')[:5]

# open a file directly
with fs.open('s3://bucket-name/data/image.png') as f:
    display(Image.open(f))
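The same file-system object also works for tabular data, e.g. reading a CSV into pandas (the path below is a placeholder):

import pandas as pd

# read a CSV from S3 through the s3fs file object
with fs.open('s3://bucket-name/data/train.csv') as f:
    df = pd.read_csv(f)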
CircleOnCircles
  • What are the advantages / disadvantages over the other way, I wonder – Hack-R Jun 06 '19 at 15:04
  • @Hack-R The pro is that you are able to use the Python file pointer interface/object throughout the code. The con is that this object operates per file, which might not be performance efficient. – CircleOnCircles Jun 14 '19 at 04:37
  • @Ben Thanks for this answer; however it's not working for me. I'm getting this error: `AttributeError: type object 'Image' has no attribute 'open'`. Can you share what library you're using for `Image` or any other details? Thanks! – Mabyn Jan 23 '20 at 19:38
  • Never mind, I just figured it out: `from IPython.display import display; from PIL import Image`. After that, the above worked great. Thanks! – Mabyn Jan 23 '20 at 19:48
5

Do make sure the Amazon SageMaker role has a policy attached to it that grants access to S3. This can be done in IAM.
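If you prefer to do it programmatically, here is a minimal boto3 sketch (the role name is a placeholder, and AWS's managed AmazonS3ReadOnlyAccess policy is just one example):

import boto3

iam = boto3.client('iam')

# attach a managed S3 policy to the notebook's role
# ('MySageMakerRole' is a placeholder for your actual role name)
iam.attach_role_policy(
    RoleName='MySageMakerRole',
    PolicyArn='arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess',
)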

Prateek Dubey
4

You can also use AWS Data Wrangler https://github.com/awslabs/aws-data-wrangler:

import awswrangler as wr

df = wr.s3.read_csv(path="s3://...")
ivankeller
2

A similar answer, using an f-string:

import pandas as pd
bucket = 'your-bucket-name'
file = 'file.csv'
df = pd.read_csv(f"s3://{bucket}/{file}")
len(df)  # row count
Abu Shoeb
0

This code sample imports a CSV file from S3; it was tested in a SageMaker notebook.

Use pip or conda to install s3fs: !pip install s3fs

import pandas as pd
from sagemaker import get_execution_role

# the notebook's execution role needs read access to the bucket
role = get_execution_role()

my_bucket = ''          # declare bucket name
my_file = 'aa/bb.csv'   # declare file path

data_location = 's3://{}/{}'.format(my_bucket, my_file)
data = pd.read_csv(data_location)
data.head(2)
Partha Sen
0

There are multiple ways to read data into SageMaker. To make the response more comprehensive, I am adding details on reading data into a SageMaker Studio notebook in memory, as well as S3 mounting options.

Though notebooks are not recommended for data-intensive modeling and, in my experience, are used more for prototyping, there are multiple ways data can be read into them.

In-Memory Options

  • Boto3
  • S3FS

Both Boto3 and S3FS can be used in conjunction with Python libraries like pandas to read the data in memory, and can also be used to copy the data to the local instance EFS (a minimal boto3 sketch follows).
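As an illustration, a minimal boto3 sketch that downloads an object and parses it in memory (bucket and key are placeholders):

import io

import boto3
import pandas as pd

# fetch the object and parse its bytes in memory
# ('my-bucket' and 'train.csv' are placeholders)
s3 = boto3.client('s3')
obj = s3.get_object(Bucket='my-bucket', Key='train.csv')
df = pd.read_csv(io.BytesIO(obj['Body'].read()))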

Mount Options

These two options provide mount-like behaviour, where the data appears as if it were in a local directory, for higher IO throughput. Both of these options have their pros and cons.