
I am trying to link my S3 bucket to a notebook instance, but I am not able to.

Here is what I know so far:

from sagemaker import get_execution_role

role = get_execution_role
bucket = 'atwinebankloadrisk'
datalocation = 'atwinebankloadrisk'

data_location = 's3://{}/'.format(bucket)
output_location = 's3://{}/'.format(bucket)

To load the data from the bucket:

df_test = pd.read_csv(data_location/'application_test.csv')
df_train = pd.read_csv('./application_train.csv')
df_bureau = pd.read_csv('./bureau_balance.csv')

However, I keep getting errors and am unable to proceed. I haven't found answers that help much.

PS: I am new to AWS.

Atwine Mugume
  • You can pass s3 locations to your training jobs. I have never seen that you can do this with a notebook instance. If you want the s3 data inside your notebook, then just download it via the boto3 s3 client, as in the sketch below. – dennis-w Aug 16 '18 at 13:18
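For reference, a minimal sketch of what this comment suggests, assuming the bucket and file names from the question (adjust the key if your file sits under a prefix):

import boto3
import pandas as pd

# Download the object from the question's bucket to the notebook's local disk.
s3 = boto3.client('s3')
s3.download_file('atwinebankloadrisk', 'application_test.csv', 'application_test.csv')

# Then read the local copy with pandas.
df_test = pd.read_csv('application_test.csv')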

5 Answers


You can load S3 data into an AWS SageMaker notebook using the sample code below. Make sure the Amazon SageMaker role has a policy attached to it that grants access to S3.

[1] https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-roles.html

import boto3 
import botocore 
import pandas as pd 
from sagemaker import get_execution_role 

role = get_execution_role() 

bucket = 'Your_bucket_name' 
data_key = 'your_data_file.csv' 
data_location = 's3://{}/{}'.format(bucket, data_key) 

pd.read_csv(data_location) 
jmao
  • this works, how has nobody upvoted it!! pandas uses s3fs for handling s3 files, source: https://stackoverflow.com/questions/38154040/save-dataframe-to-csv-directly-to-s3-python – Itachi Apr 01 '20 at 12:35

You're trying to use Pandas to read files from S3 - Pandas can read files from your local disk, but it cannot read directly from S3 unless s3fs is installed (see the other answers).
Instead, download the files from S3 to your local disk, then use Pandas to read them.

import boto3
import botocore

BUCKET_NAME = 'my-bucket' # replace with your bucket name
KEY = 'my_image_in_s3.jpg' # replace with your object key

s3 = boto3.resource('s3')

try:
    # download as local file
    s3.Bucket(BUCKET_NAME).download_file(KEY, 'my_local_image.jpg')

    # OR read directly to memory as bytes:
    # bytes = s3.Object(BUCKET_NAME, KEY).get()['Body'].read() 
except botocore.exceptions.ClientError as e:
    if e.response['Error']['Code'] == "404":
        print("The object does not exist.")
    else:
        raise
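Adapted to the CSVs from the question, this would look roughly like the following sketch (bucket and key names taken from the question; the local file names are arbitrary):

import boto3
import pandas as pd

BUCKET_NAME = 'atwinebankloadrisk'  # the question's bucket

s3 = boto3.resource('s3')
# Download the CSV to the notebook's local disk, then read it with pandas.
s3.Bucket(BUCKET_NAME).download_file('application_test.csv', 'application_test.csv')
df_test = pd.read_csv('application_test.csv')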
Gili Nachum

You can use s3fs (https://s3fs.readthedocs.io/en/latest/) to read S3 files directly with pandas. The code below is taken from here:

import os
import pandas as pd
from s3fs.core import S3FileSystem

os.environ['AWS_CONFIG_FILE'] = 'aws_config.ini'

s3 = S3FileSystem(anon=False)
key = 'path/to/your-csv.csv'  # S3 keys use forward slashes
bucket = 'your-bucket-name'

df = pd.read_csv(s3.open('{}/{}'.format(bucket, key), mode='rb'))
dennis-w

In pandas 1.0.5, if you've already provided access to the notebook instance, reading a csv from S3 is as easy as this (https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#reading-remote-files):

df = pd.read_csv('s3://<bucket-name>/<filepath>.csv')

During the notebook setup process I attached a SageMakerFullAccess policy to the notebook instance granting it access to the S3 bucket. You can also do this via the IAM Management console.
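If you prefer to do this programmatically rather than through the console, a hedged sketch using boto3's IAM client is shown below. The AWS-managed policy is named AmazonSageMakerFullAccess; the role name here is a placeholder for your notebook's execution role.

import boto3

iam = boto3.client('iam')
# Attach the AWS-managed AmazonSageMakerFullAccess policy to the execution role.
# 'MyNotebookExecutionRole' is a placeholder; substitute your notebook instance's role name.
iam.attach_role_policy(
    RoleName='MyNotebookExecutionRole',
    PolicyArn='arn:aws:iam::aws:policy/AmazonSageMakerFullAccess'
)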

If you need credentials, there are three ways to provide them (https://s3fs.readthedocs.io/en/latest/#credentials); see the sketch after this list:

  • aws_access_key_id, aws_secret_access_key, and aws_session_token environment variables
  • configuration files such as ~/.aws/credentials
  • for nodes on EC2, the IAM metadata provider
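For example, to supply credentials through environment variables before reading (a minimal sketch; the values are placeholders, not real credentials):

import os
import pandas as pd

# boto3/s3fs pick up these standard variables automatically.
os.environ['AWS_ACCESS_KEY_ID'] = '<your-access-key-id>'
os.environ['AWS_SECRET_ACCESS_KEY'] = '<your-secret-access-key>'

df = pd.read_csv('s3://<bucket-name>/<filepath>.csv')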
openrory
import boto3

# files are referred as objects in S3.  
# file name is referred as key name in S3

def write_to_s3(filename, bucket_name, key):
    with open(filename, 'rb') as f:  # Read in binary mode
        return boto3.Session().resource('s3').Bucket(bucket_name).Object(key).upload_fileobj(f)

# Simply call the write_to_s3 function with the required arguments

write_to_s3('file_name.csv',
            'your-bucket-name',
            'file_name.csv')
ishwardgret