34

I am trying to read a CSV file located in an AWS S3 bucket into memory as a pandas dataframe using the following code:

import pandas as pd
import boto

data = pd.read_csv('s3:/example_bucket.s3-website-ap-southeast-2.amazonaws.com/data_1.csv')

In order to give complete access I have set the bucket policy on the S3 bucket as follows:

{
    "Version": "2012-10-17",
    "Id": "statement1",
    "Statement": [
        {
            "Sid": "statement1",
            "Effect": "Allow",
            "Principal": "*",
            "Action": "s3:*",
            "Resource": "arn:aws:s3:::example_bucket"
        }
    ]
}

Unfortunately I still get the following error in Python:

boto.exception.S3ResponseError: S3ResponseError: 405 Method Not Allowed

Could someone explain how to either set the permissions correctly in AWS S3, or configure pandas correctly to import the file? Thanks!

Paul_M
  • Shouldn't there be a double slash after s3? – Don Roby Jun 13 '15 at 12:00
  • 1
    yes, you're right there should be. I also had to change the location of the bucket and file: tripData = pd.read_csv('htps://s3-ap-southeast-2.amazonaws.com/example_bucket/data.csv'). and I had to update the permissions on the individual file. but it works now. cheers. – Paul_M Jun 13 '15 at 23:05
  • 1
    Please add your solution as an Answer to help other Stackoverflow users. – John Rotenstein Jun 15 '15 at 05:19
  • 2
    When using `read_csv` to read files from s3, does pandas first downloads locally to disk and then loads into memory? Or does it streams from the network directly into the memory? – krackoder Apr 05 '16 at 21:46

7 Answers

82

Using pandas 0.20.3

import boto3
import pandas as pd
import sys

if sys.version_info[0] < 3: 
    from StringIO import StringIO # Python 2.x
else:
    from io import StringIO # Python 3.x

client = boto3.client('s3')

bucket_name = 'my_bucket'
object_key = 'my_file.csv'

# fetch the object from S3 and decode its body into a string
csv_obj = client.get_object(Bucket=bucket_name, Key=object_key)
body = csv_obj['Body']
csv_string = body.read().decode('utf-8')

df = pd.read_csv(StringIO(csv_string))
jpobst
  • When I import it this way the df's columns do not appear? – Matt M Oct 19 '17 at 14:42
  • I'm trying this and I'm getting errors in the id and secret key calls to os.environ -- is that something I have to set up in terminal or something? – Zach Oakes Jun 17 '20 at 13:30
  • @ZachOakes Yes, that's something you would have needed to set up. Those two lines assume that your ID and SECRET were previously saved as environment variables, but you don't need to pull them from environment variables. Instead, you can replace those two lines with whatever method you like to get your ID and SECRET into your code. – jpobst Jun 17 '20 at 18:54
  • 2
    Also works for DictReader: `reader = csv.DictReader(io.StringIO(body), fieldnames=fieldnames)` – Aaron Lelevier Jan 21 '21 at 23:52
9

Based on an answer that suggested using smart_open for reading from S3, this is how I used it with pandas:

import os
import pandas as pd
from smart_open import smart_open

aws_key = os.environ['AWS_ACCESS_KEY']
aws_secret = os.environ['AWS_SECRET_ACCESS_KEY']

bucket_name = 'my_bucket'
object_key = 'my_file.csv'

path = 's3://{}:{}@{}/{}'.format(aws_key, aws_secret, bucket_name, object_key)

df = pd.read_csv(smart_open(path))
Jhirschibar
kepler
5

You don't need pandas; you can just use Python's built-in csv library:

import csv

import boto.s3
from boto.s3.key import Key

def read_file(bucket_name, region, remote_file_name, aws_access_key_id, aws_secret_access_key):
    # reads a csv from AWS S3

    # first you establish a connection with your credentials and region id

    conn = boto.s3.connect_to_region(
        region,
        aws_access_key_id=aws_access_key_id,
        aws_secret_access_key=aws_secret_access_key)

    # next you obtain the key of the csv you want to read
    # you will need the bucket name and the csv file name

    bucket = conn.get_bucket(bucket_name, validate=False)
    key = Key(bucket)
    key.key = remote_file_name
    data = key.get_contents_as_string()
    key.close()

    # the contents come back as a single string, so you need to split it
    # the line separator is usually '\r\n'; if not, just read the file normally
    # and find out what it is

    reader = csv.reader(data.split('\r\n'))
    data = []
    header = next(reader)
    for row in reader:
        data.append(row)

    return data
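
A minimal usage sketch (the bucket, region, file name, and credential values below are placeholders, not from the original answer):

rows = read_file('example_bucket', 'ap-southeast-2', 'data_1.csv',
                 'YOUR_ACCESS_KEY_ID', 'YOUR_SECRET_ACCESS_KEY')
print(rows[:5])  # first five data rows; the header row has already been skipped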

Hope it solves your problem, good luck! :)

Community
3

Without pandas (it's a big dependency just to read a csv file folks):

import csv
import boto3
from io import StringIO

client = boto3.client("s3", region_name="eu-west-2")
data = client.get_object(Bucket=bucket, Key=_file)
reader = csv.DictReader(StringIO(data['Body'].read().decode('utf-8')))
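
A quick follow-up sketch of consuming the reader, just to show the shape of the rows:

rows = list(reader)   # each element is a dict keyed by the CSV header row
print(rows[0])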
cs_stackX
2

I eventually realised that you also need to set the permissions on each individual object within the bucket in order to read it. I did that with the following code:

from boto.s3.key import Key

# 'bucket' is the boto Bucket object obtained from your S3 connection
k = Key(bucket)
k.key = 'data_1.csv'
k.set_canned_acl('public-read')

And I also had to modify the address of the bucket in the pd.read_csv command as follows:

data = pd.read_csv('https://s3-ap-southeast-2.amazonaws.com/example_bucket/data_1.csv')
Paul_M
  • How do you modify the address to become a URL that can be read by pandas? – nick_liu Sep 12 '17 at 07:51
  • 3
    You've made this file readable by anyone in the world which most people should probably avoid doing. @jpobst's answer above that provides the correct credentials to read the file is what most folks should do. – Carl Summers Feb 10 '20 at 11:08
1

You can use AWS SDK for Pandas, a library that extends Pandas to work smoothly with AWS data stores, such as S3.

import awswrangler as wr
df = wr.s3.read_csv("s3://bucket/file.csv")
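
If the default credential chain does not pick up your credentials, one option is to pass an explicit boto3 session; a minimal sketch, assuming awswrangler's boto3_session parameter and a made-up profile name:

import boto3
import awswrangler as wr

session = boto3.Session(profile_name="my-profile")  # placeholder profile name
df = wr.s3.read_csv("s3://bucket/file.csv", boto3_session=session)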
Theofilos Papapanagiotou
0

You can try this:

import boto3
import pandas as pd

s3_client = boto3.client(
    "s3",
    aws_access_key_id=ACCESS_KEY_ID,
    aws_secret_access_key=SECRET_ACCESS_KEY,
    endpoint_url=ENDPOINT_URL,
)

response = s3_client.get_object(Bucket=BUCKET_NAME, Key=OBJECT_KEY)
status = response.get("ResponseMetadata", {}).get("HTTPStatusCode")

if status == 200:
    df = pd.read_csv(response.get("Body"))
    print('Successfully read dataframe from S3')
else:
    print(f"Unsuccessful S3 get_object. Status: {status}")
zabidima