34

I am trying to read a CSV file located in an AWS S3 bucket into memory as a pandas dataframe using the following code:

import pandas as pd
import boto

data = pd.read_csv('s3:/example_bucket.s3-website-ap-southeast-2.amazonaws.com/data_1.csv')

In order to give complete access I have set the bucket policy on the S3 bucket as follows:

{
    "Version": "2012-10-17",
    "Id": "statement1",
    "Statement": [
        {
            "Sid": "statement1",
            "Effect": "Allow",
            "Principal": "*",
            "Action": "s3:*",
            "Resource": "arn:aws:s3:::example_bucket"
        }
    ]
}

Unfortunately I still get the following error in Python:

boto.exception.S3ResponseError: S3ResponseError: 405 Method Not Allowed

Could someone explain how to either set the permissions correctly in AWS S3, or configure pandas correctly to import the file? Thanks!

Paul_M
  • Shouldn't there be a double slash after s3? – Don Roby Jun 13 '15 at 12:00
  • 1
    yes, you're right there should be. I also had to change the location of the bucket and file: tripData = pd.read_csv('htps://s3-ap-southeast-2.amazonaws.com/example_bucket/data.csv'). and I had to update the permissions on the individual file. but it works now. cheers. – Paul_M Jun 13 '15 at 23:05
  • 1
    Please add your solution as an Answer to help other Stackoverflow users. – John Rotenstein Jun 15 '15 at 05:19
  • 2
    When using `read_csv` to read files from s3, does pandas first downloads locally to disk and then loads into memory? Or does it streams from the network directly into the memory? – krackoder Apr 05 '16 at 21:46

7 Answers

82

Using pandas 0.20.3

import boto3
import pandas as pd
import sys

if sys.version_info[0] < 3: 
    from StringIO import StringIO # Python 2.x
else:
    from io import StringIO # Python 3.x

client = boto3.client('s3')

bucket_name = 'my_bucket'
object_key = 'my_file.csv'

# fetch the object from S3 and decode its body into a string
csv_obj = client.get_object(Bucket=bucket_name, Key=object_key)
body = csv_obj['Body']
csv_string = body.read().decode('utf-8')

df = pd.read_csv(StringIO(csv_string))
jpobst
  • When I import it this way the df's columns do not appear? – Matt M Oct 19 '17 at 14:42
  • I'm trying this and I'm getting errors in the id and secret key calls to os.environ -- is that something I have to set up in terminal or something? – Zach Oakes Jun 17 '20 at 13:30
  • @ZachOakes Yes, that's something you would have needed to set up. Those two lines assume that your ID and SECRET were previously saved as environment variables, but you don't need to pull them from environment variables. Instead, you can replace those two lines with whatever method you like to get your ID and SECRET into your code. – jpobst Jun 17 '20 at 18:54
  • 2
    Also works for DictReader: `reader = csv.DictReader(io.StringIO(body), fieldnames=fieldnames)` – Aaron Lelevier Jan 21 '21 at 23:52
9

Based on an answer that suggested using smart_open for reading from S3, this is how I used it with pandas:

import os
import pandas as pd
from smart_open import smart_open

aws_key = os.environ['AWS_ACCESS_KEY']
aws_secret = os.environ['AWS_SECRET_ACCESS_KEY']

bucket_name = 'my_bucket'
object_key = 'my_file.csv'

path = 's3://{}:{}@{}/{}'.format(aws_key, aws_secret, bucket_name, object_key)

df = pd.read_csv(smart_open(path))
Jhirschibar
kepler
5

You don't need pandas; you can just use Python's built-in csv library:

import csv

import boto.s3
from boto.s3.key import Key

def read_file(bucket_name, region, remote_file_name, aws_access_key_id, aws_secret_access_key):
    # reads a csv from AWS S3

    # first you establish a connection with your credentials and region id

    conn = boto.s3.connect_to_region(
        region,
        aws_access_key_id=aws_access_key_id,
        aws_secret_access_key=aws_secret_access_key)

    # next you obtain the key of the csv you want to read
    # you will need the bucket name and the csv file name

    bucket = conn.get_bucket(bucket_name, validate=False)
    key = Key(bucket)
    key.key = remote_file_name
    data = key.get_contents_as_string()
    key.close()

    # the contents come back as a single string, so you need to split it
    # the line separator is usually '\r\n'; if not, just read the file normally
    # and find out what it is

    reader = csv.reader(data.split('\r\n'))
    data = []
    header = next(reader)
    for row in reader:
        data.append(row)

    return data
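
A minimal usage sketch (the bucket, region, file name, and credential values below are placeholders, not from the original answer):

rows = read_file('example_bucket', 'ap-southeast-2', 'data_1.csv',
                 'YOUR_ACCESS_KEY_ID', 'YOUR_SECRET_ACCESS_KEY')
print(rows[:5])  # first five data rows; the header row has already been skipped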

Hope it solves your problem, good luck! :)

Community
3

Without pandas (it's a big dependency just to read a csv file folks):

import csv
import boto3
from io import StringIO

client = boto3.client("s3", region_name="eu-west-2")
data = client.get_object(Bucket=bucket, Key=_file)
reader = csv.DictReader(StringIO(data['Body'].read().decode('utf-8')))
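
A quick follow-up sketch of consuming the reader, just to show the shape of the rows:

rows = list(reader)   # each element is a dict keyed by the CSV header row
print(rows[0])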
cs_stackX
2

I eventually realised that you also need to set the permissions on each individual object within the bucket in order to read it. I did that with the following code:

from boto.s3.key import Key

# 'bucket' is the boto Bucket object obtained from your S3 connection
k = Key(bucket)
k.key = 'data_1.csv'
k.set_canned_acl('public-read')

And I also had to modify the address of the bucket in the pd.read_csv command as follows:

data = pd.read_csv('https://s3-ap-southeast-2.amazonaws.com/example_bucket/data_1.csv')
Paul_M
  • How do you modify the address to become a URL that can be read by pandas? – nick_liu Sep 12 '17 at 07:51
  • 3
    You've made this file readable by anyone in the world which most people should probably avoid doing. @jpobst's answer above that provides the correct credentials to read the file is what most folks should do. – Carl Summers Feb 10 '20 at 11:08
1

You can use AWS SDK for Pandas, a library that extends Pandas to work smoothly with AWS data stores, such as S3.

import awswrangler as wr
df = wr.s3.read_csv("s3://bucket/file.csv")
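
If the default credential chain does not pick up your credentials, one option is to pass an explicit boto3 session; a minimal sketch, assuming awswrangler's boto3_session parameter and a made-up profile name:

import boto3
import awswrangler as wr

session = boto3.Session(profile_name="my-profile")  # placeholder profile name
df = wr.s3.read_csv("s3://bucket/file.csv", boto3_session=session)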
Theofilos Papapanagiotou
0

You can try this:

import boto3
import pandas as pd

s3_client = boto3.client(
    "s3",
    aws_access_key_id=ACCESS_KEY_ID,
    aws_secret_access_key=SECRET_ACCESS_KEY,
    endpoint_url=ENDPOINT_URL,
)

response = s3_client.get_object(Bucket=BUCKET_NAME, Key=OBJECT_KEY)
status = response.get("ResponseMetadata", {}).get("HTTPStatusCode")

if status == 200:
    df = pd.read_csv(response.get("Body"))
    print('Successfully read dataframe from S3')
else:
    print(f"Unsuccessful S3 get_object. Status: {status}")
zabidima