47

I'm trying to read a CSV file from a private S3 bucket into a pandas dataframe:

df = pandas.read_csv('s3://mybucket/file.csv')

I can read a file from a public bucket, but reading a file from a private bucket results in an HTTP 403 (Forbidden) error.

I have configured the AWS credentials using aws configure.

I can download a file from a private bucket using boto3, which uses AWS credentials. It seems that I need to configure pandas to use AWS credentials as well, but I don't know how.
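
For context, a minimal sketch of the kind of boto3 download described above (the local destination path is a placeholder):

import boto3

s3 = boto3.client('s3')  # picks up the credentials set via `aws configure`
s3.download_file('mybucket', 'file.csv', '/tmp/file.csv')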

IgorK
  • 483
  • 1
  • 5
  • 8

12 Answers

51

Pandas uses boto (not boto3) inside read_csv. You might be able to install boto and have it work correctly.

There are some issues with boto and Python 3.4.4 / 3.5.1. If you're on those platforms, and until those are fixed, you can use boto3 as follows:

import boto3
import pandas as pd

s3 = boto3.client('s3')
obj = s3.get_object(Bucket='bucket', Key='key')
df = pd.read_csv(obj['Body'])

That obj['Body'] is a streaming object with a .read method (which returns bytes), and that is enough for pandas.

TomAugspurger
  • 28,234
  • 8
  • 86
  • 69
  • Works great. Two things: 1. I have boto installed and it imports fine as well as pandas, but still I get the 403. 2. 5th line of your code should read obj = s3.get_object...(instead of client.get_object...) – IgorK Mar 04 '16 at 21:18
  • 8
    I am trying this method with the latest version of pandas (0.20) and boto3, and it looks like pandas doesn't like StreamingBody(). Is there a way to convert to StringIO? – spitfiredd May 07 '17 at 16:45
  • 2
    Can you please give an example of your 'bucket' and 'key'? I get a regex error. – Amir Nov 21 '17 at 21:51
  • 'bucket' is for the name of the bucket and 'key' is for the path of the file in the bucket. Using IgorK's example, it would be s3.get_object(Bucket='mybucket', Key='file.csv') – Zolzaya Luvsandorj Jul 14 '21 at 00:31
36

Updated for Pandas 0.20.1

Pandas now uses s3fs to handle S3 connections. From the pandas docs:

pandas now uses s3fs for handling S3 connections. This shouldn’t break any code. However, since s3fs is not a required dependency, you will need to install it separately, like boto in prior versions of pandas.

import os

import pandas as pd
from s3fs.core import S3FileSystem

# aws keys stored in ini file in same path
# refer to boto3 docs for config settings
os.environ['AWS_CONFIG_FILE'] = 'aws_config.ini'

s3 = S3FileSystem(anon=False)
key = 'path/to/your-csv.csv'
bucket = 'your-bucket-name'

df = pd.read_csv(s3.open('{}/{}'.format(bucket, key), mode='rb'))
# or with f-strings
df = pd.read_csv(s3.open(f'{bucket}/{key}', mode='rb'))
spitfiredd
  • 2,897
  • 5
  • 32
  • 75
  • 2
    this is a very convenient way of handling permissions – robertwest May 14 '17 at 02:12
  • I stumbled upon a few file-not-found errors when using this method, even though the file exists in the bucket. It could either be the caching (default_fill_cache when instantiating s3fs) doing its thing, or S3 trying to maintain read consistency because the bucket was not in sync across regions. I haven't yet found out what causes the file-not-found errors. – yash Dec 19 '22 at 19:13
28

Update for pandas 0.22 and up:

If you have already installed s3fs (pip install s3fs) then you can read the file directly from s3 path, without any imports:

data = pd.read_csv('s3://bucket....csv')

stable docs
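
For a private bucket, credentials can also be passed through storage_options, which pandas forwards to s3fs (a sketch assuming a recent pandas, 1.2 or later where storage_options was added, and s3fs; the profile name is a placeholder):

import pandas as pd

# storage_options is passed through to s3fs.S3FileSystem
data = pd.read_csv(
    's3://bucket/file.csv',
    storage_options={'profile': 'my-profile'},  # or {'key': ..., 'secret': ...}
)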

Mike Fogel
  • 3,127
  • 28
  • 22
Isaac
  • 403
  • 4
  • 5
16

Based on this answer, I found smart_open to be much simpler to use:

import pandas as pd
from smart_open import smart_open

initial_df = pd.read_csv(smart_open('s3://bucket/file.csv'))
kepler
  • 1,712
  • 17
  • 18
  • 1
    I don't know if any or all of the other answers are "correct", but I believe you're most correct when you say "smart_open [is] much simpler to use." I had actually just discovered smart_open for this project, and had already developed something of a crush on it... but then I ran into a problem loading a pandas dataframe in AWS Lambda. I was at a loss for what to do until I saw your answer. I love it when I can solve a difficult problem with about 12 characters. Now I'm totally in love with smart_open <3 Thank you :) – Amandalishus Dec 16 '19 at 06:58
  • this does not work with `pd.read_hdf()`. `NotImplementedError: Support for generic buffers has not been implemented.` – crypdick Jun 09 '21 at 00:16
10
import pandas as pd
import boto3
from io import StringIO

# Read CSV (endpoint URL and credentials are placeholders)
s3 = boto3.client(
    's3',
    endpoint_url='<endpoint_url>',
    aws_access_key_id='<aws_access_key_id>',
    aws_secret_access_key='<aws_secret_access_key>',
)
read_file = s3.get_object(Bucket='<bucket>', Key='<key>')
df = pd.read_csv(read_file['Body'], sep=',')

# Write CSV
csv_buffer = StringIO()
df.to_csv(csv_buffer)
s3.put_object(Bucket='<bucket>', Key='<key>', Body=csv_buffer.getvalue())
Saeed Rahman
  • 101
  • 1
  • 3
  • 4
    When answering an old question, your answer would be much more useful to other StackOverflow users if you included some context to explain how your answer helps, particularly for a question that already has an accepted answer. See: [How do I write a good answer](https://stackoverflow.com/help/how-to-answer). – David Buck Dec 09 '19 at 01:22
3

In addition to other awesome answers, if a custom endpoint is required, it is possible to use pd.read_csv('s3://...') syntax by monkey patching the s3fs init method.

import s3fs

s3fsinit = s3fs.S3FileSystem.__init__
def s3fsinit_patched(self, *k, **kw):
    s3fsinit(self, *k, client_kwargs={'endpoint_url': 'https://yourcustomendpoint'}, **kw)
s3fs.S3FileSystem.__init__ = s3fsinit_patched

Or, a more elegant way:

import os
import s3fs

class S3FileSystemPatched(s3fs.S3FileSystem):
    def __init__(self, *k, **kw):
        super(S3FileSystemPatched, self).__init__(*k,
                                                  key=os.environ['aws_access_key_id'],
                                                  secret=os.environ['aws_secret_access_key'],
                                                  client_kwargs={'endpoint_url': 'https://yourcustomendpoint'},
                                                  **kw)
        print('S3FileSystem is patched')

s3fs.S3FileSystem = S3FileSystemPatched

Also see: s3fs custom endpoint url

fmguler
  • 145
  • 6
2

Update for pandas 0.20.3 without using s3fs:

import boto3
import pandas as pd
import sys

if sys.version_info[0] < 3: 
    from StringIO import StringIO # Python 2.x
else:
    from io import StringIO # Python 3.x

s3 = boto3.client('s3')
obj = s3.get_object(Bucket='bucket', Key='key')
body = obj['Body']
csv_string = body.read().decode('utf-8')

df = pd.read_csv(StringIO(csv_string))
jpobst
  • 3,491
  • 2
  • 25
  • 24
2
import s3fs
import pandas as pd

s3 = s3fs.S3FileSystem(profile='<profile_name>')
df = pd.read_csv(s3.open('<s3_path>'))

You can also manually use credentials.
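
For example, a minimal sketch of passing credentials directly to s3fs (the key and secret values are placeholders you would supply yourself):

import s3fs
import pandas as pd

# Pass credentials explicitly instead of relying on a profile or environment
s3 = s3fs.S3FileSystem(
    key='<aws_access_key_id>',         # placeholder
    secret='<aws_secret_access_key>',  # placeholder
)
df = pd.read_csv(s3.open('<s3_path>'))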

Ze Tang
  • 49
  • 3
0

Note that if your bucket is private AND on an AWS-like provider, you will run into errors, as s3fs does not load the profile config file at ~/.aws/config the way awscli does.

One solution is to define the following environment variables:

export AWS_S3_ENDPOINT="myEndpoint"
export AWS_DEFAULT_REGION="MyRegion"
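
The same variables can also be set from Python (a sketch that simply mirrors the shell exports above, assuming, as this answer states, that your s3fs version picks them up; the values are placeholders):

import os
import pandas as pd

# Mirror the shell exports above; must be set before the first S3 access
os.environ['AWS_S3_ENDPOINT'] = 'myEndpoint'
os.environ['AWS_DEFAULT_REGION'] = 'MyRegion'

df = pd.read_csv('s3://mybucket/file.csv')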
MCMZL
  • 1,078
  • 9
  • 21
0

The previous answers are a good basic start, but I wanted to achieve the advanced objectives stated below. Overall, I feel awswrangler is the way to go.

  1. read .gzip
  2. read only the first 5 lines without downloading the full file
  3. explicitly pass credentials (make sure you don't commit them to code!!)
  4. use full s3 paths

Here are a couple of things that worked:


import boto3
import pandas as pd
import awswrangler as wr

boto3_creds = dict(region_name="us-east-1", aws_access_key_id='', aws_secret_access_key='')
boto3.setup_default_session(**boto3_creds)

s3 = boto3.client('s3')

# read first 5 lines from file path
obj = s3.get_object(Bucket='bucket', Key='path.csv.gz')
df = pd.read_csv(obj['Body'], nrows=5, compression='gzip')

# read first 5 lines from directory
dft_xp = pd.concat(list(wr.s3.read_csv(wr.s3.list_objects('s3://bucket/path/')[0], chunksize=5, nrows=5, compression='gzip')))

# read all files into pandas
df_xp = wr.s3.read_csv(wr.s3.list_objects('s3://bucket/path/'), compression='gzip')

I did not use s3fs; I wasn't sure whether it uses boto3.

For distributed compute with dask, the following worked, but it uses s3fs afaik, and apparently gzip can't be parallelized.


import dask.dataframe as dd

dd.read_csv('s3://bucket/path/*', storage_options={'key':'', 'secret':''}, compression='gzip').head(5)

dd.read_csv('s3://bucket/path/*', storage_options={'key':'', 'secret':''}, compression='gzip') 
# Warning gzip compression does not support breaking apart files Please ensure that each individual file can fit in memory
 

citynorman
  • 4,918
  • 3
  • 38
  • 39
0

To add on to the other solutions here with more recent updates: pandas, fsspec, and s3fs have all been updated such that you can read directly from custom endpoints using pandas and no other imports. You must make sure you have both fsspec and s3fs installed, as they are optional dependencies for pandas.

Then you can use

import pandas as pd

pd.read_csv(
    's3://mybucket/file.csv',
    storage_options={
        'client_kwargs': {
            'endpoint_url': <MY_S3_ENDPOINT_URL>
        }
    }
)

It's clunky, but it's required, as the maintainers of boto3 have for years refused, for some reason, to update the library to allow custom endpoint configuration outside of client construction (i.e. in config files or environment variables). But if you don't need to read/write through pandas, awswrangler is perhaps better, as others have mentioned.

Bart Spoon
  • 93
  • 1
  • 7
0

You can use AWS SDK for Pandas, a library that extends Pandas to work smoothly with AWS data stores, such as S3.

import awswrangler as wr
df = wr.s3.read_csv("s3://bucket/file.csv")

It can be installed via pip install awswrangler.
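
If the default credentials do not have access to the private bucket, an explicit boto3 session (e.g. a named profile) can be supplied; a sketch assuming your awswrangler version accepts a boto3_session argument, as recent versions do, and with a placeholder profile name:

import boto3
import awswrangler as wr

# Use a specific named profile for the private bucket
session = boto3.Session(profile_name='my-profile')
df = wr.s3.read_csv("s3://bucket/file.csv", boto3_session=session)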

Theofilos Papapanagiotou
  • 5,133
  • 1
  • 18
  • 24