131

I have a text file saved on S3 that is a tab-delimited table. I want to load it into pandas, but I cannot save it to disk first because I am running on a Heroku server. Here is what I have so far.

import io
import boto3
import os
import pandas as pd

os.environ["AWS_ACCESS_KEY_ID"] = "xxxxxxxx"
os.environ["AWS_SECRET_ACCESS_KEY"] = "xxxxxxxx"

s3_client = boto3.client('s3')
response = s3_client.get_object(Bucket="my_bucket", Key="filename.txt")
file = response["Body"]


pd.read_csv(file, header=14, delimiter="\t", low_memory=False)

the error is

OSError: Expected file path name or file-like object, got <class 'bytes'> type

How do I convert the response body into a format pandas will accept?

pd.read_csv(io.StringIO(file), header=14, delimiter="\t", low_memory=False)

returns

TypeError: initial_value must be str or None, not StreamingBody

pd.read_csv(io.BytesIO(file), header=14, delimiter="\t", low_memory=False)

returns

TypeError: 'StreamingBody' does not support the buffer interface

UPDATE: Using the following worked:

file = response["Body"].read()

and

pd.read_csv(io.BytesIO(file), header=14, delimiter="\t", low_memory=False)
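
For reference, here is the working approach consolidated into one sketch; the bucket name, key, and read_csv options are the placeholders from the question, and credentials are assumed to come from the environment:

import io
import boto3
import pandas as pd

# Credentials are assumed to be set via environment variables or ~/.aws/credentials
s3_client = boto3.client("s3")
response = s3_client.get_object(Bucket="my_bucket", Key="filename.txt")

# .read() drains the StreamingBody into bytes; BytesIO wraps those bytes
# in the file-like object pandas expects
df = pd.read_csv(
    io.BytesIO(response["Body"].read()),
    header=14,
    delimiter="\t",
    low_memory=False,
)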
– alpalalpal

10 Answers

164

pandas uses boto for read_csv, so you should be able to:

import boto
data = pd.read_csv('s3://bucket....csv')

If you need boto3 because you are on Python 3.4+, you can:

import boto3
import io
s3 = boto3.client('s3')
obj = s3.get_object(Bucket='bucket', Key='key')
df = pd.read_csv(io.BytesIO(obj['Body'].read()))

Since version 0.20.1, pandas uses s3fs; see the answer below.

– Stefan
  • Is there a way to use a URL without making it public to everyone? The file needs to stay private. – alpalalpal Jun 08 '16 at 18:43
  • The `boto3` docs show how to configure authentication so that you can access private files as well: http://boto3.readthedocs.io/en/latest/guide/quickstart.html – Stefan Jun 10 '16 at 21:20
  • It is throwing NoCredentialsError. How do I set s3 credentials to it? I am new to python and boto – Sunil Rao May 02 '17 at 10:32
  • I found that I had to do the following on the last example with boto3: `df = pd.read_csv(io.BytesIO(obj['Body'].read()), encoding='utf8')` – user394430 Jul 13 '17 at 22:19
  • This answer is **out of date**. Please see [Wesam's answer](https://stackoverflow.com/a/51777553/974555). – gerrit Jan 23 '20 at 09:09
  • @Stefan Thank you so much, I've been looking for days on how to simply read a csv from S3 into pandas dataframe and this worked for me! (using Python 3.6) – Chris Dec 15 '20 at 19:21
116

Now pandas can handle S3 URLs. You could simply do:

import pandas as pd
import s3fs

df = pd.read_csv('s3://bucket-name/file.csv')

You need to install s3fs if you don't have it: pip install s3fs

Authentication

If your S3 bucket is private and requires authentication, you have two options:

1- Add access credentials to your ~/.aws/credentials config file

[default]
aws_access_key_id=AKIAIOSFODNN7EXAMPLE
aws_secret_access_key=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY

Or

2- Set the following environment variables with their proper values (see the sketch after this list):

  • AWS_ACCESS_KEY_ID
  • AWS_SECRET_ACCESS_KEY
  • AWS_SESSION_TOKEN
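
A minimal sketch of option 2, assuming the variables are set from Python itself; the key values and bucket path are placeholders:

import os
import pandas as pd

# Placeholder values; in practice these usually come from your deployment environment
os.environ["AWS_ACCESS_KEY_ID"] = "xxxxxxxx"
os.environ["AWS_SECRET_ACCESS_KEY"] = "xxxxxxxx"
# AWS_SESSION_TOKEN is only needed for temporary credentials
# os.environ["AWS_SESSION_TOKEN"] = "xxxxxxxx"

# s3fs picks the credentials up from the environment
df = pd.read_csv("s3://bucket-name/file.csv")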
– Sam
  • Beautiful. Works in python3. – Kyler Brown Feb 27 '19 at 01:05
  • how about authentication ..? – James Wierzba Jun 03 '19 at 19:16
  • @JamesWierzba , I added more details on authentication to my answer above. – Sam Jun 03 '19 at 23:41
  • When dealing with multiple aws profiles, how can you select which profile should be used? s3fs has the profile_name option, but I'm not sure how that works with pandas. – Ivo Merchiers Jun 20 '19 at 12:57
  • This be the way to go. Not sure why *pandas* doesn't require and then import *s3fs* though? – rjurney Sep 18 '19 at 01:58
  • @IvoMerchiers have you found an answer to your question? – IanS Jan 10 '20 at 14:41
  • @IanS Not really, currently, I first open the file object with s3fs (using the specified profile) and then I read it with pandas, like they do here https://github.com/pandas-dev/pandas/issues/16692 – Ivo Merchiers Jan 13 '20 at 07:35
  • > When dealing with multiple aws profiles: it will be supported as of 1.2.0 (https://github.com/pandas-dev/pandas/blob/df32e83f36bf485be803be2b87d23135be30540a/pandas/io/parsers.py#L358, https://github.com/pandas-dev/pandas/pull/35381): `pd.read_csv('s3://hoge/piyo', storage_options={'aws_access_key_id': 'xxxxxxxxxx', 'aws_secret_access_key': 'yyyyyyyyyy'})` – gecko655 Nov 18 '20 at 08:55
22

This is now supported in the latest pandas. See

http://pandas.pydata.org/pandas-docs/stable/io.html#reading-remote-files

e.g.,

df = pd.read_csv('s3://pandas-test/tips.csv')
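
Applied to the tab-delimited file from the question, this would look something like the following; the bucket and key are the question's placeholders:

import pandas as pd

# s3fs must be installed for the s3:// URL to work
df = pd.read_csv(
    "s3://my_bucket/filename.txt",
    sep="\t",
    header=14,
    low_memory=False,
)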
– 4b0
18

For Python 3.6+, Amazon now has a really nice library for using pandas with its services, called awswrangler.

import awswrangler as wr
import boto3


# Boto3 session
session = boto3.session.Session(aws_access_key_id='XXXX', 
                                aws_secret_access_key='XXXX')

# awswrangler passes all pd.read_csv() function args forward
df = wr.s3.read_csv(path='s3://bucket/path/',
                    boto3_session=session,
                    skiprows=2,
                    sep=';',
                    decimal=',',
                    na_values=['--'])

To install awswrangler: pip install awswrangler

– Ricardo Mutti
13

With s3fs it can be done as follows:

import s3fs
import pandas as pd
fs = s3fs.S3FileSystem(anon=False)

# CSV
with fs.open('mybucket/path/to/object/foo.csv') as f:
    df = pd.read_csv(f)

# Pickle
with fs.open('mybucket/path/to/object/foo.pkl') as f:
    df = pd.read_pickle(f)
– Dror
3

Since the files can be too large, it is not wise to load them into a dataframe all at once. Instead, read them line by line and build the dataframe from that. Yes, we can also pass a chunk size to read_csv, but then we have to keep track of the number of rows read.

Hence, I came up with this approach:

def create_file_object_for_streaming(self):
    print("creating file object for streaming")
    self.file_object = self.bucket.Object(key=self.package_s3_key)
    print("File object is: " + str(self.file_object))
    print("Object file created.")
    return self.file_object

for row in codecs.getreader(self.encoding)(self.response[u'Body']).readlines():
    row_string = StringIO(row)
    df = pd.read_csv(row_string, sep=",")

I also delete the df once the work is done: del df
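
For comparison, here is a sketch of the chunksize alternative mentioned above; the bucket, key, chunk size, and process() helper are placeholders:

import io
import boto3
import pandas as pd

s3_client = boto3.client("s3")
obj = s3_client.get_object(Bucket="my_bucket", Key="filename.txt")

# With chunksize, read_csv returns an iterator that parses one chunk
# into a DataFrame at a time instead of the whole file at once
for chunk in pd.read_csv(io.BytesIO(obj["Body"].read()), sep=",", chunksize=10000):
    process(chunk)  # process() is a placeholder for the per-chunk work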

– aviral sanjay
2

For text files, you can use the code below with a pipe-delimited file, for example:

import pandas as pd
import io
import boto3

s3_client = boto3.client('s3', use_ssl=False)
bucket = #
prefix = #
obj = s3_client.get_object(Bucket=bucket, Key=prefix + filename)
df = pd.read_fwf(io.BytesIO(obj['Body'].read()), encoding='unicode_escape',
                 delimiter='|', error_bad_lines=False, header=None, dtype=str)
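
If fixed-width parsing is not actually needed, the same bytes can also be handed to read_csv with sep='|'; a sketch with placeholder bucket and key names:

import io
import boto3
import pandas as pd

s3_client = boto3.client('s3')
obj = s3_client.get_object(Bucket='my-bucket', Key='path/to/file.txt')

# read_csv accepts an arbitrary single-character separator such as '|'
df = pd.read_csv(io.BytesIO(obj['Body'].read()), sep='|', header=None, dtype=str)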
– Hari_pb
1

import os
import pandas as pd
import boto3

session = boto3.Session(profile_name="test")

os.environ['AWS_ACCESS_KEY_ID'] = session.get_credentials().access_key
os.environ['AWS_SECRET_ACCESS_KEY'] = session.get_credentials().secret_key

This way you can use any profile (AWS account) you have stored in ~/.aws/credentials:

df = pd.read_csv("s3://xxxx.csv")
0

An option is to convert the CSV to JSON via df.to_dict() and then store it as a string. Note this is only relevant if the CSV format is not a requirement and you just want to quickly put the dataframe in an S3 bucket and retrieve it again.

from boto.s3.connection import S3Connection
import pandas as pd
import yaml

conn = S3Connection()
mybucket = conn.get_bucket('mybucketName')
myKey = mybucket.get_key("myKeyName")

myKey.set_contents_from_string(str(df.to_dict()))

This will convert the df to a dict string and save that string in S3. You can later read it back into a dataframe in the same format:

df = pd.DataFrame(yaml.safe_load(myKey.get_contents_as_string()))

The other solutions are also good, but this is a little simpler. YAML is not strictly required, but you need something to parse the dict string. If the S3 file doesn't necessarily need to be a CSV, this can be a quick fix.

– billmanH
0

import s3fs
import pandas as pd
s3 = s3fs.S3FileSystem(profile='<profile_name>')
pd.read_csv(s3.open('<s3_path>'))
– Ze Tang