131

I have a text file saved on S3 that is a tab-delimited table. I want to load it into pandas, but I cannot save it to disk first because I am running on a Heroku server. Here is what I have so far.

import io
import boto3
import os
import pandas as pd

os.environ["AWS_ACCESS_KEY_ID"] = "xxxxxxxx"
os.environ["AWS_SECRET_ACCESS_KEY"] = "xxxxxxxx"

s3_client = boto3.client('s3')
response = s3_client.get_object(Bucket="my_bucket", Key="filename.txt")
file = response["Body"]


pd.read_csv(file, header=14, delimiter="\t", low_memory=False)

the error is

OSError: Expected file path name or file-like object, got <class 'bytes'> type

How do I convert the response body into a format pandas will accept?

pd.read_csv(io.StringIO(file), header=14, delimiter="\t", low_memory=False)

returns

TypeError: initial_value must be str or None, not StreamingBody

pd.read_csv(io.BytesIO(file), header=14, delimiter="\t", low_memory=False)

returns

TypeError: 'StreamingBody' does not support the buffer interface

UPDATE: Using the following worked:

file = response["Body"].read()

and

pd.read_csv(io.BytesIO(file), header=14, delimiter="\t", low_memory=False)
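
For reference, here is the working approach consolidated into one sketch; the bucket name, key, and read_csv options are the placeholders from the question, and credentials are assumed to come from the environment:

import io
import boto3
import pandas as pd

# Credentials are assumed to be set via environment variables or ~/.aws/credentials
s3_client = boto3.client("s3")
response = s3_client.get_object(Bucket="my_bucket", Key="filename.txt")

# .read() drains the StreamingBody into bytes; BytesIO wraps those bytes
# in the file-like object pandas expects
df = pd.read_csv(
    io.BytesIO(response["Body"].read()),
    header=14,
    delimiter="\t",
    low_memory=False,
)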
– alpalalpal

10 Answers

164

pandas uses boto for read_csv, so you should be able to:

import boto
data = pd.read_csv('s3://bucket....csv')

If you need boto3 because you are on Python 3.4+, you can:

import boto3
import io
s3 = boto3.client('s3')
obj = s3.get_object(Bucket='bucket', Key='key')
df = pd.read_csv(io.BytesIO(obj['Body'].read()))

Since version 0.20.1, pandas uses s3fs; see the answer below.

– Stefan
  • Is there a way to use a URL without making it public to everyone? The file needs to stay private. – alpalalpal Jun 08 '16 at 18:43
  • The `boto3` docs show how to configure authentication so that you can access private files as well: http://boto3.readthedocs.io/en/latest/guide/quickstart.html – Stefan Jun 10 '16 at 21:20
  • It is throwing NoCredentialsError. How do I set s3 credentials to it? I am new to python and boto – Sunil Rao May 02 '17 at 10:32
  • I found that I had to do the following on the last example with boto3: `df = pd.read_csv(io.BytesIO(obj['Body'].read()), encoding='utf8')` – user394430 Jul 13 '17 at 22:19
  • This answer is **out of date**. Please see [Wesam's answer](https://stackoverflow.com/a/51777553/974555). – gerrit Jan 23 '20 at 09:09
  • @Stefan Thank you so much, I've been looking for days on how to simply read a csv from S3 into pandas dataframe and this worked for me! (using Python 3.6) – Chris Dec 15 '20 at 19:21
116

Now pandas can handle S3 URLs. You could simply do:

import pandas as pd
import s3fs

df = pd.read_csv('s3://bucket-name/file.csv')

You need to install s3fs if you don't have it: pip install s3fs

Authentication

If your S3 bucket is private and requires authentication, you have two options:

1- Add access credentials to your ~/.aws/credentials config file

[default]
aws_access_key_id=AKIAIOSFODNN7EXAMPLE
aws_secret_access_key=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY

Or

2- Set the following environment variables with their proper values (see the sketch after this list):

  • AWS_ACCESS_KEY_ID
  • AWS_SECRET_ACCESS_KEY
  • AWS_SESSION_TOKEN
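
A minimal sketch of option 2, assuming the variables are set from Python itself; the key values and bucket path are placeholders:

import os
import pandas as pd

# Placeholder values; in practice these usually come from your deployment environment
os.environ["AWS_ACCESS_KEY_ID"] = "xxxxxxxx"
os.environ["AWS_SECRET_ACCESS_KEY"] = "xxxxxxxx"
# AWS_SESSION_TOKEN is only needed for temporary credentials
# os.environ["AWS_SESSION_TOKEN"] = "xxxxxxxx"

# s3fs picks the credentials up from the environment
df = pd.read_csv("s3://bucket-name/file.csv")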
– Sam
  • Beautiful. Works in python3. – Kyler Brown Feb 27 '19 at 01:05
  • how about authentication ..? – James Wierzba Jun 03 '19 at 19:16
  • @JamesWierzba , I added more details on authentication to my answer above. – Sam Jun 03 '19 at 23:41
  • When dealing with multiple aws profiles, how can you select which profile should be used? s3fs has the profile_name option, but I'm not sure how that works with pandas. – Ivo Merchiers Jun 20 '19 at 12:57
  • This be the way to go. Not sure why *pandas* doesn't require and then import *s3fs* though? – rjurney Sep 18 '19 at 01:58
  • @IvoMerchiers have you found an answer to your question? – IanS Jan 10 '20 at 14:41
  • @IanS Not really, currently, I first open the file object with s3fs (using the specified profile) and then I read it with pandas, like they do here https://github.com/pandas-dev/pandas/issues/16692 – Ivo Merchiers Jan 13 '20 at 07:35
  • > When dealing with multiple aws profiles: it will be supported as of 1.2.0 (https://github.com/pandas-dev/pandas/blob/df32e83f36bf485be803be2b87d23135be30540a/pandas/io/parsers.py#L358, https://github.com/pandas-dev/pandas/pull/35381): `pd.read_csv('s3://hoge/piyo', storage_options={'aws_access_key_id': 'xxxxxxxxxx', 'aws_secret_access_key': 'yyyyyyyyyy'})` – gecko655 Nov 18 '20 at 08:55
22

This is now supported in the latest pandas. See

http://pandas.pydata.org/pandas-docs/stable/io.html#reading-remote-files

e.g.,

df = pd.read_csv('s3://pandas-test/tips.csv')
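
Applied to the tab-delimited file from the question, this would look something like the following; the bucket and key are the question's placeholders:

import pandas as pd

# s3fs must be installed for the s3:// URL to work
df = pd.read_csv(
    "s3://my_bucket/filename.txt",
    sep="\t",
    header=14,
    low_memory=False,
)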
– 4b0
18

For Python 3.6+, Amazon now has a really nice library for using pandas with its services, called awswrangler.

import awswrangler as wr
import boto3


# Boto3 session
session = boto3.session.Session(aws_access_key_id='XXXX', 
                                aws_secret_access_key='XXXX')

# awswrangler passes all pd.read_csv() function args forward
df = wr.s3.read_csv(path='s3://bucket/path/',
                    boto3_session=session,
                    skiprows=2,
                    sep=';',
                    decimal=',',
                    na_values=['--'])

To install awswrangler: pip install awswrangler

– Ricardo Mutti
13

With s3fs it can be done as follows:

import s3fs
import pandas as pd
fs = s3fs.S3FileSystem(anon=False)

# CSV
with fs.open('mybucket/path/to/object/foo.csv') as f:
    df = pd.read_csv(f)

# Pickle
with fs.open('mybucket/path/to/object/foo.pkl') as f:
    df = pd.read_pickle(f)
– Dror
3

Since the files can be too large, it is not wise to load them into a dataframe all at once. Instead, read them line by line and build the dataframe from that. Yes, we can also pass a chunk size to read_csv, but then we have to keep track of the number of rows read.

Hence, I came up with this approach:

def create_file_object_for_streaming(self):
    print("creating file object for streaming")
    self.file_object = self.bucket.Object(key=self.package_s3_key)
    print("File object is: " + str(self.file_object))
    print("Object file created.")
    return self.file_object

for row in codecs.getreader(self.encoding)(self.response[u'Body']).readlines():
    row_string = StringIO(row)
    df = pd.read_csv(row_string, sep=",")

I also delete the df once the work is done: del df
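
For comparison, here is a sketch of the chunksize alternative mentioned above; the bucket, key, chunk size, and process() helper are placeholders:

import io
import boto3
import pandas as pd

s3_client = boto3.client("s3")
obj = s3_client.get_object(Bucket="my_bucket", Key="filename.txt")

# With chunksize, read_csv returns an iterator that parses one chunk
# into a DataFrame at a time instead of the whole file at once
for chunk in pd.read_csv(io.BytesIO(obj["Body"].read()), sep=",", chunksize=10000):
    process(chunk)  # process() is a placeholder for the per-chunk work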

– aviral sanjay
2

For text files, you can use the code below with a pipe-delimited file, for example:

import pandas as pd
import io
import boto3

s3_client = boto3.client('s3', use_ssl=False)
bucket = #
prefix = #
obj = s3_client.get_object(Bucket=bucket, Key=prefix + filename)
df = pd.read_fwf(io.BytesIO(obj['Body'].read()), encoding='unicode_escape',
                 delimiter='|', error_bad_lines=False, header=None, dtype=str)
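
If fixed-width parsing is not actually needed, the same bytes can also be handed to read_csv with sep='|'; a sketch with placeholder bucket and key names:

import io
import boto3
import pandas as pd

s3_client = boto3.client('s3')
obj = s3_client.get_object(Bucket='my-bucket', Key='path/to/file.txt')

# read_csv accepts an arbitrary single-character separator such as '|'
df = pd.read_csv(io.BytesIO(obj['Body'].read()), sep='|', header=None, dtype=str)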
– Hari_pb
1

import os
import pandas as pd
import boto3

session = boto3.Session(profile_name="test")

os.environ['AWS_ACCESS_KEY_ID'] = session.get_credentials().access_key
os.environ['AWS_SECRET_ACCESS_KEY'] = session.get_credentials().secret_key

This way you can use any profile (AWS account) you have stored in ~/.aws/credentials:

df = pd.read_csv("s3://xxxx.csv")
0

An option is to convert the CSV to JSON via df.to_dict() and then store it as a string. Note this is only relevant if the CSV format is not a requirement and you just want to quickly put the dataframe in an S3 bucket and retrieve it again.

from boto.s3.connection import S3Connection
import pandas as pd
import yaml

conn = S3Connection()
mybucket = conn.get_bucket('mybucketName')
myKey = mybucket.get_key("myKeyName")

myKey.set_contents_from_string(str(df.to_dict()))

This will convert the df to a dict string and save that string in S3. You can later read it back into a dataframe in the same format:

df = pd.DataFrame(yaml.safe_load(myKey.get_contents_as_string()))

The other solutions are also good, but this is a little simpler. YAML is not strictly required, but you need something to parse the dict string. If the S3 file doesn't necessarily need to be a CSV, this can be a quick fix.

– billmanH
0

import s3fs
import pandas as pd
s3 = s3fs.S3FileSystem(profile='<profile_name>')
pd.read_csv(s3.open('<s3_path>'))
– Ze Tang