206

I have a pandas DataFrame that I want to upload to a new CSV file. The problem is that I don't want to save the file locally before transferring it to S3. Is there a method like to_csv for writing the DataFrame to S3 directly? I am using boto3.
Here is what I have so far:

import pandas as pd
import boto3

s3 = boto3.client('s3', aws_access_key_id='key', aws_secret_access_key='secret_key')
read_file = s3.get_object(Bucket=bucket, Key=key)  # get_object requires keyword arguments
df = pd.read_csv(read_file['Body'])

# Make alterations to DataFrame

# Then export DataFrame to CSV through direct transfer to s3
Cinder Biscuits
  • 4,880
  • 31
  • 51
user2494275
  • 2,063
  • 2
  • 13
  • 4

15 Answers

263

You can use:

from io import StringIO # python3; python2: BytesIO 
import boto3

bucket = 'my_bucket_name' # already created on S3
csv_buffer = StringIO()
df.to_csv(csv_buffer)
s3_resource = boto3.resource('s3')
s3_resource.Object(bucket, 'df.csv').put(Body=csv_buffer.getvalue())
Stefan
  • 41,759
  • 13
  • 76
  • 81
  • Should be `from io import StringIO`; SO doesn't seem to allow edits less than 6 characters. – Satyadev May 09 '17 at 06:19
  • 22
    If this is a large file, what does this do to memory...? – citynorman Jan 03 '18 at 22:33
  • 4
    If the file is bigger than the RAM you have available, the action will fail and raise an exception (don't know which one). This should be accepted as the answer – Eran Moshe Jan 09 '18 at 12:16
  • 8
    I got `TypeError: unicode argument expected, got 'str'` error while using `StringIO`. I used `BytesIO` and it worked perfectly fine. _Note: this was in Python 2.7_ – Abhishek Upadhyaya Feb 01 '18 at 17:42
  • 3
    what is `bucket` object? how did you create that? – Charles Chow Jun 07 '19 at 23:27
  • 2
    `bucket` is where you store objects on S3. The code assumes you have already created the destination (think: directory) where to store this. See [S3 docs](https://docs.aws.amazon.com/AmazonS3/latest/dev/Introduction.html) – Stefan Jun 08 '19 at 14:42
  • something like : bucket= s3_resource.Bucket('bucket_name') ? – FrankyBravo Aug 02 '19 at 12:00
  • Got error: AttributeError: 'DataFrame' object has no attribute 'to_csv' – nik Apr 20 '20 at 17:40
  • I got Access Denied for a while because the bucket was using different access keys than my default. If this is the case, then you need to also specify the S3 session with this snippet of code before calling the session.resource(): session = boto3.Session( aws_access_key_id=, aws_secret_access_key=, ) – shad Apr 29 '20 at 16:59
  • This works, but changes the owner of the files to my account. Is there a way to keep the Bucket owner as the owner of the written files? – Hamman Samuel Aug 10 '21 at 16:44
  • Where the content of dataframe? like df_csv = pd.DataFrame(df, columns=['text']) – MaSao93 Feb 18 '22 at 03:11
  • 1
    is this approach thread safe ? want to run this in multi-threaded environment – Vineet Nov 17 '22 at 10:17
  • my script runs fine with no errors but the file never uploads – Raksha Dec 19 '22 at 18:26
156

You can directly use the S3 path. I am using Pandas 0.24.1

In [1]: import pandas as pd

In [2]: df = pd.DataFrame( [ [1, 1, 1], [2, 2, 2] ], columns=['a', 'b', 'c'])

In [3]: df
Out[3]:
   a  b  c
0  1  1  1
1  2  2  2

In [4]: df.to_csv('s3://experimental/playground/temp_csv/dummy.csv', index=False)

In [5]: pd.__version__
Out[5]: '0.24.1'

In [6]: new_df = pd.read_csv('s3://experimental/playground/temp_csv/dummy.csv')

In [7]: new_df
Out[7]:
   a  b  c
0  1  1  1
1  2  2  2

Release Note:

S3 File Handling

pandas now uses s3fs for handling S3 connections. This shouldn’t break any code. However, since s3fs is not a required dependency, you will need to install it separately, like boto in prior versions of pandas. GH11915.

yardstick17
  • 4,322
  • 1
  • 26
  • 33
  • 16
    this is definitely the easiest answer now, it uses s3fs behind the scenes so you need to add that to your requirements.txt – JD D Jun 03 '19 at 18:34
  • 3
    I like that it is easy, but it seems it's not really working since I keep getting the following error ```NoCredentialsError: Unable to locate credentials```. Any suggestions? – CathyQian Aug 09 '19 at 18:43
  • 2
    I can confirm this does not work with pandas <= 0.23.4, so be sure to upgrade to pandas 0.24 – Guido Aug 13 '19 at 08:31
  • 1
    This is the error i see when i try to use to_csv command TypeError: write() argument 1 must be unicode, not str – Raj Sep 01 '19 at 10:27
  • 15
    I'm using pandas 0.24.2 and what I get is `NotImplementedError: Text mode not supported, use mode='wb' and manage bytes`. any suggestions? – Binyamin Even Oct 28 '19 at 09:33
  • When I use this code, every line in the CSV file in S3 was skipped; what could be the solution? – manda1238 Sep 14 '20 at 11:56
  • This is great if you're using AWS Lambda to execute your code too BTW. Just create a lambda layer with the install files for pandas and s3fs, and you're good to go: direct communication with buckets in your AWS instance. – talbe009 Nov 12 '20 at 21:00
  • if you want to use AWS Profiles (IAM roles) you just: `df.to_csv(path_or_buf="s3://your-bucket/key/df.csv", storage_options={"profile":"your-profile-name"})` – rezan21 Aug 26 '21 at 19:53
  • 1
    This link shows clearly how to pass credentials for pandas API which uses s3fs https://towardsdatascience.com/reading-and-writing-files-from-to-amazon-s3-with-pandas-ccaf90bfe86c – jason Nov 20 '21 at 22:27
  • I'm getting an error `AttributeError: 'AioClientCreator' object has no attribute '_register_lazy_block_unknown_fips_pseudo_regions'` when I try to do it this way (this is with fsspec and s3fs installed). – szeitlin Dec 30 '21 at 18:08
  • This work when working from AWS SageMaker Studio notebooks: `df.to_csv(f"s3://{bucket}/{prefix}/data.csv", index=False, header=False)`, without having to manually add some other dependencies. – Jordi Sabater Picañol Jun 13 '22 at 14:07
  • Per @jason's [link](https://towardsdatascience.com/reading-and-writing-files-from-to-amazon-s3-with-pandas-ccaf90bfe86c), you can pass credentials/keys using the storage_options parameter (in pandas v.1.2.0+): `aws_credentials = { "key": "***", "secret": "***", "token": "***" } df.to_csv("s3://...", index=False, storage_options=aws_credentials)` or use one of the methods from s3fs here: https://s3fs.readthedocs.io/en/latest/#credentials – Gabe Jun 24 '22 at 21:09
70

I like s3fs, which lets you use S3 (almost) like a local filesystem.

You can do this:

import s3fs

bytes_to_write = df.to_csv(None).encode()
fs = s3fs.S3FileSystem(key=key, secret=secret)
with fs.open('s3://bucket/path/to/file.csv', 'wb') as f:
    f.write(bytes_to_write)

s3fs supports only rb and wb modes for opening files, which is why I did this bytes_to_write step.

michcio1234
  • 1,700
  • 13
  • 18
  • Great! How can I get the file url using same s3fs module? – M.Zaman Oct 13 '17 at 12:09
  • I was looking for the URL from where I can download the written file, anyways I get that via S3FileSystem. Thanks – M.Zaman Oct 17 '17 at 06:26
  • this is what i use; thanks. I am curious why pd.read_csv() works as expected but for writing we have to use this work around.. except in the case that i'm writing directly to the s3 bucket my jupyter is in. – Renée Aug 29 '18 at 10:44
  • 1
    @michcio1234 how can i do the same in append mode ? I need to append the data in existing csv on s3 – j ' Dec 05 '19 at 07:21
  • 1
    @j' `s3fs` doesn't seem to support append mode. – michcio1234 Dec 06 '19 at 10:35
57

This is a more up-to-date answer:

import s3fs

s3 = s3fs.S3FileSystem(anon=False)

# Use 'w' for py3, 'wb' for py2
with s3.open('<bucket-name>/<filename>.csv','w') as f:
    df.to_csv(f)

The problem with StringIO is that it will eat away at your memory. With this method, you are streaming the file to S3 rather than converting it to a string and then writing it into S3. Holding the pandas DataFrame and its string copy in memory seems very inefficient.

If you are working on an EC2 instance, you can give it an IAM role to enable writing to S3, so you don't need to pass in credentials directly. However, you can also connect to a bucket by passing credentials to the S3FileSystem() function, as in the sketch below; see the documentation: https://s3fs.readthedocs.io/en/latest/
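
A minimal sketch of passing explicit credentials (the key/secret values and bucket path here are placeholders):

import pandas as pd
import s3fs

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})

# explicit credentials instead of an instance IAM role; values are placeholders
s3 = s3fs.S3FileSystem(key='YOUR_ACCESS_KEY_ID', secret='YOUR_SECRET_ACCESS_KEY')

# stream the DataFrame to S3 without building the whole CSV string in memory first
with s3.open('<bucket-name>/<filename>.csv', 'w') as f:
    df.to_csv(f)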

erncyp
  • 1,649
  • 21
  • 23
  • 1
    For some reason when I did this every line was skipped in the output CSV – kjmerf Aug 20 '19 at 22:40
  • hmm. not sure why that would happen. perhaps try with another pandas df to see if you still get the problem? If your version of pandas supports it, try @amit-kushwaha 's answer, where you pass the s3 url directly to `to_csv()`. seems like a cleaner implementation. – erncyp Sep 04 '19 at 09:38
  • @erncyp I seem to be getting there error: `botocore.exceptions.ClientError: An error occurred (AccessDenied) when calling the PutObject operation: Access Denied` ... I have even made the bucket PUBLIC READ and I have added the following Actions, under my specific account IAM user, in the Bucket Policy:`"Action": [ "s3:PutObject", "s3:PutObjectAcl", "s3:GetObject", "s3:GetObjectAcl", "s3:DeleteObject" ]` – ajoros Apr 29 '20 at 09:46
  • seems like you are lacking the permissions? Make sure to attach S3 read write permissions to the IAM role you are using – erncyp Apr 29 '20 at 10:15
  • @erncyp I have AdministratorAccess policy attached to my IAM user, so in theory I should be able to read/write just fine... Oddly, I am able to write just fine when I use the following function I made, using another StackOverflow user's advice (fyi semi-colons are end-of-line since i dont know how to format in comment section): `def send_to_bucket(df, fn_out, bucketname): csv_buffer = StringIO(); df.to_csv(csv_buffer); s3_resource = boto3.resource('s3'); s3_resource.Object(bucketname, fn_out).put(Body=csv_buffer.getvalue());` – ajoros May 01 '20 at 08:12
  • @erncyp got it to work, it was an IDE issue with PyCharm, super weird. Worked in terminal but didnt work in iPython terminal via PyCharm. Needed to restart the kernel and then it worked. Super weird. – ajoros May 01 '20 at 08:53
  • is s3fs approach thread safe ? – Vineet Nov 17 '22 at 10:09
  • I get "botocore.exceptions.NoCredentialsError: Unable to locate credentials" – Raksha Dec 19 '22 at 18:20
  • when I tried this I get: OSError: [Errno 22] The XML you provided was not well-formed or did not validate against our published schema. Did anyone manage to solve it? – Rachel Shalom May 31 '23 at 08:57
17

If you pass None as the first argument to to_csv(), the data will be returned as a string. From there, it's an easy step to upload that to S3 in one go.

It should also be possible to pass a StringIO object to to_csv(), but using a string will be easier, as sketched below.
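
A minimal sketch of this approach, assuming a boto3 client and hypothetical bucket/key names:

import boto3
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})

# to_csv(None) returns the CSV content as a string instead of writing to a file
csv_string = df.to_csv(None, index=False)

# 'my-bucket' and 'path/df.csv' are placeholders
s3 = boto3.client('s3')
s3.put_object(Bucket='my-bucket', Key='path/df.csv', Body=csv_string)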

mhawke
  • 84,695
  • 9
  • 117
  • 138
  • 1
    Will be easier in which way? What is the correct way to do it ? – Eran Moshe Jan 09 '18 at 13:24
  • @EranMoshe: either way will work correctly, but obviously it's easier to pass `None` to `to_csv()` and use the returned string than it is to create a `StringIO` object and then read the data back out. – mhawke Jan 10 '18 at 00:26
  • As a lazy programmer that's what I did. And you meant easier for the programmer who writes less code :> – Eran Moshe Jan 10 '18 at 06:06
17

You can also use the AWS Data Wrangler:

import awswrangler as wr
    
wr.s3.to_csv(
    df=df,
    path="s3://...",
)

Note that it will handle multipart upload for you to make the upload faster.

Tulio Casagrande
  • 1,499
  • 1
  • 15
  • 20
gabra
  • 9,484
  • 4
  • 29
  • 45
12

I found this can be done using client also and not just resource.

from io import StringIO
import boto3
s3 = boto3.client("s3",
                  region_name=region_name,
                  aws_access_key_id=aws_access_key_id,
                  aws_secret_access_key=aws_secret_access_key)
csv_buf = StringIO()
df.to_csv(csv_buf, header=True, index=False)
csv_buf.seek(0)
s3.put_object(Bucket=bucket, Body=csv_buf.getvalue(), Key='path/test.csv')
Hari_pb
  • 7,088
  • 3
  • 45
  • 53
5

I use AWS Data Wrangler. For example:

import awswrangler as wr
import pandas as pd

# read a local dataframe
df = pd.read_parquet('my_local_file.gz')

# upload to S3 bucket
wr.s3.to_parquet(df=df, path='s3://mys3bucket/file_name.gz')

The same applies to CSV files: instead of read_parquet and to_parquet, use read_csv and to_csv with the proper file extension, as in the sketch below.
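
A minimal CSV sketch of the same pattern (the local file and bucket names are placeholders):

import awswrangler as wr
import pandas as pd

# read a local CSV into a DataFrame
df = pd.read_csv('my_local_file.csv')

# upload to an S3 bucket
wr.s3.to_csv(df=df, path='s3://mys3bucket/file_name.csv')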

Aziz Alto
  • 19,057
  • 5
  • 77
  • 60
2

Since you are using boto3.client(), try:

import boto3
from io import StringIO #python3 
s3 = boto3.client('s3', aws_access_key_id='key', aws_secret_access_key='secret_key')
def copy_to_s3(client, df, bucket, filepath):
    csv_buf = StringIO()
    df.to_csv(csv_buf, header=True, index=False)
    csv_buf.seek(0)
    client.put_object(Bucket=bucket, Body=csv_buf.getvalue(), Key=filepath)
    print(f'Copy {df.shape[0]} rows to S3 Bucket {bucket} at {filepath}, Done!')

copy_to_s3(client=s3, df=df_to_upload, bucket='abc', filepath='def/test.csv')
jerrytim
  • 668
  • 6
  • 11
2

You can use

  • pandas
  • boto3
  • s3fs (version ≤0.4)

I use to_csv with an s3:// path and storage_options:

key = "folder/file.csv"

df.to_csv(
    f"s3://{YOUR_S3_BUCKET}/{key}",
    index=False,
    storage_options={
        "key": AWS_ACCESS_KEY_ID,
        "secret": AWS_SECRET_ACCESS_KEY,
        "token": AWS_SESSION_TOKEN,
    },
)

Ruscinc
  • 56
  • 5
1
from io import StringIO
import boto3

# Create a session with boto3
session = boto3.Session(
    aws_access_key_id='<your_access_key_id>',
    aws_secret_access_key='<your_secret_access_key>'
)

# Create an S3 resource from the session
s3_res = session.resource('s3')

csv_buffer = StringIO()
df.to_csv(csv_buffer)

bucket_name = 'stackvidhya'
s3_object_name = 'df.csv'
s3_res.Object(bucket_name, s3_object_name).put(Body=csv_buffer.getvalue())
print("Dataframe is saved as CSV in S3 bucket.")
techiemas
  • 11
  • 1
  • 3
    Your answer could be improved with additional supporting information. Please [edit] to add further details, such as citations or documentation, so that others can confirm that your answer is correct. You can find more information on how to write good answers [in the help center](/help/how-to-answer). – Community Jul 14 '22 at 06:01
0

To handle large files efficiently, you can also use the open-source, S3-compatible MinIO server with its minio Python client package, as in this function of mine:

import minio
import os
import pandas as pd

minio_client = minio.Minio(..)

def write_df_to_minio(df, 
                    minio_client, 
                    bucket_name, 
                    file_name="new-file.csv",
                    local_temp_folder="/tmp/", 
                    content_type="application/csv",
                    sep=",",
                    save_row_index=False):

    df.to_csv(os.path.join(local_temp_folder, file_name), sep=sep, index=save_row_index)
    
    minio_results = minio_client.fput_object(bucket_name=bucket_name,
                                             object_name=file_name,
                                             file_path=os.path.join(local_temp_folder, file_name),
                                             content_type=content_type)

    assert minio_results.object_name == file_name

mirekphd
  • 4,799
  • 3
  • 38
  • 59
0

Another option is to do this with cloudpathlib, which supports S3 and also Google Cloud Storage and Azure Blob Storage. See example below.

import pandas as pd
from cloudpathlib import CloudPath

# read data from S3
df = pd.read_csv(CloudPath("s3://covid19-lake/rearc-covid-19-testing-data/csv/states_daily/states_daily.csv"))

# look at some of the data
df.head(1).T.iloc[:10]
#>                                       0
#> date                           20210307
#> state                                AK
#> positive                        56886.0
#> probableCases                       NaN
#> negative                            NaN
#> pending                             NaN
#> totalTestResultsSource  totalTestsViral
#> totalTestResults              1731628.0
#> hospitalizedCurrently              33.0
#> hospitalizedCumulative           1293.0

# writing to S3
with CloudPath("s3://bucket-you-can-write-to/data.csv").open("w") as f:
    df.to_csv(f)

CloudPath("s3://bucket-you-can-write-to/data.csv").exists()
#> True

Note that you can't call df.to_csv(CloudPath("s3://drivendata-public-assets/test-asdf2.csv")) directly because of the way pandas handles the paths/handles passed to it. Instead, you need to open the file for writing and pass that handle directly to to_csv.

This comes with a few added benefits, such as setting particular options, using different authentication mechanisms, or keeping a persistent cache so you don't always need to re-download from S3.

hume
  • 2,413
  • 19
  • 21
0

For those who might have problems with S3FS or fsspec using Lambda:

You have to create a layer for each library and insert them in your Lambda.

You can find how to create a layer here.

Luis Felipe
  • 148
  • 9
-5

I read a CSV with two columns from an S3 bucket, and I put the content of the CSV file into a pandas DataFrame.

Example:

config.json

{
  "credential": {
    "access_key":"xxxxxx",
    "secret_key":"xxxxxx"
}
,
"s3":{
       "bucket":"mybucket",
       "key":"csv/user.csv"
   }
}

cls_config.py

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import os
import json

class cls_config(object):

    def __init__(self,filename):

        self.filename = filename


    def getConfig(self):

        fileName = os.path.join(os.path.dirname(__file__), self.filename)
        with open(fileName) as f:
            config = json.load(f)
        return config

cls_pandas.py

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import pandas as pd
import io

class cls_pandas(object):

    def __init__(self):
        pass

    def read(self,stream):

        df = pd.read_csv(io.StringIO(stream), sep = ",")
        return df

cls_s3.py

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import boto3
import json

class cls_s3(object):

    def __init__(self, access_key, secret_key):

        self.s3 = boto3.client('s3', aws_access_key_id=access_key, aws_secret_access_key=secret_key)

    def getObject(self,bucket,key):

        read_file = self.s3.get_object(Bucket=bucket, Key=key)
        body = read_file['Body'].read().decode('utf-8')
        return body

test.py

#!/usr/bin/env python
# -*- coding: utf-8 -*-

from cls_config import *
from cls_s3 import *
from cls_pandas import *

class test(object):

    def __init__(self):
        self.conf = cls_config('config.json')

    def process(self):

        conf = self.conf.getConfig()

        bucket = conf['s3']['bucket']
        key = conf['s3']['key']

        access_key = conf['credential']['access_key']
        secret_key = conf['credential']['secret_key']

        s3 = cls_s3(access_key,secret_key)
        ob = s3.getObject(bucket,key)

        pa = cls_pandas()
        df = pa.read(ob)

        print(df)

if __name__ == '__main__':
    test = test()
    test.process()