
I have already read through the answers available here and here, and they do not help.

I am trying to read a CSV object from an S3 bucket and have been able to read the data successfully using the following code.

from boto.s3.connection import S3Connection
from boto.s3.key import Key

srcFileName = "gossips.csv"

def on_session_started():
    print("Starting new session.")
    conn = S3Connection()
    my_bucket = conn.get_bucket("randomdatagossip", validate=False)
    print("Bucket Identified")
    print(my_bucket)
    key = Key(my_bucket, srcFileName)
    key.open()
    print(key.read())
    conn.close()

on_session_started()

However, if I try to read the same object using pandas as a data frame, I get an error, the most common one being S3ResponseError: 403 Forbidden.

import smart_open

def on_session_started2():
    print("Starting Second new session.")
    conn = S3Connection()
    my_bucket = conn.get_bucket("randomdatagossip", validate=False)
    # url = "https://s3.amazonaws.com/randomdatagossip/gossips.csv"
    # urllib2.urlopen(url)

    for line in smart_open.smart_open('s3://my_bucket/gossips.csv'):
        print(line)
    # data = pd.read_csv(url)
    # print(data)

on_session_started2()

What am I doing wrong? I am on Python 2.7 and cannot use Python 3.

Drj

  • Don't use those outdated examples without knowing what you are doing. Go check out boto3. – mootmoot Apr 12 '17 at 14:25
  • Since you're already using [`smart_open`](https://github.com/RaRe-Technologies/smart_open), just do `data = pd.read_csv(smart_open.smart_open('s3://randomdatagossip/gossips.csv'))`. – kepler Jul 30 '18 at 11:06
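Following kepler's comment, a minimal sketch of that approach (note that the failing loop above passes the literal string `my_bucket` inside the s3:// URI rather than the actual bucket name, which by itself would break the read):

import smart_open
import pandas as pd

# Bucket and key taken from the question; the real bucket name goes in the URI
data = pd.read_csv(smart_open.smart_open('s3://randomdatagossip/gossips.csv'))
print(data.head())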

3 Answers


Here is what I have done to successfully read a data frame from a CSV on S3.

import pandas as pd
import boto3

bucket = "yourbucket"
file_name = "your_file.csv"

# Create an S3 client using the default credentials and configuration
s3 = boto3.client('s3')

# Fetch the object stored under that key in the bucket
obj = s3.get_object(Bucket=bucket, Key=file_name)

# obj['Body'] is a streaming file-like object; pandas can read from it directly
initial_df = pd.read_csv(obj['Body'])
Drj
  • This does not work with recent versions of pandas. See my answer to a similar question https://stackoverflow.com/a/46323684/1451649 that works for pandas 0.20.3. – jpobst Sep 20 '17 at 13:52
  • How can I load only a subset of the csv instead of the whole file? – HuLu ViCa Jun 09 '23 at 17:08
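For the subset question in the comment above, one hedged option is pandas' own nrows parameter, shown here with the hypothetical bucket and key from this answer:

import io
import pandas as pd
import boto3

s3 = boto3.client('s3')
obj = s3.get_object(Bucket="yourbucket", Key="your_file.csv")

# nrows limits what pandas parses, though the object is still downloaded in full
subset_df = pd.read_csv(io.BytesIO(obj['Body'].read()), nrows=1000)

To avoid downloading the whole object in the first place, get_object also accepts a Range parameter (an S3 feature, not pandas) that fetches only the leading bytes.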

This worked for me.

import pandas as pd
import boto3
import io

s3_file_key = 'data/test.csv'
bucket = 'data-bucket'

s3 = boto3.client('s3')
obj = s3.get_object(Bucket=bucket, Key=s3_file_key)

# Buffer the object's bytes in memory so pandas gets a seekable file-like object
initial_df = pd.read_csv(io.BytesIO(obj['Body'].read()))
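As a side note, and assuming the optional s3fs package is installed in the environment, recent pandas versions can also resolve the S3 path themselves:

import pandas as pd

# Requires the optional s3fs dependency; pandas handles the s3:// URL itself
df = pd.read_csv('s3://data-bucket/data/test.csv')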
  • Works with Parquet as well, as long as it is a single Parquet file under the S3 key: df = pd.read_parquet(io.BytesIO(obj['Body'].read())) – mathisfun Dec 14 '20 at 23:10
  • Finally found something that worked out of the box for me. Thank you! – Sagar Sep 21 '22 at 15:55

Maybe you can try to use pandas' read_sql and PyAthena:

from pyathena import connect
import pandas as pd

conn = connect(s3_staging_dir='s3://bucket/folder', region_name='region')
df = pd.read_sql('select * from database.table', conn)
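Here s3_staging_dir is the S3 location where Athena writes its query results and region_name is the AWS region; both are placeholders that need real values. Note this runs a query through Athena rather than reading the CSV object directly.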
dasilvadaniel