
I am trying to read files in my buckets that are a mix of csv/xlsx, and I am getting a 403, which I do not quite understand since I am setting AWS creds through the credential chain and env vars. I am using a URL over HTTPS; when I switch the URL to s3://, it tells me the bucket doesn't exist, which it definitely does. I have s3fs installed as well.

TL;DR: HTTPS URLs throw 403s; s3:// URLs throw "bucket doesn't exist" when the bucket definitely exists.

Code:

import boto3
import pandas as pd


def get_file(project_name, uid) -> list:
    # Collect HTTPS links to every object under <project_name>/raw_datasets
    files = []
    s3 = boto3.resource('s3', region_name='us-east-2')
    bucket_str = 'stackstr-' + uid
    url = 'https://' + bucket_str + '.s3.us-east-2.amazonaws.com/'
    bucket = s3.Bucket(bucket_str)
    for obj in bucket.objects.filter(Prefix=project_name + '/raw_datasets'):
        link = url + obj.key
        files.append(link)
    print(files)
    return files


def generate_dataframes(files) -> list:
    # Load each file into a DataFrame based on its extension
    df_list = []
    for fname in files:
        ext = fname.split(".")[-1]
        if ext == 'xlsx':
            df = pd.read_excel(fname)
            df_list.append(df)
        elif ext == 'csv':
            df = pd.read_csv(fname)
            df_list.append(df)

    print(df_list)
    return df_list
  • Does this answer your question? [How to import a text file on AWS S3 into pandas without writing to disk](https://stackoverflow.com/questions/37703634/how-to-import-a-text-file-on-aws-s3-into-pandas-without-writing-to-disk) – Michael Delgado Jun 08 '20 at 02:11
  • @MichaelDelgado Not really. I have read that post, and according to another post on Stack Overflow you can pass full S3 URLs into read_excel as well as read_csv. I am getting creds from env vars via the AWS credential chain, but it's apparently not working within the second function? Unless you can't run a request against an S3 URL? – dmc94 Jun 08 '20 at 02:15
  • Specifically, if reading directly from the URL with pd.read_csv or read_excel, don't include the `.s3.us-east-2.amazonaws.com` suffix; just `s3://[bucket_name]/[blob-path]`. See the [s3fs docs](https://s3fs.readthedocs.io/en/latest/) for examples/credential information. – Michael Delgado Jun 08 '20 at 02:17
  • @MichaelDelgado Awesome, thank you! Can't believe I missed that. – dmc94 Jun 08 '20 at 02:26

1 Answer


Michael Delgado provided the correct answer in the comments, quoted below:

Specifically, if reading directly from the URL with pd.read_csv or read_excel, don't include the `.s3.us-east-2.amazonaws.com` suffix; just `s3://[bucket_name]/[blob-path]`. See the [s3fs docs](https://s3fs.readthedocs.io/en/latest/) for examples/credential information.
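
For reference, a minimal sketch of both functions rewritten along those lines (assuming s3fs is installed, credentials resolve through the standard AWS chain, and an Excel engine such as openpyxl is available for the .xlsx files; the bucket and prefix names are taken from the question):

    import boto3
    import pandas as pd


    def get_file(project_name, uid) -> list:
        # Build s3:// URLs instead of HTTPS links so pandas/s3fs can
        # open them directly with the ambient AWS credentials.
        s3 = boto3.resource('s3', region_name='us-east-2')
        bucket_str = 'stackstr-' + uid
        bucket = s3.Bucket(bucket_str)
        return ['s3://' + bucket_str + '/' + obj.key
                for obj in bucket.objects.filter(Prefix=project_name + '/raw_datasets')]


    def generate_dataframes(files) -> list:
        # pandas hands s3:// paths to s3fs, which signs each request
        # with your credentials.
        df_list = []
        for fname in files:
            ext = fname.split('.')[-1]
            if ext == 'xlsx':
                df_list.append(pd.read_excel(fname))
            elif ext == 'csv':
                df_list.append(pd.read_csv(fname))
        return df_list

This also explains both symptoms: a plain HTTPS GET against a private bucket is unsigned, hence the 403, while keeping the `.s3.us-east-2.amazonaws.com` suffix inside an s3:// URL makes s3fs look for a bucket with that literal name, hence "bucket doesn't exist".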
