
I am trying to download a CSV file from an S3 bucket using the s3fs library. I noticed that writing a new CSV with pandas altered the data in some way, so I want to download the file directly in its raw state.

The documentation has a download function but I do not understand how to use it:

download(self, rpath, lpath[, recursive]): Alias of FilesystemSpec.get.

Here's what I tried:

import pandas as pd
import datetime
import os
import s3fs
import numpy as np

#Creds for s3
fs = s3fs.S3FileSystem(key=mykey, secret=mysecretkey)
bucket = "s3://mys3bucket/mys3bucket"
files = fs.ls(bucket)[-3:]


#download files:
for file in files:
    with fs.open(file) as f:
        fs.download(f,"test.csv")

AttributeError: 'S3File' object has no attribute 'rstrip'
Andrew Gaul
Jacky

2 Answers

for file in files:
    # Pass the remote path (a str), not an open S3File; name each local copy after its key
    fs.download(file, os.path.basename(file))

Modified to download all files in the directory:

import os
import s3fs

# Credentials for S3 (mykey and mysecretkey stand in for your own values)
fs = s3fs.S3FileSystem(key=mykey, secret=mysecretkey)
bucket = "s3://mys3bucket/mys3bucket"

# files lists every object under the bucket path.
files = fs.ls(bucket)

for file in files:
    # Save each object under its own name so later downloads do not overwrite earlier ones
    fs.download(file, os.path.basename(file))
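
Since the goal in the question is an unaltered copy, another option is to stream the bytes yourself instead of calling download. This is only a minimal sketch: copy_raw is a hypothetical helper name, and it assumes fs is an S3FileSystem (or anything whose open(path, "rb") returns a readable binary file object).

```python
import shutil


def copy_raw(fs, rpath, lpath):
    """Stream the remote object at rpath into the local file lpath.

    copy_raw is a hypothetical helper; fs is assumed to expose
    open(path, mode) the way s3fs.S3FileSystem does.
    """
    # Binary mode on both ends: no decoding and no newline translation.
    with fs.open(rpath, "rb") as src, open(lpath, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return lpath
```

With the variables from the code above, a call would look like copy_raw(fs, files[0], "test.csv"). Nothing parses the CSV along the way, so the local file should hold the same bytes as the object in the bucket.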
Jacky

I'm going to copy my answer here as well since I used this in a more general case:

# Access Pando
import os
import s3fs

# URL blocked out as "enter url here" for security reasons
fs = s3fs.S3FileSystem(anon=True, client_kwargs={'endpoint_url': "enter url here"})

# List the objects under a path
# [-3:] keeps only the last three entries so a test run stays small
files = fs.ls('hrrr/sfc/20190101')[-3:]

# Make a staging directory to hold the downloaded files
os.makedirs("Staging", exist_ok=True)

# Copy each file into that directory under its own file name
for file in files:
    name = os.path.basename(file)
    path = os.path.join("Staging", name)
    print(path)
    fs.download(file, path)

Note that the documentation for this particular Python package is fairly sparse. I was able to find some documentation of the arguments s3fs takes here (https://readthedocs.org/projects/s3fs/downloads/pdf/latest/). The full argument list is toward the end, though it doesn't say what the parameters mean. Here's a general guide to fs.download:

-arg1 (rpath) is the remote path you are downloading from, given as a string. As in the code above, the easiest way to obtain it is to run fs.ls on your S3 bucket and save the result to a variable. Passing an open file object here instead of a string raises the AttributeError the OP got, because the path handling calls rstrip on it

-arg2 (lpath) is the local destination, including the file name. I have this defined as a path variable

-arg3 (recursive) is an optional parameter that makes the download recursive
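
To make the roles of rpath and lpath concrete, here is a small sketch. download_all is a hypothetical helper name, not part of s3fs, and it assumes fs is an S3FileSystem (or any object with a download(rpath, lpath) method) and that remote_paths is a list of path strings such as the one fs.ls returns.

```python
import os


def download_all(fs, remote_paths, dest_dir):
    """Download each remote path into dest_dir under its base name.

    download_all is a hypothetical helper; fs is assumed to provide
    download(rpath, lpath) the way s3fs.S3FileSystem does.
    """
    os.makedirs(dest_dir, exist_ok=True)
    local_paths = []
    for rpath in remote_paths:
        if not isinstance(rpath, str):
            # download() expects a path string, not an open file object
            raise TypeError(f"rpath must be a str, got {type(rpath).__name__}")
        lpath = os.path.join(dest_dir, os.path.basename(rpath))
        fs.download(rpath, lpath)
        local_paths.append(lpath)
    return local_paths
```

With the objects from the answers above, a call could look like download_all(fs, files, "Staging"). The explicit type check is a design choice for this sketch: it reports the mistake from the question up front rather than letting it surface as an rstrip error.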

Zach Rieck