
I recently migrated my workflow from AWS to a local computer. The files I need are still stored in private S3 buckets. I've set up my environment variables correctly, so all I need to do is import s3fs and then I can read files very conveniently in pandas like this:

pd.read_csv('S3://my-bucket/some-file.csv')

And it works perfectly. This is nice because I don't need to change any code, and reading and writing files just works.
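For reference, my understanding is that this is roughly equivalent to opening the file through s3fs explicitly, something like the sketch below (the bucket and file names are just placeholders):

import pandas as pd
import s3fs

# Credentials are picked up from my environment variables automatically.
fs = s3fs.S3FileSystem()

# Roughly what pandas does behind the scenes for an s3:// URL.
with fs.open('my-bucket/some-file.csv', 'rb') as f:
    df = pd.read_csv(f)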

However, reading files from S3 is incredibly slow, even more so now that I'm working locally. From some googling, it appears that s3fs supports caching files locally: after the first read from S3, s3fs can store a copy of the file on disk, and the next time the file is read it comes from the local copy, which is much faster. This would be perfect for my workflow, where I iterate on the same data many times.
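From the fsspec docs (fsspec being the library s3fs is built on), there seem to be "filecache"/"simplecache" protocols that do exactly this kind of local on-disk caching. Below is a sketch of how I would expect that to look; the cache directory and option layout are just my reading of those docs, and I don't know how (or whether) this plugs into pandas' own S3 handling:

import fsspec
import pandas as pd

# "filecache::" wraps the S3 filesystem with a local on-disk cache;
# cached copies are written to the directory given by cache_storage.
with fsspec.open(
    "filecache::s3://my-bucket/some-file.csv",
    mode="rb",
    filecache={"cache_storage": "/tmp/s3-cache"},  # local cache directory (my choice)
) as f:
    df = pd.read_csv(f)

Even if something like that works, it gives up the plain pd.read_csv('s3://...') syntax, which is the part I would like to keep.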

However, I can't find anything about how to set this up with pandas' native s3fs integration. This post describes how to cache files with s3fs, but the wiki linked in the answer is for something called fuse-s3fs. I don't see a way to specify a use_cache option in native s3fs.

In pandas, all the s3fs setup is done behind the scenes, and right now it seems that, by default, when I read a file from S3 and then read the same file again, the second read takes just as long as the first, so I don't believe any caching is taking place.
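For concreteness, this is the kind of check I mean (the bucket and file name are placeholders): timing the same read twice gives roughly the same duration both times.

import time
import pandas as pd

# Read the same S3 file twice and time each read; with caching the
# second read should be much faster, but for me both take about as long.
for attempt in range(2):
    start = time.perf_counter()
    pd.read_csv('s3://my-bucket/some-file.csv')
    print(f"read {attempt + 1}: {time.perf_counter() - start:.1f}s")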

Does anyone know how to set up pandas with s3fs so that it caches all the files it has read?

Thanks!
