
How do I read a parquet file on S3 using Dask and a specific AWS profile (stored in a credentials file)? Dask uses s3fs, which in turn uses boto. This is what I have tried:

>>>import os
>>>import s3fs
>>>import boto3
>>>import dask.dataframe as dd

>>>os.environ['AWS_SHARED_CREDENTIALS_FILE'] = "~/.aws/credentials"

>>>fs = s3fs.S3FileSystem(anon=False, profile_name="some_user_profile")
>>>fs.exists("s3://some.bucket/data/parquet/somefile")
True
>>>df = dd.read_parquet('s3://some.bucket/data/parquet/somefile')
NoCredentialsError: Unable to locate credentials

1 Answer


Never mind, that was easy, but I did not find any reference online, so here it is:

>>>import os
>>>import dask.dataframe as dd
>>>os.environ['AWS_SHARED_CREDENTIALS_FILE'] = "/path/to/credentials"

>>>df = dd.read_parquet('s3://some.bucket/data/parquet/somefile',
                        storage_options={"profile_name": "some_user_profile"})
>>>df.head()
# works: storage_options is forwarded to the s3fs filesystem that Dask creates
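
As a side note, everything in storage_options is passed to the s3fs.S3FileSystem constructor, so other s3fs arguments (explicit keys, client_kwargs for region or endpoint overrides) work the same way. A minimal sketch with placeholder credentials and paths; note that recent s3fs releases spell the profile argument "profile" rather than "profile_name":

>>>opts = {"key": "<aws-access-key-id>",            # placeholder credentials
           "secret": "<aws-secret-access-key>",
           "client_kwargs": {"region_name": "us-east-1"}}
>>>df = dd.read_parquet('s3://some.bucket/data/parquet/somefile',
                        storage_options=opts)
>>>df.to_parquet('s3://some.bucket/data/parquet/copy',  # writing back uses the same mechanism
                 storage_options=opts)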
  • Documentation [here](http://dask.pydata.org/en/latest/remote-data-services.html#s3) - please feel free to submit improvements as a PR if you think it could be clearer. – mdurant Jan 22 '18 at 21:21
  • Thanks for posting both your question and answer online! Hopefully your efforts help others in the future. – MRocklin Jan 22 '18 at 21:32
  • @mdurant thanks I see it now, I did skim over that documentation page but missed it :( – muon Jan 22 '18 at 22:21
  • @muon , no problem! We are aware that the docs pages are rather voluminous :) – mdurant Jan 22 '18 at 22:41
  • this is not working with pd.read_parquet. Getting `read_table() got an unexpected keyword argument 'storage_options'` – Eduardo EPF Dec 03 '21 at 13:54
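
Regarding the last comment: pandas only added a storage_options argument to read_parquet in version 1.2, so with older pandas the unknown keyword is forwarded to pyarrow's read_table, which rejects it. A workaround sketch, assuming the same placeholder profile and a single parquet file (not a partitioned dataset), is to open the file with s3fs yourself and hand the file object to pandas:

>>>import pandas as pd
>>>import s3fs
>>>fs = s3fs.S3FileSystem(profile="some_user_profile")  # "profile_name" on older s3fs versions
>>>f = fs.open("some.bucket/data/parquet/somefile")
>>>pdf = pd.read_parquet(f)   # pandas accepts an open file-like object
>>>f.close()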