The previous answers are a good basic start, but I wanted to achieve the more advanced objectives listed below. Overall I feel awswrangler is the way to go.
- read gzip-compressed files (.csv.gz)
- read only the first 5 lines without downloading the full file
- explicitly pass credentials (make sure you don't commit them to code!!)
- use full s3 paths
Here are a couple of things that worked:
import boto3
import pandas as pd
import awswrangler as wr
boto3_creds = dict(region_name="us-east-1", aws_access_key_id='', aws_secret_access_key='')
boto3.setup_default_session(**boto3_creds)
s3 = boto3.client('s3')
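# sketch: to keep the keys out of the code, load them from the environment instead of hardcoding them
# (assumes the standard AWS variable names are set in your environment; adjust to your setup)
import os
boto3_creds = dict(
    region_name=os.environ.get('AWS_DEFAULT_REGION', 'us-east-1'),
    aws_access_key_id=os.environ['AWS_ACCESS_KEY_ID'],
    aws_secret_access_key=os.environ['AWS_SECRET_ACCESS_KEY'],
)
boto3.setup_default_session(**boto3_creds)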
# read first 5 lines from file path
obj = s3.get_object(Bucket='bucket', Key='path.csv.gz')
df = pd.read_csv(obj['Body'], nrows=5, compression='gzip')
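# sketch: the same thing with a full s3 path through awswrangler should also work,
# since pandas kwargs such as nrows and compression are passed through (the path is a placeholder)
df_head = wr.s3.read_csv('s3://bucket/path.csv.gz', nrows=5, compression='gzip')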
# read the first 5 lines of the first file in a directory
first_key = wr.s3.list_objects('s3://bucket/path/')[0]
dft_xp = pd.concat(list(wr.s3.read_csv(first_key, chunksize=5, nrows=5, compression='gzip')))
# read all files into pandas
df_xp = wr.s3.read_csv(wr.s3.list_objects('s3://bucket/path/'), compression='gzip')
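# sketch: if the directory is too big for memory, the same chunksize trick works across all files;
# process() is just a placeholder for whatever you do with each chunk
for chunk in wr.s3.read_csv(wr.s3.list_objects('s3://bucket/path/'), chunksize=100000, compression='gzip'):
    process(chunk)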
I did not use s3fs because I wasn't sure whether it uses boto3 under the hood.
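For completeness, here is a sketch of the s3fs route I skipped: recent pandas versions route s3:// paths through s3fs/fsspec when storage_options is passed, and as far as I know s3fs is built on botocore/aiobotocore rather than a boto3 session.
# read the first 5 lines straight from a full s3 path via s3fs/fsspec
df_s3fs = pd.read_csv('s3://bucket/path.csv.gz', nrows=5, compression='gzip',
                      storage_options={'key': '', 'secret': ''})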
For distributed compute with dask, the following worked, but it uses s3fs afaik, and apparently gzip can't be parallelized.
import dask.dataframe as dd
# preview the first 5 rows
dd.read_csv('s3://bucket/path/*', storage_options={'key': '', 'secret': ''}, compression='gzip').head(5)
# lazily read all files
dd.read_csv('s3://bucket/path/*', storage_options={'key': '', 'secret': ''}, compression='gzip')
# Warning: gzip compression does not support breaking apart files. Please ensure that each individual file can fit in memory.
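To get rid of that warning, passing blocksize=None should make dask treat each gzip file as a single partition (every file still has to fit in memory):
# one partition per gzip file, which avoids the "breaking apart files" warning
ddf = dd.read_csv('s3://bucket/path/*', storage_options={'key': '', 'secret': ''},
                  compression='gzip', blocksize=None)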