
I am trying to process all records of a large file from S3 using Python, in batches of N lines. I have to fetch N lines per iteration, and each line contains a JSON object.

Here are some things I've already tried:

1) I tried the solution mentioned here: Streaming in / chunking csv's from S3 to Python, but it breaks my JSON structure while reading the data as bytes.

2)

obj = s3.get_object(Bucket=bucket_name, Key=fname)
data = obj['Body'].read().decode('utf-8').splitlines()

It takes a long time to read a large file with 100k lines. It returns a list of lines, which I can then iterate over to take N lines at a time from the data variable (see the sketch below).
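A minimal sketch of slicing that list into batches of N (the value of N here is illustrative):

import json

N = 1000  # illustrative batch size

# data is the list of lines produced by splitlines() above
for i in range(0, len(data), N):
    batch = [json.loads(line) for line in data[i:i + N]]
    print(len(batch))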

Vallabh

2 Answers


Probably smart_open does the trick.

pip install smart_open[s3] 

After installing it...

import json

import boto3
from smart_open import open

client = boto3.client("s3")
transport_params = {'client': client}

# open in text ('r') mode, not 'wb', so we can iterate over decoded lines
with open('s3://%s/%s' % (bucket_name, fname), 'r', transport_params=transport_params, encoding='utf-8') as f:
    for line in f:
        print(json.loads(line))

You could use iter_lines too:

obj = s3.get_object(Bucket=bucket_name, Key=fname)
for line in obj['Body'].iter_lines(chunk_size=1024, keepends=False):
    print(json.loads(line))
goncuesma
  • Hi.. this will give me a single line, and the chunk size is in bytes, so it won't return N lines as required. – Vallabh Mar 15 '21 at 06:14
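As the comment notes, iter_lines yields one line at a time and chunk_size counts bytes, not lines. A minimal sketch of grouping those lines into batches of N with itertools.islice (the batch size N is illustrative and not part of the original answer):

import json
from itertools import islice

import boto3

s3 = boto3.client("s3")
N = 50  # illustrative batch size

obj = s3.get_object(Bucket=bucket_name, Key=fname)
lines = obj['Body'].iter_lines(chunk_size=1024, keepends=False)

# islice consumes up to N lines from the iterator on each pass
while True:
    batch = [json.loads(line) for line in islice(lines, N)]
    if not batch:
        break
    print(len(batch))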

For those looking for a similar solution: I used the pandas library to get N lines per iteration of the loop.

Below is my code implementation; it gives 50 lines per iteration:

import io
import pandas as pd

# wrap the decoded JSON-lines string so read_json can chunk it
for records in pd.read_json(io.StringIO(obj['Body'].read().decode('utf-8')), lines=True, chunksize=50):
    print(records)
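Note that this still reads the whole object into memory before chunking. A possible variation, assuming pd.read_json will accept a streamed text handle from smart_open (untested sketch), is to let read_json pull 50 lines at a time directly from the stream:

import boto3
import pandas as pd
from smart_open import open

client = boto3.client("s3")
transport_params = {'client': client}

# stream the object instead of reading it fully into memory
with open('s3://%s/%s' % (bucket_name, fname), 'r', transport_params=transport_params, encoding='utf-8') as f:
    for records in pd.read_json(f, lines=True, chunksize=50):
        print(records)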
Vallabh