
I am trying to process all records of a large file from S3 using Python, in batches of N lines. I have to fetch N lines per iteration, and each line contains a JSON object.

Here are some things I've already tried:

1) I tried the solution mentioned here: Streaming in / chunking csv's from S3 to Python, but it breaks my JSON structure while reading the data as bytes.

2)

obj = s3.get_object(Bucket=bucket_name, Key=fname)
data = obj['Body'].read().decode('utf-8').splitlines()

It takes a long time to read a large file with 100k lines. It returns a list of lines, which I can then iterate over to take N lines at a time from the data variable (see the sketch below).
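A minimal sketch of slicing that list into batches of N (the value of N here is illustrative):

import json

N = 1000  # illustrative batch size

# data is the list of lines produced by splitlines() above
for i in range(0, len(data), N):
    batch = [json.loads(line) for line in data[i:i + N]]
    print(len(batch))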

Vallabh

2 Answers


Probably smart_open does the trick.

pip install smart_open[s3] 

After installing it...

import json

import boto3
from smart_open import open

client = boto3.client("s3")
transport_params = {'client': client}

# open in text ('r') mode, not 'wb', so we can iterate over decoded lines
with open('s3://%s/%s' % (bucket_name, fname), 'r', transport_params=transport_params, encoding='utf-8') as f:
    for line in f:
        print(json.loads(line))

You could use iter_lines too:

obj = s3.get_object(Bucket=bucket_name, Key=fname)
for line in obj['Body'].iter_lines(chunk_size=1024, keepends=False):
    print(json.loads(line))
goncuesma
  • Hi.. this will give me a single line, and the chunk size is in bytes, so it won't return N lines as required. – Vallabh Mar 15 '21 at 06:14
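As the comment notes, iter_lines yields one line at a time and chunk_size counts bytes, not lines. A minimal sketch of grouping those lines into batches of N with itertools.islice (the batch size N is illustrative and not part of the original answer):

import json
from itertools import islice

import boto3

s3 = boto3.client("s3")
N = 50  # illustrative batch size

obj = s3.get_object(Bucket=bucket_name, Key=fname)
lines = obj['Body'].iter_lines(chunk_size=1024, keepends=False)

# islice consumes up to N lines from the iterator on each pass
while True:
    batch = [json.loads(line) for line in islice(lines, N)]
    if not batch:
        break
    print(len(batch))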

For those looking for a similar solution: I used the pandas library to get N lines per iteration of the loop.

Below is my code implementation; it gives 50 lines per iteration:

import io
import pandas as pd

# wrap the decoded JSON-lines string so read_json can chunk it
for records in pd.read_json(io.StringIO(obj['Body'].read().decode('utf-8')), lines=True, chunksize=50):
    print(records)
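Note that this still reads the whole object into memory before chunking. A possible variation, assuming pd.read_json will accept a streamed text handle from smart_open (untested sketch), is to let read_json pull 50 lines at a time directly from the stream:

import boto3
import pandas as pd
from smart_open import open

client = boto3.client("s3")
transport_params = {'client': client}

# stream the object instead of reading it fully into memory
with open('s3://%s/%s' % (bucket_name, fname), 'r', transport_params=transport_params, encoding='utf-8') as f:
    for records in pd.read_json(f, lines=True, chunksize=50):
        print(records)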
Vallabh