2

I want to read a large CSV file from an S3 bucket, but I don't want the whole file to be downloaded into memory. Instead, I want to somehow stream the file in chunks and process each chunk.

This is what I have done so far, but I don't think it will solve the problem.

import logging
import boto3
import codecs
import os
import csv

LOGGER = logging.getLogger()
LOGGER.setLevel(logging.INFO)

s3 = boto3.client('s3')


def lambda_handler(event, context):
    # retrieve bucket name and file_key from the S3 event
    bucket_name = event['Records'][0]['s3']['bucket']['name']
    file_key = event['Records'][0]['s3']['object']['key']
    chunk, chunksize = [], 1000
    if file_key.endswith('.csv'):
        LOGGER.info('Reading {} from {}'.format(file_key, bucket_name))

        # get the object
        obj = s3.get_object(Bucket=bucket_name, Key=file_key)
        file_object = obj['Body']
        count = 0
        for i, line in enumerate(file_object):
            count += 1
            if i % chunksize == 0 and i > 0:
                process_chunk(chunk)
                del chunk[:]
            chunk.append(line)

        # process whatever is left in the final, partial chunk
        if chunk:
            process_chunk(chunk)


def process_chunk(chunk):
    print(len(chunk))
noobie-php
  • https://stackoverflow.com/questions/28618468/read-a-file-line-by-line-from-s3-using-boto – Ninad Gaikwad Feb 15 '19 at 16:29
  • Use https://botocore.amazonaws.com/v1/documentation/api/latest/reference/response.html#botocore.response.StreamingBody iter_chunks() or iter_lines() – balderman Feb 15 '19 at 17:11
  • Side-note: As long as the object is under 500MB, you could download it to `/tmp` and then just process it like a normal local file. – John Rotenstein Feb 16 '19 at 05:53
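
Following the iter_lines() / iter_chunks() suggestion in the comment above, a minimal sketch of a chunked reader built on botocore's StreamingBody might look like this (the chunk size, the UTF-8 encoding, and the process_chunk() call are placeholders to adapt to your own processing):

import boto3

s3 = boto3.client('s3')

def stream_csv_in_chunks(bucket_name, file_key, chunksize=1000):
    """Stream an S3 object line by line and hand it off in fixed-size chunks."""
    obj = s3.get_object(Bucket=bucket_name, Key=file_key)
    body = obj['Body']  # botocore StreamingBody

    chunk = []
    # iter_lines() reads the body in small network reads and yields complete
    # lines, so the whole object is never held in memory at once
    for line in body.iter_lines():
        chunk.append(line.decode('utf-8'))  # assumes a UTF-8 encoded CSV
        if len(chunk) == chunksize:
            process_chunk(chunk)  # placeholder for your own processing
            chunk = []
    if chunk:  # flush the final, partial chunk
        process_chunk(chunk)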

2 Answers

3

This will do what you want to achieve. It won't download the whole file into memory; instead it downloads it in chunks, processes them, and proceeds:

  from smart_open import smart_open
  import csv

  def get_s3_file_stream(s3_path):
      """
      This function will return a stream of the s3 file.
      The s3_path should be of the format: '<bucket_name>/<file_path_inside_the_bucket>'
      """
      # Full path with credentials; aws_access_key_id and aws_secret_access_key
      # are assumed to be defined elsewhere (e.g. loaded from your configuration):
      complete_s3_path = 's3://' + aws_access_key_id + ':' + aws_secret_access_key + '@' + s3_path
      return smart_open(complete_s3_path, encoding='utf8')

  def download_and_process_csv(s3_path):
      datareader = csv.DictReader(get_s3_file_stream(s3_path))
      for row in datareader:
          yield process_csv(row) # write a function to do whatever you want to do with the CSV
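
Since download_and_process_csv is a generator, nothing is actually read or processed until it is iterated. A minimal, hypothetical caller (the bucket/key path and the per-row process_csv helper are placeholders) could look like this:

  def process_csv(row):
      # placeholder: replace with whatever you want to do with each row
      print(row)

  for _ in download_and_process_csv('my-bucket/path/to/file.csv'):
      pass  # each iteration streams and processes one CSV row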
  • What is the utility of the yield statement when a row is going to be processed? I'm writing a process_csv(row) method and the row is never printed. I'm not sure how to use this code and how to process a particular row. – eduardosufan Sep 21 '22 at 13:56
-3

Did you try AWS Athena (https://aws.amazon.com/athena/)? It is serverless, pay-as-you-go, and extremely good: it can query the file in place without downloading it. BlazingSQL is open source and is also useful for big-data problems.
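
As a rough illustration of that approach, a sketch of running an Athena query over a CSV in S3 with boto3 might look like the following (the database, table, and output-location names are made up, and it assumes an external table has already been defined over the bucket, e.g. with a CREATE EXTERNAL TABLE statement):

  import time
  import boto3

  athena = boto3.client('athena')

  def query_csv_with_athena():
      # assumes a database 'my_db' and an external table 'my_csv_table'
      # already point at the CSV files in the bucket
      execution = athena.start_query_execution(
          QueryString='SELECT * FROM my_csv_table LIMIT 100',
          QueryExecutionContext={'Database': 'my_db'},
          ResultConfiguration={'OutputLocation': 's3://my-athena-results-bucket/'},
      )
      query_id = execution['QueryExecutionId']

      # poll until the query finishes
      while True:
          status = athena.get_query_execution(QueryExecutionId=query_id)
          state = status['QueryExecution']['Status']['State']
          if state in ('SUCCEEDED', 'FAILED', 'CANCELLED'):
              break
          time.sleep(1)

      if state == 'SUCCEEDED':
          # results are paginated; this fetches the first page only
          results = athena.get_query_results(QueryExecutionId=query_id)
          for row in results['ResultSet']['Rows']:
              print([col.get('VarCharValue') for col in row['Data']])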

prashant thakre