
I'm writing an AWS Lambda function in Python 3.6. I have a large number of large, space-separated text files and I need to loop through these files and pull out the first N (in this case 1000) lines of text. Once I have those lines, I need to put them into a new file and upload that to S3.

I'm also not a Python developer, so the language and environment are new to me.

Right now I'm collecting the S3 object summaries. For each of those, I run a check on it, get the object's data, open that as a file-like object, open the output variable as a file-like object, and then do my processing.

I've given my Lambda 3GB of RAM, but it runs out of memory before it can process any files (each file is about 800MB and there are about 210 of them).

    for item in object_summary:
        # Check if the object exists, and skip it if so
        try:
            head_object_response = s3Client.head_object(Bucket=target_bucket_name, Key=item)
            logger.info('%s: Key already exists.' % item)
        except:
            # if the key does not exist, we need to swallow the 404 that comes from boto3
            pass

        # and then do our logic to headify the files
        logger.info('Key does not exist in target, headifying: %s' % item)

        # If the file doesn't exist, get the full object
        s3_object = s3Client.get_object(Bucket=inputBucketName, Key=item)
        long_file = s3_object['Body']._raw_stream.data
        file_name = item
        logger.info('%s: Processing 1000 lines of input.' % file_name)

        '''
        Looks like the Lambda hits a memory limit on the line below.
        It crashes with 2500MB of memory used, the file it's trying 
        to open at that stage is 800MB large which puts it over the 
        max allocation of 3GB
        '''
        try:
            with open(long_file, 'r') as input_file, open(file_name, 'w') as output_file:
                for i in range(1000):
                    output_file.write(input_file.readline())
        except OSError as exception:
            if exception.errno == 36:
                logger.error('File name: %s' % exception.filename)
                logger.error(exception.__traceback__)

I put the whole function above for completeness, but I think the specific area I can improve is the try/with block that handles the file processing.

Have I got that right? Is there anywhere else I can improve it?

Alex
  • What is your concurrency set to? The memory allocation is for all running lambdas, so if you are trying to process >2 files at once this may be your problem. – AndrewH Jul 17 '19 at 23:01
  • This may be a better fit over at Code Review: https://codereview.stackexchange.com/ – Nathan Jul 17 '19 at 23:03
  • Or, read the file byte by byte from s3 instead of all at once into memory: https://stackoverflow.com/a/40661459/5724723, i.e. iterate through `_raw_stream` – AndrewH Jul 17 '19 at 23:07
  • "I put the whole function" - no, you did not.. Please, do that. Put all the code that seens like "boiler plate" - without that these are a bunch of lines that are hard to refactor in any way meaningful. The fix for this will be to refactor things into lazy generators, but I can't take lines to outside a function I am not seeing! – jsbueno Jul 18 '19 at 00:19
  • 2
    @AndrewH the AWS Lambda memory limit setting is not for "all running lambdas". It applies to each Lambda instance separately. https://docs.aws.amazon.com/lambda/latest/dg/resource-model.html "Memory – The amount of memory available to the function during execution." – Mark B Jul 18 '19 at 14:48

2 Answers


Try checking your logs or traceback for the exact line of the error - the line you point to in the code really will read only one line at a time (with the OS caching things behind the scenes, but that would be a couple hundred KB at most).

It is more likely that methods such as s3Client.get_object(Bucket=inputBucketName, Key=item) or attribute accesses like long_file = s3_object['Body']._raw_stream.data are eagerly bringing the file's actual contents into memory.

You have to check the docs for those, and for how to stream data from S3 and dump it to disk instead of having it all in memory. The fact that the attribute is named ._raw_stream, beginning with an _, indicates it is a private attribute, and it is not advised to use it directly.

Also, you are using pass, which does nothing - the rest of the loop will run the same way; you probably want continue there. And a bare except clause that does not log the error is among the worst mistakes possible in Python code - if there is an error there, you have to log it, not just pretend it did not happen.
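
A minimal sketch of that approach, sticking to boto3's public interface (s3Client, logger, object_summary and the bucket names are assumed from the question; wrapping the body with codecs is just one way to read decoded lines lazily and spool them to Lambda's /tmp):

    import codecs

    from botocore.exceptions import ClientError

    for item in object_summary:
        # Skip keys that already exist in the target bucket
        try:
            s3Client.head_object(Bucket=target_bucket_name, Key=item)
            logger.info('%s: Key already exists, skipping.', item)
            continue
        except ClientError as exc:
            if exc.response['Error']['Code'] != '404':
                # Anything other than "not found" is a real error - log it and move on
                logger.exception('%s: unexpected error checking target key', item)
                continue

        logger.info('Key does not exist in target, headifying: %s', item)
        s3_object = s3Client.get_object(Bucket=inputBucketName, Key=item)

        # Wrap the streaming body so lines are decoded lazily,
        # instead of pulling the whole 800MB object into memory
        reader = codecs.getreader('utf-8')(s3_object['Body'])

        tmp_path = '/tmp/headified.txt'  # Lambda's writable scratch space
        with open(tmp_path, 'w') as output_file:
            for _ in range(1000):
                line = reader.readline()
                if not line:  # fewer than 1000 lines in the source file
                    break
                output_file.write(line)

        s3Client.upload_file(tmp_path, target_bucket_name, item)

readline() here pulls data from the underlying HTTP stream in small chunks, so memory use stays roughly proportional to the longest line rather than to the file size.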

jsbueno
  • These are some really good suggestions, thank you. I'm still learning Python, so things like eager loading and the specifics of Python syntax are new to me as well. I appreciate the advice – Alex Jul 18 '19 at 20:48

Think more simply.

I suggest handling just a single file per Lambda call; then you should be within your 3GB easily. In any case, as the number of files to process grows, your Lambda function will eventually hit the 15-minute maximum execution limit, so it's better to think of Lambda processing in roughly consistently sized chunks.

If necessary you can introduce an intermediate chunker/dispatcher Lambda function to chunk out the processing (a sketch follows the code below).

If your files are really only 800MB, I would think your processing should be OK in terms of memory. The input file may still be streaming in; you may want to try deleting it (del s3_object['Body']?).

    import codecs
    from io import StringIO

    from botocore.exceptions import ClientError

    def handle_file(key_name):
        # Check if the object already exists in the target bucket, and skip it if so
        try:
            head_object_response = s3Client.head_object(
                Bucket=target_bucket_name,
                Key=key_name
            )
            logger.info(f'{key_name} - Key already exists.')
            return None, 0
        except ClientError as e:
            logger.exception(e)
            logger.info(f'{key_name} - Does not exist.')

        # If the file doesn't exist in the target, get the full object
        s3_object = s3Client.get_object(Bucket=inputBucketName, Key=key_name)
        # Wrap the streaming body so lines are decoded lazily instead of
        # loading the whole object into memory at once
        long_file = codecs.getreader('utf-8')(s3_object['Body'])

        max_lines = 1000
        lines = []
        for line in long_file:
            lines.append(line)
            if len(lines) == max_lines:
                break

        output = StringIO()
        output.writelines(lines)
        response = s3Client.put_object(
            Body=output.getvalue(),
            Bucket=outputBucketName,
            Key=key_name
        )
        return key_name, len(lines)
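
For the one-file-per-call / chunking idea above, a minimal sketch of a dispatcher Lambda that fans out one asynchronous invocation per key (the worker function name headify-worker, the bucket name, and the event shape are made up for illustration):

    import json

    import boto3

    inputBucketName = 'my-input-bucket'      # placeholder - use the question's input bucket
    lambda_client = boto3.client('lambda')
    s3 = boto3.client('s3')

    def dispatch_handler(event, context):
        # List every key in the input bucket and hand each one to the worker
        # Lambda asynchronously, so each worker only ever holds a single file
        paginator = s3.get_paginator('list_objects_v2')
        for page in paginator.paginate(Bucket=inputBucketName):
            for obj in page.get('Contents', []):
                lambda_client.invoke(
                    FunctionName='headify-worker',   # hypothetical function wrapping handle_file
                    InvocationType='Event',          # asynchronous, fire-and-forget
                    Payload=json.dumps({'key_name': obj['Key']}),
                )

Each worker invocation then gets its own memory allocation and 15-minute window, and the files end up being processed in parallel.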

As a side note, I really recommend zappa if you're using Lambda - it makes Lambda development fun. (It would also make chunking out code sections easy in the same codebase using its asynchronous task execution.)

monkut
  • Awesome feedback monkut, thanks! It never occurred to me to just write the lambda to handle a single file and call it repeatedly. I'll also take a look at zappa, thank you for the suggestion :) – Alex Jul 18 '19 at 20:46
  • Also, handling 1 file per lambda call allows you to process all files in parallel. ;) – monkut Jul 19 '19 at 01:24