I'm writing an AWS Lambda in Python 3.6. I have a large number of large, space-separated text files, and I need to loop through them and pull out the first N (in this case 1000) lines of text from each. Once I have those lines, I need to put them into a new file and upload that to S3.
I'm also not a Python developer, so the language and environment are new to me.
Right now I'm collecting the S3 object summaries, and for each of those, I'm running a check on them and then getting the object's data, opening that as a file-like object and also opening the output variable as a file-like object, and then doing my processing.
I've given my Lambda 3GB of RAM, but it runs out of memory before it can process any files (each file is about 800MB and there are about 210 of them).
for item in object_summary:
    # Check if the object exists, and skip it if so
    try:
        head_object_response = s3Client.head_object(Bucket=target_bucket_name, Key=item)
        logger.info('%s: Key already exists.' % item)
    except:
        # if the key does not exist, we need to swallow the 404 that comes from boto3
        pass
        # and then do our logic to headify the files
        logger.info('Key does not exist in target, headifying: %s' % item)
        # If the file doesn't exist, get the full object
        s3_object = s3Client.get_object(Bucket=inputBucketName, Key=item)
        long_file = s3_object['Body']._raw_stream.data
        file_name = item
        logger.info('%s: Processing 1000 lines of input.' % file_name)
        '''
        Looks like the Lambda hits a memory limit on the line below.
        It crashes with 2500MB of memory used, the file it's trying
        to open at that stage is 800MB large which puts it over the
        max allocation of 3GB
        '''
        try:
            with open(long_file, 'r') as input_file, open(file_name, 'w') as output_file:
                for i in range(1000):
                    output_file.write(input_file.readline())
        except OSError as exception:
            if exception.errno == 36:
                logger.error('File name: %s' % exception.filename)
                logger.error(exception.__traceback__)
I've included the whole function above for completeness, but I think the specific area I can improve is the try: with: block that handles the file processing.
Have I got that right? Is there anywhere else I can improve it?
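For reference, here is a minimal sketch of what the file-processing step might look like if the object body were read in small chunks instead of being loaded whole. The helper name first_n_lines and the chunk_size parameter are hypothetical, not part of the code above. It assumes only that the body exposes a read(amount) method, which the 'Body' returned by get_object does, and that lines are separated by '\n'.

```python
import io


def first_n_lines(stream, n=1000, chunk_size=64 * 1024):
    """Return the first n lines of a binary stream as bytes.

    Reads chunk_size bytes at a time and stops as soon as n complete
    lines have been collected, so memory use stays near chunk_size
    plus the size of the collected lines.
    """
    lines = []
    pending = b''  # trailing partial line carried over between chunks
    while len(lines) < n:
        chunk = stream.read(chunk_size)
        if not chunk:
            break  # end of stream
        pending += chunk
        parts = pending.split(b'\n')
        pending = parts.pop()  # last piece has no newline yet
        for part in parts:
            lines.append(part + b'\n')
            if len(lines) == n:
                break
    # Keep a final unterminated line if the stream ended early.
    if pending and len(lines) < n:
        lines.append(pending)
    return b''.join(lines)


# Local demonstration with an in-memory stream standing in for the S3 body.
sample = io.BytesIO(b''.join(b'line %d\n' % i for i in range(5000)))
head = first_n_lines(sample, n=1000, chunk_size=1024)

# Assumed usage inside the loop above (not run here):
#   body = s3Client.get_object(Bucket=inputBucketName, Key=item)['Body']
#   head = first_n_lines(body)
#   s3Client.put_object(Bucket=target_bucket_name, Key=item, Body=head)
```

With this shape, nothing is written to the Lambda's local disk and only the start of each 800MB object is pulled into memory. Whether that resolves the memory error in the actual function is untested here.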