
I have a list of dictionaries that represents a CSV file, and I would like to write them to S3; however, I am getting a memory error. Here is my code:

import csv
import io

import boto3

s3 = boto3.client("s3")  # S3 client used for the upload below

dicts = [] # populated with about 1,000,000 dictionaries representing a CSV
f = io.StringIO()
writer = csv.DictWriter(f, fieldnames=dicts[0].keys())
writer.writeheader()
            
for k in dicts:
    writer.writerow(k)
            
print("Writing to S3...")
response = s3.upload_fileobj(Bucket='mybucket', Key=f"key.csv", Fileobj=f.getvalue())
f.close()

However, when I run this I get the following error:

[ERROR] MemoryError
Traceback (most recent call last):
  File "/var/task/lambda_function.py", line 85, in lambda_handler
    response = s3.upload_fileobj(Bucket='mybucket', Key=f"key.csv", Fileobj=f.getvalue())

How can I go about writing this to S3 in a more memory-efficient way? The CSV is about 400 MB and has around 1,000,000 rows.

EDIT:

I have the maximum amount of memory available; here is the report from Lambda:

REPORT RequestId: c8f651cf-9869-4217-921f-52edcf577234  
Duration: 123484.03 ms  
Billed Duration: 123485 ms  
Memory Size: 10240 MB   
Max Memory Used: 10043 MB   
Init Duration: 453.23 ms    

I have run a memory profiler, and, unsurprisingly, the vast majority of the memory is used writing to f and by f.getvalue().

EDIT:

Here is the full Lambda function code:

for i in event['files']:
    try:
        file = s3.get_object(Bucket="incomingbucket", Key=i)
        print(file)
    except Exception as e:
        print(e, i)

    file_id = str(uuid.uuid4())
    jsonRootLs = i.split(".")
    if len(jsonRootLs) > 1:
        jsonRoot = '.'.join(j for j in jsonRootLs[0:len(jsonRootLs)-1])
        jsonFileName = f"{jsonRoot}.json"
    else:
        jsonRoot = jsonRootLs[0]
        jsonFileName = f"{jsonRoot}.json"
        
    mapper = s3.get_object(Key=jsonFileName, Bucket='slm-addressfile-incoming')
    mapperJSON = json.loads(mapper['Body'].read().decode('utf-8'))

    dicts = modelerFile(file, mapperJSON)
    for j in dicts:
        j['mail_filename'] = i
        j['file_id'] = file_id
    dictsToSend.extend(dicts)
    print("Records added to list")
        
    f = io.StringIO()
    writer = csv.DictWriter(f, fieldnames=dicts[0].keys())
    writer.writeheader()
    
    for k in dicts:
        writer.writerow(k)
    
    print("Writing to S3...")
    response = s3.upload_fileobj(Bucket='slm-test-bucket-transactional', Key=f"{jsonRoot}.csv", Fileobj=f.getvalue())
    f.close()

# Function to re-map columns
def modelerFile(file, mapperjson):
    NCOAFields = mapperjson['mappings']
    lines1 = []
    for line in file['Body'].iter_lines():
        lines1.append(line.decode('utf-8', errors='ignore'))

    fieldnames = lines1[0].replace('"','').split(',')
    jlist1 = (dict(row) for row in csv.DictReader(lines1[1:], fieldnames))
    
    dicts = []
    for i in jlist1:
        d = {}
        metadata = {}
        for k, v in i.items():
            if k in NCOAFields:
                d[NCOAFields[k]] = v
            else:
                metadata[k] = v
        if len(metadata) > 0:
            d['metadata'] = metadata
        d['individual_id'] = str(uuid.uuid4())
        dicts.append(d)
        
    del jlist1

    return dicts

Basically, it reads a CSV from S3, along with a JSON mapping file that is used to rename the columns to our destination schema.

DBA108642
  • What are the memory settings on the Lambda function currently? Have you tried simply increasing the memory available? https://aws.amazon.com/about-aws/whats-new/2020/12/aws-lambda-supports-10gb-memory-6-vcpu-cores-lambda-functions/ – Mark B Feb 16 '21 at 18:46
  • Yes, I have the maximum amount of memory; I will update the post. – DBA108642 Feb 16 '21 at 19:02
  • Uhhh, I'm skeptical that the file size is your problem. Your file is 400 MB, your Lambda memory is 10 GB... that means a 25x difference. In other words, there are 9.6 GB of RAM unaccounted for. That's a lot. This seems like a memory leak. – MyStackRunnethOver Feb 16 '21 at 20:47
  • @MyStackRunnethOver I will update the post with the full function code – DBA108642 Feb 16 '21 at 20:50
  • What is `dictsToSend`? It only appears once and you don't do anything with it – MyStackRunnethOver Feb 17 '21 at 21:12

1 Answer


I can't find anything in the code that should obviously be taking up a ton of memory (in particular, nothing that holds on to memory across for-loop iterations without releasing it in between). You're closing the StringIO virtual file, which would otherwise be my prime suspect.

Given what you've said about memory profiling, here are possible solutions:

  1. Change

response = s3.upload_fileobj(..., Fileobj=f.getvalue())

to

response = s3.upload_fileobj(..., Fileobj=f)

This should avoid making a copy of the buffer (f) as a string in memory. This will take a single significant chunk out of memory usage - it may or may not be enough (a sketch of this approach follows the list).

  2. Refactor your code to stream your data - specifically, most of your collections are created, iterated through once, and then never used again. Instead, you could operate entry by entry across your data, applying all of your transforms to each record one at a time. Unless you use a multipart upload you'll still need to hold all of your data in memory before uploading it to S3, but this should still reduce memory usage (see the streaming sketch below the list).

  3. (This is a bit of a nuclear option.) At the end of your main loop, set your variables to None and trigger garbage collection (a short sketch appears at the end of this answer).
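Here is a minimal, self-contained sketch of option 1. The bucket, key, and the tiny dicts list are placeholders, and there is one caveat: boto3's upload_fileobj expects a binary file-like object, so this sketch writes the CSV into a BytesIO through a TextIOWrapper rather than a StringIO, and rewinds the buffer before uploading.

import csv
import io

import boto3

s3 = boto3.client("s3")
dicts = [{"a": 1, "b": 2}]  # placeholder for the real list of row dicts

# Build the CSV in a single binary buffer; this is the only full copy of
# the data held in memory - no extra string is created by f.getvalue().
buf = io.BytesIO()
text = io.TextIOWrapper(buf, encoding="utf-8", newline="")
writer = csv.DictWriter(text, fieldnames=dicts[0].keys())
writer.writeheader()
for row in dicts:
    writer.writerow(row)
text.flush()
buf.seek(0)  # rewind so upload_fileobj reads from the start

# upload_fileobj streams the buffer to S3 in chunks rather than
# materialising it as one giant string first.
s3.upload_fileobj(Fileobj=buf, Bucket="mybucket", Key="key.csv")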

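And here is a rough sketch of the streaming idea in option 2, using the smart_open library (the route the comments below report worked). The transport_params={"client": ...} form assumes smart_open 5.x or newer; the bucket, key, and dicts are again placeholders.

import csv

import boto3
from smart_open import open as s3_open  # pip install "smart_open[s3]"

s3 = boto3.client("s3")
dicts = [{"a": 1, "b": 2}]  # placeholder for the real list of row dicts

# smart_open performs a multipart upload under the hood, so only one
# buffered part is held in memory at a time instead of the whole file.
with s3_open("s3://mybucket/key.csv", "w",
             transport_params={"client": s3}) as fout:
    writer = csv.DictWriter(fout, fieldnames=dicts[0].keys())
    writer.writeheader()
    for row in dicts:
        writer.writerow(row)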
I would prefer 1 and/or 2 to 3. If 3 does work, I would be suspicious that something else is going wrong.
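For completeness, a short sketch of the option 3 cleanup at the end of each iteration of the main loop, using the variable names from the question.

import gc

# At the tail of each per-file iteration: drop the large intermediates
# and force a collection pass before processing the next file.
dicts = None
f = None
gc.collect()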

MyStackRunnethOver
  • Thanks for the suggestions, I will definitely try these. – DBA108642 Feb 18 '21 at 14:28
  • (Great name btw.) The smart_open route did work for me; I'm only using about 5 GB of memory as opposed to blowing up at 10 GB. The 400 MB file becomes 1.5 GB in size after transforming it, so I guess that is just the way it is. – DBA108642 Feb 18 '21 at 18:57