I want to upsert records from a large gzipped CSV file. I am using the chunk generator described in this answer:
def gen_chunks(reader, chunksize=100):
    """
    Chunk generator. Take a CSV `reader` and yield
    `chunksize` sized slices.
    """
    chunk = []
    for i, line in enumerate(reader):
        if (i % chunksize == 0 and i > 0):
            yield chunk
            del chunk[:]
        chunk.append(line)
    yield chunk
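Just to illustrate how I expect this generator to behave, here is a quick check with a plain range instead of a CSV reader (the numbers are only placeholders, not rows from my real file):

# Sanity check of gen_chunks with a plain iterable instead of a CSV reader;
# the values here are placeholders, not my actual data.
for chunk in gen_chunks(range(7), chunksize=3):
    print(chunk)  # [0, 1, 2], then [3, 4, 5], then [6]

Note that the same list is cleared in place after each yield, so each chunk has to be consumed before the next one is requested (which the loop above does).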
I run the mongod daemon with the following command:
$ mongod --dbpath data\db
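The script connects to this instance with pymongo along these lines (the database and collection names below are only illustrative; `locations` is the collection used in the snippet that follows):

from pymongo import MongoClient

# Connection setup assumed by the script below; the database/collection
# names are illustrative, `locations` is the collection the upserts go into.
client = MongoClient('localhost', 27017)
locations = client['opencellid']['locations']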
Then I start a Python script that uses pymongo:
import csv
import gzip

# `filepath` points to the csv.gz file; `locations` is the collection from above
with gzip.open(filepath, 'rt', newline='') as gzip_file:
    dr = csv.DictReader(gzip_file)  # comma is the default delimiter
    chunksize = 10 ** 3
    for chunk in gen_chunks(dr, chunksize):
        bulk = locations.initialize_ordered_bulk_op()
        for row in chunk:
            cell = {
                'mcc': int(row['mcc']),
                'mnc': int(row['net']),
                'lac': int(row['area']),
                'cell': int(row['cell'])
            }
            location = {
                'lat': float(row['lat']),
                'lon': float(row['lon'])
            }
            bulk.find(cell).upsert().update({'$set': {'OpenCellID': location}})
        result = bulk.execute()
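As a side note, I am aware that initialize_ordered_bulk_op has been deprecated in newer pymongo releases; if I am not mistaken, the same upserts could be written with the current bulk_write API roughly like this (same chunk and locations as above):

from pymongo import UpdateOne

# Rough equivalent of the inner loop above using the newer bulk API;
# `chunk` and `locations` are the same objects as in the snippet above.
requests = []
for row in chunk:
    cell = {
        'mcc': int(row['mcc']),
        'mnc': int(row['net']),
        'lac': int(row['area']),
        'cell': int(row['cell'])
    }
    location = {'lat': float(row['lat']), 'lon': float(row['lon'])}
    requests.append(UpdateOne(cell, {'$set': {'OpenCellID': location}}, upsert=True))
result = locations.bulk_write(requests, ordered=True)

(I have not verified whether switching APIs changes the memory behaviour.)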
While the script runs, the RAM used by the mongod process keeps increasing (sorry, the screenshot is in my native language; RAM is the third column):

After the script finished (upserting about 30 million documents), the memory used by mongod had reached about 15 GB!
What am I doing wrong or misunderstanding?
P.S. After restarting the daemon, its RAM usage drops back to normal (about 30 MB).