
I want to upsert records from a large gzipped CSV file. I am using a generator of chunks, as described in this answer:

def gen_chunks(reader, chunksize=100):
    """ 
    Chunk generator. Take a CSV `reader` and yield
    `chunksize` sized slices. 
    """
    chunk = []
    for i, line in enumerate(reader):
        if (i % chunksize == 0 and i > 0):
            yield chunk
            del chunk[:]
        chunk.append(line)
    yield chunk
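
To illustrate how the generator behaves (the sample data below is made up just for this example), note that it reuses and clears the same list between yields, so each chunk has to be consumed before the next iteration:

import csv
import io

# Made-up sample data, just to exercise gen_chunks.
sample = io.StringIO(
    "mcc,net,area,cell,lat,lon\n"
    + "\n".join("262,1,%d,%d,52.5,13.4" % (i, i) for i in range(7))
)

reader = csv.DictReader(sample)
for chunk in gen_chunks(reader, chunksize=3):
    # The same list object is reused and cleared after each yield,
    # so process (or copy) the chunk before advancing the generator.
    print(len(chunk), [row['cell'] for row in chunk])
# prints chunks of 3, 3 and 1 rows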

I run the mongod daemon with the following command:

$ mongod --dbpath data\db
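
In case it matters, I have not set any cache limit. As far as I know, the WiredTiger cache (which by default can grow to roughly half of the available RAM) could be capped explicitly, for example:

$ mongod --dbpath data\db --wiredTigerCacheSizeGB 2

but I left it at the default.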

And then start a Python script that uses pymongo:

import csv
import gzip

# `locations` is a pymongo collection, created earlier with something like:
# locations = MongoClient()['opencellid']['locations']

with gzip.open(filepath, 'rt', newline='') as gzip_file:
    dr = csv.DictReader(gzip_file)  # comma is the default delimiter
    chunksize = 10 ** 3

    for chunk in gen_chunks(dr, chunksize):
        bulk = locations.initialize_ordered_bulk_op()
        for row in chunk:
            cell = {
                'mcc': int(row['mcc']),
                'mnc': int(row['net']),
                'lac': int(row['area']),
                'cell': int(row['cell'])
            }
            location = {
                'lat': float(row['lat']),
                'lon': float(row['lon'])
            }
            # Upsert: match on the cell identifiers and set the location.
            bulk.find(cell).upsert().update({'$set': {'OpenCellID': location}})
        result = bulk.execute()
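
In case the PyMongo version matters: as far as I know, the initialize_ordered_bulk_op API has been deprecated and later removed in favour of Collection.bulk_write, so the loop above would look roughly like this with the newer API (assuming the same locations collection):

from pymongo import UpdateOne

for chunk in gen_chunks(dr, chunksize):
    requests = []
    for row in chunk:
        cell = {
            'mcc': int(row['mcc']),
            'mnc': int(row['net']),
            'lac': int(row['area']),
            'cell': int(row['cell'])
        }
        location = {'lat': float(row['lat']), 'lon': float(row['lon'])}
        # Upsert: insert the document if no cell matches, otherwise update it.
        requests.append(UpdateOne(cell, {'$set': {'OpenCellID': location}}, upsert=True))
    result = locations.bulk_write(requests, ordered=True)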

Then the RAM used by the mongod process keeps increasing (sorry for my native language in the screenshot; RAM is the third column):

[screenshot of the task manager showing mongod's memory usage]

After the script completes (upserting about 30 million documents), the memory used by mongod reaches about 15 GB!

What am I doing wrong or misunderstanding?

P.S. After restarting the daemon, the RAM usage drops back to normal (about 30 MB).
