
I have to save a significantly large Python data structure, consisting of lists and dictionaries, into a MySQL database, but I get a memory exception during the save operation.

I have already benchmarked the save operation and also tried different ways of dumping the data, including binary formats, but all methods seemed to consume a lot of memory. Benchmarks below:

Max memory usage during JSON save: 966.83 MB

                                  json          pickle        msgpack
Size after dumping                81.03 MB      66.79 MB      33.83 MB
Compression (dump) time           5.12 s        11.17 s       0.27 s
Decompression (load) time         2.57 s        1.66 s        0.52 s
Compression max memory usage      840.84 MB     1373.30 MB    732.67 MB
Decompression max memory usage    921.41 MB     1481.25 MB    1006.12 MB

msgpack seems to be the most performant library, but its decompression takes up a lot of memory too. I also tried hickle, which is said to consume little memory, but the final size ended up being 800 MB.
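For reference, here is a minimal sketch of how such measurements could be reproduced. This is an illustration rather than the exact script used for the numbers above; it assumes the data is the results structure from the code further down and approximates peak memory with the standard-library tracemalloc module, which tracks Python allocations rather than process RSS:

import json
import pickle
import time
import tracemalloc

def benchmark_dump(label, dump_fn):
    # Measure peak Python memory, wall time and output size for one dump call.
    tracemalloc.start()
    start = time.perf_counter()
    blob = dump_fn()
    elapsed = time.perf_counter() - start
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    print(f"{label}: {len(blob) / 1e6:.2f} MB, {elapsed:.2f} s, peak {peak / 1e6:.2f} MB")
    return blob

results = run_calculation()  # the same data structure as in the model code below
benchmark_dump("json", lambda: json.dumps(results).encode("utf-8"))
benchmark_dump("pickle", lambda: pickle.dumps(results))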

Does anyone have a suggestion? Should I just increase the memory limit? Can MongoDB handle the save operation with less memory?

Find the stack trace below:

Traceback (most recent call last):
  File "/opt/python/bundle/32/app/web_platform/kernel/kernel_worker/web_platform/call_kernel.py", line 139, in start_simulation
    simulation_job_object.save()
  File "/opt/python/bundle/32/app/web_platform/kernel/kernel_worker/web_platform/models.py", line 172, in save
    self.clean_fields()
  File "/opt/python/run/venv/local/lib/python3.6/site-packages/django/db/models/base.py", line 1223, in clean_fields
    setattr(self, f.attname, f.clean(raw_value, self))
  File "/opt/python/run/venv/local/lib/python3.6/site-packages/django/db/models/fields/__init__.py", line 630, in clean
    self.validate(value, model_instance)
  File "/opt/python/run/venv/local/lib/python3.6/site-packages/jsonfield/fields.py", line 54, in validate
    self.get_prep_value(value)
  File "/opt/python/bundle/32/app/web_platform/kernel/kernel_worker/web_platform/models.py", line 156, in get_prep_value
    return json.dumps(value, **self.encoder_kwargs)
  File "/usr/lib64/python3.6/json/__init__.py", line 231, in dumps
    return _default_encoder.encode(obj)
  File "/usr/lib64/python3.6/json/encoder.py", line 202, in encode
    return ''.join(chunks)
MemoryError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/python/run/venv/local/lib/python3.6/site-packages/background_task/tasks.py", line 43, in bg_runner
    func(*args, **kwargs)
  File "/opt/python/bundle/32/app/web_platform/kernel/kernel_worker/web_platform/call_kernel.py", line 157, in start_simulation
    simulation_job_object.save()
  File "/opt/python/bundle/32/app/web_platform/kernel/kernel_worker/web_platform/models.py", line 172, in save
    self.clean_fields()
  File "/opt/python/run/venv/local/lib/python3.6/site-packages/django/db/models/base.py", line 1223, in clean_fields
    setattr(self, f.attname, f.clean(raw_value, self))
  File "/opt/python/run/venv/local/lib/python3.6/site-packages/django/db/models/fields/__init__.py", line 630, in clean
    self.validate(value, model_instance)
  File "/opt/python/run/venv/local/lib/python3.6/site-packages/jsonfield/fields.py", line 54, in validate
    self.get_prep_value(value)
  File "/opt/python/bundle/32/app/web_platform/kernel/kernel_worker/web_platform/models.py", line 156, in get_prep_value
    return json.dumps(value, **self.encoder_kwargs)
  File "/usr/lib64/python3.6/json/__init__.py", line 231, in dumps
    return _default_encoder.encode(obj)
  File "/usr/lib64/python3.6/json/encoder.py", line 199, in encode
    chunks = self.iterencode(o, _one_shot=True)
  File "/usr/lib64/python3.6/json/encoder.py", line 257, in iterencode
    return _iterencode(o, 0)
MemoryError

Example of my code:

class Calculation(Model):
    name = db_models.CharField(max_length=120)
    results = JsonNullField(blank=True, null=True)

results = run_calculation()
calculation = Calculation(name="calculation", results=results)
calculation.save()
Akira Kotsugai

1 Answer


In essence, here is how I would do it to reduce memory consumption and improve performance:

  1. Load the JSON file (no way to stream it in Python, AFAIK)
  2. Chunk the array of dictionaries into smaller chunks
  3. Convert each chunk into model objects
  4. Call bulk_create
  5. Garbage-collect after every loop iteration

import gc
import json

from myapp.models import MyModel

filename = '/path/to/data.json'

# Load the whole JSON array once; everything after this works on slices of it.
with open(filename, 'r') as f:
    data = json.load(f)

chunk_size = 100
while data:
    # Slice off the next chunk and drop it from the source list.
    chunk = data[:chunk_size]
    data = data[chunk_size:]
    # Build the model instances and insert them with a single query.
    chunk = [MyModel(**x) for x in chunk]
    MyModel.objects.bulk_create(chunk)
    # Explicitly reclaim the memory held by the processed chunk.
    gc.collect()

You can play with chunk_size to optimize performance / memory consumption.
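For example, the loop above could be wrapped in a small helper (load_in_chunks is just an illustrative name, not part of the original answer) so that different chunk sizes are easy to compare:

import gc
import json

from myapp.models import MyModel

def load_in_chunks(filename, chunk_size=100):
    # Same approach as above, parameterized by chunk_size for tuning.
    with open(filename, 'r') as f:
        data = json.load(f)
    while data:
        chunk, data = data[:chunk_size], data[chunk_size:]
        MyModel.objects.bulk_create([MyModel(**x) for x in chunk])
        gc.collect()

# Try e.g. 100 vs. 500 vs. 1000 and compare timings / peak memory.
load_in_chunks('/path/to/data.json', chunk_size=500)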

2ps
  • If I understood correctly, I would have to create a separate table just for the slices, right? Is it possible to create chunks of the final array and save them progressively, but as one entry instead of multiple entries? – Akira Kotsugai Apr 07 '20 at 12:59
  • No, no separate table is necessary. – 2ps Apr 07 '20 at 15:14
  • But you create new model objects for each element in the list, right? What I would like to achieve is saving the entire list as one model attribute, not multiple database rows. Is that possible? – Akira Kotsugai Apr 07 '20 at 16:38
  • @AkiraKotsugai: please post your code of what you have so far, and I'm happy to help optimize. – 2ps Apr 08 '20 at 15:34
  • I updated the question. Please keep in mind that I am only creating other models for the results if it is not possible to keep the same design. I was hoping that I could save one object in chunks. – Akira Kotsugai Apr 10 '20 at 11:31