
I have to save a significantly large Python data structure, consisting of lists and dictionaries, into a MySQL database, but I get a memory exception during the save operation.

I have already benchmarked the save operation and also tried different ways of dumping the data, including binary formats, but all methods seemed to consume a lot of memory. Benchmarks below:

Max memory usage during JSON save: 966.83 MB

                                  json          pickle        msgpack
Size after dumping                81.03 MB      66.79 MB      33.83 MB
Compression (dump) time           5.12 s        11.17 s       0.27 s
Decompression (load) time         2.57 s        1.66 s        0.52 s
Compression max memory usage      840.84 MB     1373.30 MB    732.67 MB
Decompression max memory usage    921.41 MB     1481.25 MB    1006.12 MB

msgpack seems to be the most performant library, but its decompression takes up a lot of memory too. I also tried hickle, which is said to consume little memory, but the final size ended up being 800 MB.
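For reference, here is a minimal sketch of how such measurements could be reproduced. This is an illustration rather than the exact script used for the numbers above; it assumes the data is the results structure from the code further down and approximates peak memory with the standard-library tracemalloc module, which tracks Python allocations rather than process RSS:

import json
import pickle
import time
import tracemalloc

def benchmark_dump(label, dump_fn):
    # Measure peak Python memory, wall time and output size for one dump call.
    tracemalloc.start()
    start = time.perf_counter()
    blob = dump_fn()
    elapsed = time.perf_counter() - start
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    print(f"{label}: {len(blob) / 1e6:.2f} MB, {elapsed:.2f} s, peak {peak / 1e6:.2f} MB")
    return blob

results = run_calculation()  # the same data structure as in the model code below
benchmark_dump("json", lambda: json.dumps(results).encode("utf-8"))
benchmark_dump("pickle", lambda: pickle.dumps(results))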

Does anyone have a suggestion? Should I just increase the memory limit? Can MongoDB handle the save operation with less memory?

Find the stack trace below:

Traceback (most recent call last):
  File "/opt/python/bundle/32/app/web_platform/kernel/kernel_worker/web_platform/call_kernel.py", line 139, in start_simulation
    simulation_job_object.save()
  File "/opt/python/bundle/32/app/web_platform/kernel/kernel_worker/web_platform/models.py", line 172, in save
    self.clean_fields()
  File "/opt/python/run/venv/local/lib/python3.6/site-packages/django/db/models/base.py", line 1223, in clean_fields
    setattr(self, f.attname, f.clean(raw_value, self))
  File "/opt/python/run/venv/local/lib/python3.6/site-packages/django/db/models/fields/__init__.py", line 630, in clean
    self.validate(value, model_instance)
  File "/opt/python/run/venv/local/lib/python3.6/site-packages/jsonfield/fields.py", line 54, in validate
    self.get_prep_value(value)
  File "/opt/python/bundle/32/app/web_platform/kernel/kernel_worker/web_platform/models.py", line 156, in get_prep_value
    return json.dumps(value, **self.encoder_kwargs)
  File "/usr/lib64/python3.6/json/__init__.py", line 231, in dumps
    return _default_encoder.encode(obj)
  File "/usr/lib64/python3.6/json/encoder.py", line 202, in encode
    return ''.join(chunks)
MemoryError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/python/run/venv/local/lib/python3.6/site-packages/background_task/tasks.py", line 43, in bg_runner
    func(*args, **kwargs)
  File "/opt/python/bundle/32/app/web_platform/kernel/kernel_worker/web_platform/call_kernel.py", line 157, in start_simulation
    simulation_job_object.save()
  File "/opt/python/bundle/32/app/web_platform/kernel/kernel_worker/web_platform/models.py", line 172, in save
    self.clean_fields()
  File "/opt/python/run/venv/local/lib/python3.6/site-packages/django/db/models/base.py", line 1223, in clean_fields
    setattr(self, f.attname, f.clean(raw_value, self))
  File "/opt/python/run/venv/local/lib/python3.6/site-packages/django/db/models/fields/__init__.py", line 630, in clean
    self.validate(value, model_instance)
  File "/opt/python/run/venv/local/lib/python3.6/site-packages/jsonfield/fields.py", line 54, in validate
    self.get_prep_value(value)
  File "/opt/python/bundle/32/app/web_platform/kernel/kernel_worker/web_platform/models.py", line 156, in get_prep_value
    return json.dumps(value, **self.encoder_kwargs)
  File "/usr/lib64/python3.6/json/__init__.py", line 231, in dumps
    return _default_encoder.encode(obj)
  File "/usr/lib64/python3.6/json/encoder.py", line 199, in encode
    chunks = self.iterencode(o, _one_shot=True)
  File "/usr/lib64/python3.6/json/encoder.py", line 257, in iterencode
    return _iterencode(o, 0)
MemoryError

Example of my code:

class Calculation(Model):
    name = db_models.CharField(max_length=120)
    results = JsonNullField(blank=True, null=True)

results = run_calculation()
calculation = Calculation(name="calculation", results=results)
calculation.save()
Akira Kotsugai

1 Answer


In essence, here is how I would do it to reduce memory consumption and improve performance:

  1. Load the JSON file (no way to stream it in Python, AFAIK)
  2. Chunk the array of dictionaries into smaller chunks
  3. Convert each chunk into model objects
  4. Call bulk_create
  5. Garbage-collect after every loop iteration

import gc
import json

from myapp.models import MyModel

filename = '/path/to/data.json'

# Load the whole JSON array once; everything after this works on slices of it.
with open(filename, 'r') as f:
    data = json.load(f)

chunk_size = 100
while data:
    # Slice off the next chunk and drop it from the source list.
    chunk = data[:chunk_size]
    data = data[chunk_size:]
    # Build the model instances and insert them with a single query.
    chunk = [MyModel(**x) for x in chunk]
    MyModel.objects.bulk_create(chunk)
    # Explicitly reclaim the memory held by the processed chunk.
    gc.collect()

You can play with chunk_size to optimize performance / memory consumption.
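For example, the loop above could be wrapped in a small helper (load_in_chunks is just an illustrative name, not part of the original answer) so that different chunk sizes are easy to compare:

import gc
import json

from myapp.models import MyModel

def load_in_chunks(filename, chunk_size=100):
    # Same approach as above, parameterized by chunk_size for tuning.
    with open(filename, 'r') as f:
        data = json.load(f)
    while data:
        chunk, data = data[:chunk_size], data[chunk_size:]
        MyModel.objects.bulk_create([MyModel(**x) for x in chunk])
        gc.collect()

# Try e.g. 100 vs. 500 vs. 1000 and compare timings / peak memory.
load_in_chunks('/path/to/data.json', chunk_size=500)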

2ps
  • If I understood correctly, I would have to create a separate table just for the slices, right? Is it possible to create chunks of the final array and save them progressively, but as one entry instead of multiple entries? – Akira Kotsugai Apr 07 '20 at 12:59
  • No, no separate table is necessary. – 2ps Apr 07 '20 at 15:14
  • But you create new model objects for each element in the list, right? What I would like to achieve is saving the entire list as one model attribute, not multiple database rows. Is that possible? – Akira Kotsugai Apr 07 '20 at 16:38
  • @AkiraKotsugai: please post your code of what you have so far, and I'm happy to help optimize. – 2ps Apr 08 '20 at 15:34
  • I updated the question. Please keep in mind that I am only creating other models for the results if it is not possible to keep the same design. I was hoping that I could save one object in chunks. – Akira Kotsugai Apr 10 '20 at 11:31