
When inserting a huge pandas DataFrame into SQLite via SQLAlchemy and pandas `to_sql` with a specified `chunksize`, I would get memory errors.

At first I thought it was an issue with `to_sql`, but I tried a workaround where, instead of using `chunksize`, I used `for i in range(100): df.iloc[i * 100000:(i + 1) * 100000].to_sql(...)`, and that still resulted in an error.
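
Spelled out, the workaround looked roughly like this (the chunk size, range bound, and table name are illustrative; `df` and `engine` stand for the DataFrame and SQLAlchemy engine from the reproduction code below):

# manual chunking instead of passing chunksize= to to_sql
chunk = 100000
for i in range(100):
    df.iloc[i * chunk:(i + 1) * chunk].to_sql(
        'test_table', engine, index=False, if_exists='append')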

It seems that, under certain conditions, there is a memory leak with repeated insertions into SQLite via SQLAlchemy.

I had a hard time replicating, in a minimal example, the memory leak that occurred when converting my data, but this gets pretty close.

import string
import numpy as np
import pandas as pd
from random import randint

def make_random_str_array(size=10, num_rows=100, chars=string.ascii_uppercase + string.digits):
    # build num_rows random strings, each `size` characters long
    return (np.random.choice(list(chars), num_rows*size)
            .view('|U{}'.format(size)))

def alt(size, num_rows):
    # two columns of random strings of the given size
    data = make_random_str_array(size, num_rows=2*num_rows).reshape(-1, 2)
    return pd.DataFrame(data)

dfAll = alt(randint(1000, 2000), 10000)

# grow the DataFrame to ~3.3 million rows of random strings of varying lengths
for i in range(330):
    print('step ', i)
    df = alt(randint(1000, 2000), 10000)
    dfAll = pd.concat([df, dfAll])

import sqlalchemy

engine = sqlalchemy.create_engine('sqlite:///testtt.db')

# insert the DataFrame in 10,000-row chunks, cycling through it repeatedly
for i in range(500):
    print('step', i)
    dfAll.iloc[(i % 330) * 10000:((i % 330) + 1) * 10000].to_sql(
        'test_table22', engine, index=False, if_exists='append')

This was run in a Google Colab CPU environment.

The database itself isn't causing the memory leak, because I can restart my environment and the previously inserted data is still there, and connecting to that database doesn't cause an increase in memory. The issue seems to be, under certain conditions, repeated insertions via looping `to_sql`, or a single `to_sql` call with `chunksize` specified.

Is there a way that this code could be run without causing an eventual increase in memory usage?

Edit:

To fully reproduce the error, run this notebook

https://drive.google.com/open?id=1ZijvI1jU66xOHkcmERO4wMwe-9HpT5OS

The notebook requires you to import this folder into the main directory of your Google Drive

https://drive.google.com/open?id=1m6JfoIEIcX74CFSIQArZmSd0A8d0IRG8

The notebook will also mount your Google Drive; you need to give it authorization to access your Google Drive. Since the data is hosted on my Google Drive, importing it should not take up any of your allocated storage.

SantoshGupta7
  • Possibly related: [Why doesn't Python release the memory when I delete a large object?](http://effbot.org/pyfaq/why-doesnt-python-release-the-memory-when-i-delete-a-large-object.htm). – unutbu Jun 07 '19 at 12:37
  • [Alex Martelli says](http://stackoverflow.com/questions/1316767/how-can-i-explicitly-free-memory-in-python/1316799#1316799), "The only really reliable way to ensure that a large but temporary use of memory DOES return all resources to the system when it's done, is to have that use happen in a subprocess, which does the memory-hungry work then terminates." – unutbu Jun 07 '19 at 12:38
  • @unutbu very relevant! I am wondering if `multiprocessing` is still the best way to make a subprocess. https://stackoverflow.com/questions/56498356/is-using-multiprocessing-still-the-easiest-way-to-use-make-a-subprocess-that-r – SantoshGupta7 Jun 07 '19 at 16:48
  • Yes, you could try wrapping your memory-hungry code in a function, then using `multiprocessing` to call that function in a separate process. There is an [example here](https://stackoverflow.com/q/28516828/190597). – unutbu Jun 07 '19 at 17:05
  • @unutbu that wouldn't explain this situation, since the phenomenon referenced there wouldn't cause a memory error. This may be a memory leak in the `to_sql` method – juanpa.arrivillaga Jun 07 '19 at 17:33
  • @juanpa.arrivillaga indeed, but @unutbu's fix allowed me to process the database without a memory error. It works since it's writing data to a database file, and not a Python variable (it seems that multiprocessing can't change global variables, at least not easily). I agree that there is likely a memory leak with `to_sql`; if not that, then sqlite or sqlalchemy. – SantoshGupta7 Jun 07 '19 at 18:41
  • @SantoshGupta7 well, multiprocessing creates *separate python processes*, so there is no shared state without some work to do so. In general, though, you should avoid shared state as much as possible if using multiprocessing. – juanpa.arrivillaga Jun 07 '19 at 18:59
  • @juanpa.arrivillaga: I'm not convinced there is a memory leak. `to_sql` may be converting the DataFrame into a Python list of lists. The [effbot link](http://effbot.org/pyfaq/why-doesnt-python-release-the-memory-when-i-delete-a-large-object.htm) explains why the Python process's memory grows and never decreases when that happens. – unutbu Jun 07 '19 at 19:26
  • @unutbu yes, *but that wouldn't explain the memory error*. The whole point of that post is to explain how while it seems your program is utilizing that much memory, that memory is still available *to the process* even if from the perspective of the OS it is consuming it: "However, it’s important to remember that memory that’s returned to a given allocator will be reused by that allocator, even if it’s not returned to the system." – juanpa.arrivillaga Jun 07 '19 at 19:28
  • @juanpa.arrivillaga: I haven't been able to reproduce the OP's problem, so I'm not really sure what is causing the memory error. When I run the OP's code, my machine bogs down just building the DataFrame. The memory is increasing as the DataFrame is built, but that is completely expected. If I reduce the size of the DataFrame, the memory usage plateaus (again as expected) and I do not see a memory leak during the calls to `to_sql`. – unutbu Jun 08 '19 at 15:48
  • I got the memory increase from the code above when running on Google Colab, though I wasn't able to figure out how to reproduce the exact, significant memory increase that I got with my data. `to_sql` is the main issue, since when I run it in a `multiprocessing` process there is no memory leak. If you would like, I can provide a Colab notebook and a link to the data I was using. You would need to import the data to your Google Drive (it doesn't use any of your storage since it's hosted on my drive) and the notebook will need to mount your drive. Let me know if you're interested and I'll make it. – SantoshGupta7 Jun 08 '19 at 16:28
  • Sure! Please post the information. No promises on if/when I can figure out the source of the memory error, but posting the info will help everyone interested in thinking about this problem. – unutbu Jun 08 '19 at 17:46
  • I updated the original post to include the colab notebook and data you'll need to import to the main directory of your google drive. – SantoshGupta7 Jun 08 '19 at 19:37

1 Answer


The Google Colab instance starts with about 12.72GB of RAM available. After creating the DataFrame, theBigList, about 9.99GB of RAM has been used. Already this is a rather uncomfortable situation to be in, since it is not unusual for Pandas operations to require as much additional space as the DataFrame they are operating on. So we should strive to avoid using even this much RAM if possible, and fortunately there is an easy way to do this: simply load the .npy files one at a time and store their data in the sqlite database, without ever creating theBigList (see below).

However, if we use the code you posted, we can see that the RAM usage slowly increases as chunks of theBigList are stored in the database iteratively.

theBigList DataFrame stores the strings in a NumPy array. But in the process of transferring the strings to the sqlite database, the NumPy strings are converted into Python strings. This takes additional memory.

Per this Theano tutorial, which discusses Python's internal memory management:

To speed-up memory allocation (and reuse) Python uses a number of lists for small objects. Each list will contain objects of similar size: there will be a list for objects 1 to 8 bytes in size, one for 9 to 16, etc. When a small object needs to be created, either we reuse a free block in the list, or we allocate a new one.

... The important point is that those lists never shrink.

Indeed: if an item (of size x) is deallocated (freed by lack of reference) its location is not returned to Python’s global memory pool (and even less to the system), but merely marked as free and added to the free list of items of size x. The dead object’s location will be reused if another object of compatible size is needed. If there are no dead objects available, new ones are created.

If small objects memory is never freed, then the inescapable conclusion is that, like goldfishes, these small object lists only keep growing, never shrinking, and that the memory footprint of your application is dominated by the largest number of small objects allocated at any given point.

I believe this accurately describes the behavior you are seeing as this loop executes:

for i in range(0, 588):
    theBigList.iloc[i*10000:(i+1)*10000].to_sql(
        'CS_table', engine, index=False, if_exists='append')

Even though many dead objects' locations are being reused for new strings, it is not implausible with essentially random strings such as those in theBigList that extra space will occasionally be needed and so the memory footprint keeps growing.

The process eventually hits Google Colab's 12.72GB RAM limit and the kernel is killed with a memory error.
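
If you want to watch this happen, here is a rough way to log the process's resident memory during that loop (a sketch; it assumes `psutil` is available, which it typically is on Colab, and uses the theBigList and engine from your notebook):

import os
import psutil

proc = psutil.Process(os.getpid())

for i in range(0, 588):
    theBigList.iloc[i*10000:(i+1)*10000].to_sql(
        'CS_table', engine, index=False, if_exists='append')
    if i % 50 == 0:
        # current resident set size of this process, in MiB
        print('step {}: {:.0f} MiB'.format(i, proc.memory_info().rss / 2**20))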


In this case, the easiest way to avoid large memory usage is to never instantiate the entire DataFrame -- instead, just load and process small chunks of the DataFrame one at a time:

import numpy as np
import pandas as pd
import matplotlib.cbook as mc
import sqlalchemy as SA

def load_and_store(dbpath):
    engine = SA.create_engine("sqlite:///{}".format(dbpath))    
    for i in range(0, 47):
        print('step {}: {}'.format(i, mc.report_memory()))                
        for letter in list('ABCDEF'):
            path = '/content/gdrive/My Drive/SummarizationTempData/CS2Part{}{:02}.npy'.format(letter, i)
            comb = np.load(path, allow_pickle=True)
            # keep only the title and abstract columns, as strings
            toPD = pd.DataFrame(comb).drop([0, 2, 3], axis=1).astype(str)
            toPD.columns = ['title', 'abstract']
            toPD = toPD.loc[toPD['abstract'] != '']
            toPD.to_sql('CS_table', engine, index=False, if_exists='append')

dbpath = '/content/gdrive/My Drive/dbfile/CSSummaries.db'
load_and_store(dbpath)

which prints

step 0: 132545
step 1: 176983
step 2: 178967
step 3: 181527
...         
step 43: 190551
step 44: 190423
step 45: 190103
step 46: 190551

The last number on each line is the amount of memory consumed by the process as reported by matplotlib.cbook.report_memory. There are a number of different measures of memory usage. On Linux, mc.report_memory() is reporting the size of the physical pages of the core image of the process (including text, data, and stack space).
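
If `matplotlib.cbook.report_memory` is not available in your matplotlib version (it is deprecated in newer releases), a rough stdlib-only equivalent on Linux (a sketch) is to read VmRSS from /proc:

def report_rss_kb():
    # current resident set size of this process, in KiB (Linux only)
    with open('/proc/self/status') as f:
        for line in f:
            if line.startswith('VmRSS:'):
                return int(line.split()[1])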


By the way, another basic trick you can use to manage memory is to use functions. Local variables inside a function are deallocated when the function terminates. This relieves you of the burden of manually calling `del` and `gc.collect()`.
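
And if you need the memory returned to the operating system as well, the reliable way (as mentioned in the comments on the question) is to run the memory-hungry work in a separate process; a sketch using `multiprocessing` with the load_and_store function above:

import multiprocessing as mp

if __name__ == '__main__':
    # when the child process exits, all of its memory (including the
    # grown free lists) is returned to the OS
    p = mp.Process(target=load_and_store, args=(dbpath,))
    p.start()
    p.join()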

unutbu
  • Wow this is amazing, and explains why I wasn't able to create a minimal example. "it is not implausible with essentially random strings such as those in theBigList that extra space will occasionally be needed and so the memory footprint keeps growing." Because my randomly generated strings were all the same size. I had some success in recreating the issue when I varied the string sizes, but there still was not as much variation as in my data, so the issue was not fully reproduced. – SantoshGupta7 Jun 09 '19 at 03:54
  • "it is not implausible with essentially random strings such as those in theBigList that extra space will occasionally be needed and so the memory footprint keeps growing." So theoretically, if I inserted into the first chunk (10000 rows in the case of my example code) of my dataframe strings that were larger than anything in my dataframe, then the error would not occur? Because no new lists will need to be created, so every chunk afterwards uses less space? – SantoshGupta7 Jun 09 '19 at 04:01
  • There are free lists for objects of different sizes. So even if you made 10K rows of large strings to "prime" the free lists, it would not satisfy the need for more space if smaller strings later need to be allocated. So unfortunately, that would not stop memory usage from growing. – unutbu Jun 09 '19 at 11:54
  • I see. This issue would still exist, to a lesser degree, in your code, right? `toPD.to_sql('CS_table', engine, index=False, if_exists='append')`, `toPD = pd.DataFrame(comb).drop([0, 2, 3], axis=1).astype(str)`, and `comb = np.load(path, allow_pickle=True)` are still going to need different sized lists, right? But since they each have much fewer rows, it is more likely that the lists used for the objects will be similar in size, and thus there is a higher chance of the same lists being reused? – SantoshGupta7 Jun 09 '19 at 13:39
  • Yes, correct. The growth in free lists still occurs, but is mainly a non-issue because we are well under the RAM limit as long as `theBigList` is not materialized all at once. – unutbu Jun 09 '19 at 13:47
  • If you really need to control the growth of free lists, then pushing the memory-hogging parts into a function, then using `multiprocessing` to call that function in a separate process is the answer. (The memory is freed to the OS when the separate process terminates.) – unutbu Jun 09 '19 at 13:51
  • Earlier it was mentioned "By the way, another basic trick you can use to manage memory is to use functions. Local variables inside a function are deallocated when the function terminates." Is there a difference between the memory released when just a function is used and terminates, vs. the memory released when that function is run within a `multiprocessing` process that terminates? – SantoshGupta7 Jun 09 '19 at 14:00
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/194662/discussion-between-unutbu-and-santoshgupta7). – unutbu Jun 09 '19 at 14:01