When inserting a huge pandas dataframe into SQLite via SQLAlchemy and pandas to_sql
with a specified chunksize, I would get memory errors.
At first I thought it was an issue with to_sql
itself, but I tried a workaround where, instead of using chunksize, I inserted the slices myself with for i in range(100): df.iloc[i * 100000:(i + 1) * 100000].to_sql(...)
and that still resulted in an error.
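Spelled out, that workaround looked roughly like this (a sketch: the engine URL, table name, and to_sql arguments are placeholders, and df stands for the large dataframe):

import pandas as pd
from sqlalchemy import create_engine

engine = create_engine('sqlite:///example.db')  # placeholder database path
chunk = 100000
for i in range(100):
    # Slice the dataframe manually instead of passing chunksize= to to_sql.
    df.iloc[i * chunk:(i + 1) * chunk].to_sql(
        'some_table', engine, index=False, if_exists='append')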
It seems that, under certain conditions, there is a memory leak with repeated insertions into SQLite via SQLAlchemy.
I had a hard time replicating, in a minimal example, the memory leak that occurred when converting my data. But this gets pretty close.
import string
import numpy as np
import pandas as pd
from random import randint

def make_random_str_array(size=10, num_rows=100, chars=string.ascii_uppercase + string.digits):
    # Draw num_rows*size single characters, then reinterpret the flat
    # array as num_rows fixed-width strings of length `size`.
    return (np.random.choice(list(chars), num_rows * size)
            .view('|U{}'.format(size)))

def alt(size, num_rows):
    # Two columns of random strings, each `size` characters long.
    data = make_random_str_array(size, num_rows=2 * num_rows).reshape(-1, 2)
    return pd.DataFrame(data)

# Build a large dataframe (~3.3 million rows) of long random strings.
dfAll = alt(randint(1000, 2000), 10000)
for i in range(330):
    print('step ', i)
    dfAll = pd.concat([alt(randint(1000, 2000), 10000), dfAll])
from sqlalchemy import create_engine

engine = create_engine('sqlite:///testtt.db')
for i in range(500):
    print('step', i)
    # Append 10,000-row slices, cycling through dfAll.
    dfAll.iloc[(i % 330) * 10000:((i % 330) + 1) * 10000].to_sql(
        'test_table22', engine, index=False, if_exists='append')
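To watch the growth directly, the loop can be instrumented to log the process's resident memory at each step. This is a sketch assuming psutil is available (it is preinstalled on Colab), reusing dfAll and engine from above:

import os
import psutil

process = psutil.Process(os.getpid())
for i in range(500):
    dfAll.iloc[(i % 330) * 10000:((i % 330) + 1) * 10000].to_sql(
        'test_table22', engine, index=False, if_exists='append')
    # Resident set size in MB; if there is a leak, this climbs across steps.
    print('step', i, 'rss MB:', round(process.memory_info().rss / 1e6))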
This was run in a Google Colab CPU environment.
The database itself isn't causing the memory leak: I can restart my environment, the previously inserted data is still there, and connecting to that database doesn't cause an increase in memory. The issue seems to be, under certain conditions, repeated insertions via looping to_sql,
or a single to_sql
call with chunksize specified.
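For example, after a restart I can run a check like this (a sketch matching the table name above) without memory growing:

import pandas as pd
from sqlalchemy import create_engine

# Reconnect to the existing database and confirm the inserted rows persisted.
engine = create_engine('sqlite:///testtt.db')
print(pd.read_sql('SELECT COUNT(*) FROM test_table22', engine))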
Is there a way that this code could be run without causing an eventual increase in memory usage?
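One diagnostic I haven't fully verified: to_sql also accepts a plain sqlite3 connection, which bypasses SQLAlchemy entirely. If the leak lives in the SQLAlchemy layer, this variant's memory should stay flat (a sketch; the database filename is hypothetical):

import sqlite3

conn = sqlite3.connect('testtt_raw.db')  # hypothetical separate database file
for i in range(500):
    # Same slicing as above, but through the stdlib sqlite3 driver.
    dfAll.iloc[(i % 330) * 10000:((i % 330) + 1) * 10000].to_sql(
        'test_table22', conn, index=False, if_exists='append')
conn.close()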
Edit:
To fully reproduce the error, run this notebook
https://drive.google.com/open?id=1ZijvI1jU66xOHkcmERO4wMwe-9HpT5OS
The notebook requires you to import this folder into the main directory of your Google Drive:
https://drive.google.com/open?id=1m6JfoIEIcX74CFSIQArZmSd0A8d0IRG8
The notebook will also mount your Google Drive, so you need to give it authorization to access your Drive. Since the data is hosted on my Google Drive, importing it should not count against your own storage quota.