I am trying to use pandas' df.to_sql and SQLite3 in Python to put about 2 GB of data (roughly 16 million rows) into a database. My strategy has been to chunk the original CSV into smaller DataFrames, perform some operations on them, and then append them to an SQL database.
As I run this code, it starts out fast but quickly slows down. After about 3 million rows it slows down to the point that it doesn't seem like it will finish in any realistic amount of time. What is causing this, and what can I do about it? My code is below:
import os
import sqlite3
import datetime as dt
import pandas as pd

def chunk_read_CSV_to_db(database, table, filepath, chunksize, delimiter=','):
    start = dt.datetime.now()
    conn = sqlite3.connect(database)
    index_start = 1
    j = 0
    # Read the CSV in chunks and append each chunk to the SQLite table
    for df in pd.read_csv(filepath, chunksize=chunksize, iterator=True, encoding='utf-8', sep=delimiter):
        j += 1
        print '{} seconds: complete {} rows'.format((dt.datetime.now() - start).seconds, j*chunksize)
        df.to_sql(name=table, con=conn, flavor='sqlite', if_exists='append')
    conn.close()

# fp is the directory holding the data files (defined elsewhere in my script)
db_name = 'store_data.db'
f9 = 'xrf_str_geo_ta4_1511.txt'
chunksize = 20000
chunk_read_CSV_to_db(os.path.join(fp, db_name), os.path.splitext(f9)[0], os.path.join(fp, f9), chunksize=chunksize, delimiter='\t')