
I need to insert a 60000x24 dataframe into a MySQL database (MariaDB) using sqlalchemy and Python. Both the database and the insertion code run locally. So far I have been using the LOAD DATA INFILE SQL query, but this requires the dataframe to be dumped into a CSV file, which takes about 1.5-2 seconds. The problem is that I have to insert 40 or more of these dataframes, so time is critical.

If I use df.to_sql then the problem gets much worse. The data insertion takes at least 7 (up to 30) seconds per dataframe.

The code I'm using is provided below:

sql_query ="CREATE TABLE IF NOT EXISTS table(A FLOAT, B FLOAT, C FLOAT)"# 24 columns of type float
cursor.execute(sql_query)
data.to_sql("table", con=connection, if_exists="replace", chunksize=1000)

This takes between 7 and 30 seconds to execute. Using LOAD DATA, the code looks like:

sql_query = "CREATE TABLE IF NOT EXISTS table(A FLOAT, B FLOAT, C FLOAT)"# 24 columns of type float
cursor.execute(sql_query)
data.to_csv("/tmp/data.csv")
sql_query = "LOAD DATA LOW_PRIORITY INFILE '/tmp/data.csv' REPLACE INTO TABLE 'table' FIELDS TERMINATED BY ','; "
cursor.execute(sql_query)

This takes 1.5 to 2 seconds, mainly due to dumping the dataframe to CSV. I could improve the load itself a bit by using LOCK TABLES, but then no data is added into the database. So, my question is: is there any way to speed this process up, either by tweaking LOAD DATA or to_sql?

UPDATE: By using an alternative function to dump the dataframes to CSV, taken from this answer to What is the fastest way to output large DataFrame into a CSV file?, I'm able to improve performance a little, but not significantly. Best,

Charlie
  • Try manual insertion by looping over chunks and executing bulk insert commands directly on the database connection (a sketch of this idea follows these comments). You can also utilize multiprocessing / multithreading. Also try disabling any indexes / constraints on the target tables before executing the inserts. – altunyurt Aug 11 '19 at 21:34
  • 2
    Please provide: `SHOW CREATE TABLE`. The SQL that would be used if you `INSERTed` directly instead of via `LOAD DATA`. Whether you are adding to a table or creading a new table. – Rick James Aug 11 '19 at 21:59
  • Don't know your needs, but have you considered sqlite3? In general it's '2x - 10x faster' than MySQL because of reduced overhead, a file-based database instead of a unix socket, etc. Here's some (old) info: https://www.sqlite.org/speed.html – Zach Oakes Aug 11 '19 at 22:37
  • @RickJames, please see my updated answer. – Charlie Aug 12 '19 at 08:10
  • @altunyurt: Could you please elaborate on your suggestion in more detail? – Charlie Aug 12 '19 at 08:10
  • @Charlie https://github.com/pandas-dev/pandas/blob/f34dbbf28b4c4d1b241f7bda155284fe0c131d18/pandas/io/sql.py#L1243 as you can see, there's a lot of overhead in translating your dataframe into database records. But you are also using LOAD DATA, which is the fastest way of inserting data (I missed this at first, thinking you were again executing to_sql). Try multiple concurrent sessions for LOAD DATA. Also have a look at the CSV and CONNECT storage engines for MariaDB. – altunyurt Aug 12 '19 at 09:08
  • @altunyurt: Thanks for your comment. I have measured the times for the data insertion instruction by instruction, and the main problem is dumping the data into a CSV file; this is what consumes most of the time. The LOAD DATA SQL query itself executes in a reasonable time. I'm going to check whether concurrency alleviates the problem. – Charlie Aug 12 '19 at 10:01
  • 2
    What is `if_exists="replace"` referring to? If it is checking every row against every other row to avoid dups, well that is _terribly slow_. And probably unnecessary. even it is necessary, it could be handled quickly after loading. – Rick James Aug 12 '19 at 15:16
  • Would something like this help? https://stackoverflow.com/questions/38204064/turn-pandas-dataframe-into-a-file-like-object-in-memory – Evan Aug 13 '19 at 17:51
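
The manual chunked bulk insert suggested in the first comment could look roughly like the following. This is a minimal sketch, not a drop-in replacement: it assumes a plain DB-API connection named connection (e.g. from PyMySQL or mysqlclient), the table name data_table and the chunk size are illustrative, and whether it comes close to LOAD DATA will depend on the driver and the table definition.

def bulk_insert_chunks(connection, df, table="data_table", chunk_size=5000):
    # Build a parameterized INSERT covering all dataframe columns.
    cols = ", ".join(f"`{c}`" for c in df.columns)
    placeholders = ", ".join(["%s"] * len(df.columns))
    sql = f"INSERT INTO `{table}` ({cols}) VALUES ({placeholders})"

    # Convert rows to plain Python floats so the driver can escape them.
    rows = [tuple(map(float, r)) for r in df.itertuples(index=False, name=None)]

    with connection.cursor() as cursor:
        # Optionally disable non-unique index maintenance while loading
        # (MyISAM only): cursor.execute(f"ALTER TABLE `{table}` DISABLE KEYS")
        for start in range(0, len(rows), chunk_size):
            cursor.executemany(sql, rows[start:start + chunk_size])
    connection.commit()  # a single commit at the end avoids per-chunk transactions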

1 Answer


If you know the data format (I assume all floats), you can use numpy.savetxt() to drastically reduce the time needed to create the CSV:

%timeit df.to_csv(csv_fname)
2.22 s ± 21.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)  

from numpy import savetxt
%timeit savetxt(csv_fname, df.values, fmt='%f', header=','.join(df.columns), delimiter=',')
714 ms ± 37.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Please note that you may need to prepend

df = df.reset_index()

so that the rows are numbered with unique keys and the output retains the .to_csv() formatting style.
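
For completeness, here is one way the savetxt() dump could be combined with the LOAD DATA query from the question. This is a minimal sketch, assuming the cursor and the /tmp/data.csv path from the question; comments='' keeps savetxt() from prefixing the header with '# ', and IGNORE 1 LINES makes LOAD DATA skip that header row.

import numpy as np

df = df.reset_index()  # keep the index as a column, matching to_csv()
np.savetxt("/tmp/data.csv", df.values, fmt='%f',
           header=','.join(df.columns), delimiter=',', comments='')
cursor.execute(
    "LOAD DATA LOW_PRIORITY INFILE '/tmp/data.csv' REPLACE INTO TABLE `table` "
    "FIELDS TERMINATED BY ',' IGNORE 1 LINES;"
)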

igrinis