
I'm trying to speed up writing my DataFrame to a SQLite database. To do so, I'm trying to use Method 4 on this website. I also found a more elegant way to set cursor.fast_executemany = True, thanks to a post on Stack Overflow.

The problem is that this error occurs: TypeError: Invalid argument(s) 'fast_executemany' sent to create_engine(), using configuration SQLiteDialect_pysqlite/NullPool/Engine. Please check that the keyword arguments are appropriate for this combination of components.

I'm using sqlalchemy version 1.3.23 and pandas version 1.14.0.

import pandas as pd
from sqlalchemy import create_engine # database connection
from sqlalchemy import event
from IPython.display import display

disk_engine = create_engine('sqlite:///2021_database.db', fast_executemany=True)
data.to_sql('data', disk_engine, if_exists='replace')

Here, data is my DataFrame. When I run this code without fast_executemany=True, it works.

Does anyone have an idea?


1 Answer


fast_executemany=True is specific to the mssql+pyodbc:// dialect. It will not work with other dialects like sqlite://. For other databases you would normally use method="multi" (or a custom function for PostgreSQL as described in this answer).
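
For comparison, here is a minimal sketch of where fast_executemany= is actually accepted, a mssql+pyodbc:// engine. The connection string, table name, and sample DataFrame below are placeholders, not taken from the question:

import pandas as pd
from sqlalchemy import create_engine

# placeholder SQL Server connection string; fast_executemany= is only
# honored by the mssql+pyodbc dialect
mssql_engine = create_engine(
    "mssql+pyodbc://user:password@my_server/my_db?driver=ODBC+Driver+17+for+SQL+Server",
    fast_executemany=True,
)

df = pd.DataFrame({"col1": [1, 2, 3], "col2": ["a", "b", "c"]})
df.to_sql("table1", mssql_engine, index=False, if_exists="replace")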

However, SQLite appears to have a limit of 999 parameter values in a single SQL statement so you would also need to use the chunksize= argument:

import pandas as pd
from sqlalchemy import create_engine

# engine pointing at the same SQLite file as in the question;
# df is your DataFrame (called `data` in the question)
engine = create_engine("sqlite:///2021_database.db")

# value to pass to the index= argument
use_index = False

# calculate chunksize=, taking into account whether we'll be
# uploading the index (999 is SQLite's parameter limit)
chunk_size = 999 // (len(df.columns) + (1 if use_index else 0))

df.to_sql(
    "table1",
    engine,
    index=use_index,
    if_exists="replace",
    method="multi",
    chunksize=chunk_size,
)
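
As a quick sanity check (assuming the same engine, df, and table name as above), you can read the row count back after the insert to confirm the chunked write completed:

row_count = pd.read_sql("SELECT COUNT(*) AS n FROM table1", engine)
print(row_count)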

Edit:

Based on comments below, I tested df.to_sql() inserting 1_000_000 rows into a SQLite database on my local hard drive.

method=None (the default): 1_000_000 rows written in 53 seconds

method="multi" (as above): 1_000_000 rows written in 505 seconds

Apparently Python's sqlite3 module has some sort of optimization similar to fast_executemany=True already built in, so an .executemany() with 1_000_000 rows (df.to_sql(…, method=None)) is about an order of magnitude faster than ~5 thousand .execute() calls inserting 199 rows each (df.to_sql(…, method="multi")).
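
For illustration only (this is not the benchmark above), a minimal sqlite3 sketch with made-up data showing the .executemany() path that pandas' default method=None ends up calling through the DBAPI:

import sqlite3

# made-up sample rows, just to exercise executemany()
rows = [(i, f"name{i}") for i in range(100_000)]

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (id INTEGER, name TEXT)")

# one prepared INSERT with many parameter sets: the fast path
con.executemany("INSERT INTO t VALUES (?, ?)", rows)
con.commit()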

  • Thanks. Is there any substitute to `fast_executemany` with `sqlite://`? I also tried to replace `sqlite://` by `mssql+pyodbc://`, but it doesn't seem to be as simple as that. – LBOULET Feb 24 '21 at 07:53
  • With SQLite you would need to use to_sql's `method="multi"` option. – Gord Thompson Feb 24 '21 at 12:54
  • Indeed, I tried it: when I use it, the code runs but nothing happens, and I'm forced to kill the script. I made a test with a small amount of data. Without `multi` it takes 40s, but with `multi` I can leave it running for 20 minutes and it's still working, with nothing written to the database. – LBOULET Feb 25 '21 at 07:44
  • You will probably also have to use `chunksize=`. I have updated my answer. – Gord Thompson Feb 25 '21 at 14:45
  • Hi again. Thank you for the updated solution. I just tried the solution you proposed. It returns an error: "OperationalError: too many SQL variables". The value of `chunksize` is 45. Any idea? – LBOULET Mar 05 '21 at 10:05
  • The test I made is based on a table of 1300000 rows × 23 columns; clearly it's not something huge. – LBOULET Mar 05 '21 at 10:13
  • `999 // 23` is `43`, not `45` – Gord Thompson Mar 05 '21 at 12:17
  • You're right, it works. My last question: for this test (1300000 × 23 data), when I use the `multi` option it takes 232 seconds, while without it it takes 32s. If it's a question of balance between the execution time of this option and the amount of data, where is the compromise? In other words, how much data is necessary to make the `multi` option useful? – LBOULET Mar 05 '21 at 16:25
  • If it only takes 32 seconds with `method=None` (the default) and it takes 232 seconds with `method="multi"` then just use the former. The point of using `method="multi"` (and `fast_executemany=True` for that matter) is to reduce the number of round-trips to the database server. SQLite (a file-based database) is a bit different from other (client/server) databases and it could well be that the pandas (Python) cost of building ~30 thousand multi-row INSERT statements is greater than the cost of the SQLite engine processing 1.3 million individual INSERT statements. – Gord Thompson Mar 05 '21 at 17:11