
Appending to a SQLite database iteratively with pandas' to_sql() appears to get slower with every iteration.

The dataframe that I am appending within each iteration is approximately the same size, but the database is obviously increasing in size.

Here is the code snippet I am using:

  import os

  import pandas as pd
  from tqdm import tqdm

  # par_dir is the parquet directory containing thousands of parquet files for a table
  parquet_files = os.listdir(par_dir)
  for f in tqdm(parquet_files):
    f_path = os.path.join(par_dir, f)
    # skip hidden files and empty files
    if f.startswith('.'): continue
    if os.path.getsize(f_path) == 0: continue

    df = pd.read_parquet(f_path)

    # self._cx is a connection to a sqlite database
    # self.if_exists is 'replace' on the first loop and 'append' on every later loop
    df.to_sql(table_name,
              con=self._cx,
              if_exists=self.if_exists,
              chunksize=10000)

The time taken by each insert grows with each file added, as shown here:

[Plot: per-file insert time increases as more files are appended]

Or in log-log space:

[Same plot on log-log axes]

I assumed that SQL insert time would depend only on the number of rows being inserted (df in the code above), not on the size of the existing database.

Why does the insertion time depend on the size of the database being inserted into, and how do I fix it?

  • It depends on your table schema, indexes, distribution of primary key values, how often the b-tree pages need to be split and rebalanced, etc. etc. – Shawn Dec 05 '19 at 03:12
  • Since you also have 'replace' in your logic, it is not independent of the database size. Also, I cannot see whether you close the connection, or how many concurrent connections are active, which can also affect performance. – Adam Dukkon Dec 05 '19 at 13:39
  • @Shawn There are no primary keys in the data, and the 'replace' is only used on the first file addition, the rest are appending. This is a sqlite database, so AFAIK there are no concurrent connections allowed. The connection is held open the entire time by the database object, for which this function is a member. – chase Dec 05 '19 at 20:03

1 Answer


There are several options you can try to speed things up:

  1. Make all inserts in one transaction like here (a Python sketch of this follows at the end of this answer)
  2. Insert into a new temporary table and then insert into ... select ...

Example:

INSERT INTO destinationTable (col1, col2)
SELECT col1, col2
FROM   sourceTable
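
Driven from pandas, option 2 might look roughly like the sketch below. This is only an illustration, not code from the question: it assumes self._cx is a plain sqlite3 connection, that destination_table already exists with matching columns, and staging_table is just a placeholder name.

  # Sketch only: write each dataframe to a throwaway staging table,
  # then copy it into the destination with a single INSERT ... SELECT.
  # Assumes self._cx is a sqlite3.Connection and that destination_table
  # already exists with the same columns in the same order.
  df.to_sql('staging_table', con=self._cx, if_exists='replace', index=False)
  with self._cx:  # the copy runs and commits as one transaction
      self._cx.execute(
          'INSERT INTO destination_table SELECT * FROM staging_table')
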
  3. Like 2, but after creating the new table, concatenate the two tables into a new one

Example:

CREATE TABLE new_table AS
  SELECT v1
  FROM t1
  UNION
  SELECT v2
  FROM t2
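
For option 1, a minimal sketch of doing every insert in one transaction with the standard sqlite3 module instead of to_sql might look like this. The database path 'my.db' and table name 'my_table' are placeholders, the destination table is assumed to already exist, and par_dir is the same parquet directory as in the question:

  import os
  import sqlite3

  import pandas as pd
  from tqdm import tqdm

  # Sketch only: all rows from all files are inserted inside a single
  # transaction, so SQLite commits (and syncs the journal) once at the
  # end instead of once per file.
  cx = sqlite3.connect('my.db')
  try:
      with cx:  # one transaction for the whole loop
          for f in tqdm(os.listdir(par_dir)):
              f_path = os.path.join(par_dir, f)
              if f.startswith('.') or os.path.getsize(f_path) == 0:
                  continue
              df = pd.read_parquet(f_path)
              placeholders = ', '.join('?' * len(df.columns))
              cx.executemany(
                  f'INSERT INTO my_table VALUES ({placeholders})',
                  df.itertuples(index=False, name=None))
  finally:
      cx.close()

Whether this removes the slowdown depends on what is actually causing it (index maintenance, for example, would still scale with table size), but it avoids paying the per-call transaction overhead of repeated to_sql calls.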
keiv.fly