Appending to a SQLite database iteratively with pandas' to_sql() appears to get slower with every iteration.
The dataframe I append in each iteration is roughly the same size, but the database itself obviously keeps growing.
Here is the code snippet I am using:
import os

import pandas as pd
from tqdm import tqdm

# par_dir is the parquet directory containing thousands of parquet files for a table
parquet_files = os.listdir(par_dir)
for f in tqdm(parquet_files):
    f_path = os.path.join(par_dir, f)
    if f.startswith('.'):             # skip hidden files
        continue
    if os.path.getsize(f_path) == 0:  # skip empty files
        continue
    df = pd.read_parquet(f_path)
    # self._cx is a connection to a sqlite database
    # self.if_exists is 'replace' on the first loop and 'append' on every later loop
    df.to_sql(table_name,
              con=self._cx,
              if_exists=self.if_exists,
              chunksize=10000)
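For context, here is a minimal sketch of the class state the snippet assumes; the class name, method name, and database path are hypothetical, and only the _cx and if_exists attributes come from the code above.

import sqlite3

class ParquetToSqliteLoader:
    """Hypothetical wrapper holding the attributes referenced above as
    self._cx and self.if_exists."""

    def __init__(self, db_path: str):
        self._cx = sqlite3.connect(db_path)  # SQLite connection passed to to_sql()
        self.if_exists = 'replace'           # first write recreates the table

    def mark_first_write_done(self) -> None:
        # Called after the first to_sql() call so that later writes append.
        self.if_exists = 'append'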
The time each insert takes grows as more files are added, as shown here:
[plot: per-file insert time vs. file index]
Or in log-log space:
[same plot on log-log axes]
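For completeness, here is a self-contained sketch of how per-file insert timings like those plotted above could be collected; the directory, database file, and table name are placeholders, and a plain sqlite3 connection with a local if_exists variable stands in for the class attributes used in the snippet.

import os
import sqlite3
import time

import pandas as pd
from tqdm import tqdm

par_dir = 'parquet_dir'            # placeholder parquet directory
cx = sqlite3.connect('timing.db')  # placeholder target database
table_name = 'my_table'            # placeholder table name

timings = []                       # wall-clock seconds per inserted file
if_exists = 'replace'              # first write replaces, later writes append
for f in tqdm(sorted(os.listdir(par_dir))):
    f_path = os.path.join(par_dir, f)
    if f.startswith('.') or os.path.getsize(f_path) == 0:
        continue
    df = pd.read_parquet(f_path)
    start = time.perf_counter()
    df.to_sql(table_name, con=cx, if_exists=if_exists, chunksize=10000)
    timings.append(time.perf_counter() - start)
    if_exists = 'append'

# Plotting timings against the file index gives curves like the ones above.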
I assumed that the time complexity of a SQL insert would depend only on the number of rows being inserted (df in the code above), not on the size of the existing database.
Why does insertion time grow with the size of the database being inserted into, and how do I fix it?