4

I am working on a project where I have to write a data frame with Millions of rows and about 25 columns mostly of numeric type. I am using Pandas DataFrame to SQL Function to dump the dataframe in Mysql table. I have found this function creates an Insert statement that can insert multiple rows at once. This is a good approach but MySQL has a limitation on the length of query that can be built using this approach.

Is there a way such that insert that in parallel in the same table so that I can speed up the process?

Dipesh Poudel
  • 429
  • 9
  • 15
  • 1
    There are some interesting suggestions in a similar question [here](https://stackoverflow.com/questions/31997859/bulk-insert-a-pandas-dataframe-using-sqlalchemy) – realr Jul 28 '19 at 22:09

1 Answers1

6

You can do a few things to achieve that.

One way is to use an additional argument while writing to sql.

df.to_sql(method = 'multi')

According to this documentation, passing 'multi' to method argument allows you to bulk insert.

Another solution is to construct a custom insert function using multiprocessing.dummy. here is the link to the documentation :https://docs.python.org/2/library/multiprocessing.html#module-multiprocessing.dummy

import math
from multiprocessing.dummy import Pool as ThreadPool

...

def insert_df(df, *args, **kwargs):
    nworkers = 4 # number of workers that executes insert in parallel fashion

    chunk = math.floor(df.shape[0] / nworkers) # number of chunks
    chunks = [(chunk * i, (chunk * i) + chunk) for i in range(nworkers)]
    chunks.append((chunk * nworkers, df.shape[0]))
    pool = ThreadPool(nworkers)

    def worker(chunk):
        i, j = chunk
        df.iloc[i:j, :].to_sql(*args, **kwargs)

    pool.map(worker, chunks)
    pool.close()
    pool.join()

....

insert_df(df, "foo_bar", engine, if_exists='append')

The second method was suggested at https://stackoverflow.com/a/42164138/5614132.