
I'm using pandas' `to_sql` function to write to MySQL, and the call is timing out because of the large frame size (1M rows, 20 columns).

http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_sql.html

Is there a more official way to chunk through the data and write rows in blocks? I've written my own code, which seems to work. I'd prefer an official solution though. Thanks!

import pandas as pd
import sqlalchemy

def write_to_db(engine, frame, table_name, chunk_size):
    start_index = 0
    end_index = chunk_size if chunk_size < len(frame) else len(frame)

    # convert NaN to None so the values are stored as NULL in MySQL
    frame = frame.where(pd.notnull(frame), None)
    if_exists_param = 'replace'

    while start_index != end_index:
        print "Writing rows %s through %s" % (start_index, end_index)
        frame.iloc[start_index:end_index, :].to_sql(con=engine, name=table_name, if_exists=if_exists_param)
        if_exists_param = 'append'  # only the first chunk replaces the table

        start_index = min(start_index + chunk_size, len(frame))
        end_index = min(end_index + chunk_size, len(frame))

engine = sqlalchemy.create_engine('mysql://...') # database details omitted
write_to_db(engine, frame, 'retail_pendingcustomers', 20000)
Krishan Gupta
  • fwiw, i've had to do this exact thing. – Paul H Jun 03 '14 at 06:08
  • There is no official solution at the moment, but we would certainly accept a patch to implement this (there is an issue about chunksize for `read_sql` (https://github.com/pydata/pandas/issues/2908), you can always open one for this). – joris Jun 03 '14 at 06:51
  • I created an issue. @joris- if you wouldn't mind please add a comment there and back me up :). https://github.com/pydata/pandas/issues/7347 – Krishan Gupta Jun 05 '14 at 03:57
  • Why don't you just export the data as a csv file (it has to be in the format of the table) and then use the load data infile command? – chuse Aug 01 '14 at 10:06
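A rough sketch of the CSV-plus-LOAD DATA route suggested in the last comment (the connection string, file path, and table name are placeholders, and LOCAL INFILE must be enabled on both the MySQL server and the client driver):

import pandas as pd
import sqlalchemy

engine = sqlalchemy.create_engine('mysql://user:password@host/dbname')  # placeholder credentials
frame = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})  # stand-in for the real 1M-row frame

# write the frame to CSV in the column order of the target table, without index or header
frame.to_csv('/tmp/retail_pendingcustomers.csv', index=False, header=False)

# bulk-load the file in a single statement
with engine.begin() as conn:
    conn.execute(sqlalchemy.text(
        "LOAD DATA LOCAL INFILE '/tmp/retail_pendingcustomers.csv' "
        "INTO TABLE retail_pendingcustomers "
        "FIELDS TERMINATED BY ',' LINES TERMINATED BY '\\n'"
    ))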

3 Answers

Update: this functionality has been merged into pandas master and will be released in 0.15 (probably end of September), thanks to @artemyk! See https://github.com/pydata/pandas/pull/8062

So starting from 0.15, you can specify the chunksize argument and e.g. simply do:

df.to_sql('table', engine, chunksize=20000)
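
For context, a fuller sketch of that call (the engine URL, frame, and table name below are placeholders):

import pandas as pd
import sqlalchemy

engine = sqlalchemy.create_engine('mysql://user:password@host/dbname')  # placeholder credentials
df = pd.DataFrame({'a': range(100000), 'b': range(100000)})  # stand-in for the real frame

# with pandas >= 0.15, rows are inserted in batches of `chunksize` instead of one huge statement
df.to_sql('retail_pendingcustomers', engine, chunksize=20000, index=False)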
joris

There is a beautiful, idiomatic function `chunks` provided in an answer to this question.

In your case you can use this function like this:

def chunks(l, n):
    """Yield successive n-sized chunks from l."""
    for i in xrange(0, len(l), n):
        yield l.iloc[i:i+n]

def write_to_db(engine, frame, table_name, chunk_size):
    for idx, chunk in enumerate(chunks(frame, chunk_size)):
        if idx == 0:
            if_exists_param = 'replace'
        else:
            if_exists_param = 'append'
        chunk.to_sql(con=engine, name=table_name, if_exists=if_exists_param)

The only drawback is that it doesn't support slicing a second index inside iloc.
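
Usage then mirrors the question (same engine, frame, and table name as defined there):

write_to_db(engine, frame, 'retail_pendingcustomers', 20000)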

nes

Reading from one table and writing to another in chunks:

Here myconn1 is the connection to the source table, myconn2 the connection to the target table, and ch = 10000 is the chunk size.

if_exists = "replace"  # the first chunk replaces the target table
for chunk in pd.read_sql_table(table_name=source, con=myconn1, chunksize=ch):
    chunk.to_sql(name=target, con=myconn2, if_exists=if_exists, index=False,
                 chunksize=ch)
    if_exists = "append"  # later chunks are appended rather than wiping the table
    LOGGER.info("Done 1 chunk")
Kishore