
I have written the following code to insert data into MemSQL, which has syntax almost identical to MySQL's.

def get_connection(db=DATABASE):
    """Return a new connection to the database."""
    return database.connect(host=HOST, port=PORT, user=USER, password=PASSWORD, database=db)


def insert_data(data):
    print 'inserting data...'

    for item in data:
        vars_to_sql = []
        keys_to_sql = []
        print item
        for key, value in item.iteritems():
            if key == '__osHeaders':
                value = str(value)
            if isinstance(value, unicode):
                vars_to_sql.append(value.encode('ascii', 'ignore'))
                keys_to_sql.append(key.encode('ascii', 'ignore'))
            else:
                vars_to_sql.append(value)
                keys_to_sql.append(key)

        keys_to_sql = ', '.join(keys_to_sql)
        with get_connection() as conn:
            c = conn.execute("INSERT INTO tablename (%s) VALUES %r" % (keys_to_sql, tuple(vars_to_sql)))
            print c

The field names cannot be hard-coded since they might change according to the data I get from the other end; in any case, it's a dictionary I'm iterating over here. Since inserting one row at a time is very slow, I need to take the batch size as a variable, build the query statement accordingly, and insert the rows in batches. So the query for a batch size of 2 would be `INSERT INTO tablename (col1, col2) VALUES ('a', 'b'), ('c', 'd')`.

Please help me with how to introduce this here.
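For reference, the kind of multi-row statement described above could be assembled like this (a sketch only; `build_batch_insert` is a hypothetical helper, and the table and column names are assumed to come from a trusted source, since SQL identifiers cannot be bound as query parameters):

```python
def build_batch_insert(table, columns, rows):
    """Build one parameterized multi-row INSERT statement.

    `table` and `columns` must be trusted (identifiers cannot be
    passed as parameters); the cell values are bound via %s
    placeholders so the driver can escape them.
    """
    col_list = ', '.join(columns)
    # One "(%s, %s, ...)" group per row, matching the column count.
    row_placeholder = '(%s)' % ', '.join(['%s'] * len(columns))
    placeholders = ', '.join([row_placeholder] * len(rows))
    sql = 'INSERT INTO %s (%s) VALUES %s' % (table, col_list, placeholders)
    # Flatten the rows into one parameter list, in placeholder order.
    params = [value for row in rows for value in row]
    return sql, params
```

For a batch of two rows this produces `INSERT INTO tablename (col1, col2) VALUES (%s, %s), (%s, %s)` with the flattened values as parameters, which the driver can then execute as a single statement.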

Marlon Abeykoon

2 Answers


If you are using the MemSQL Python library, you can use the `multi_insert` helper provided in the `memsql.common.query_builder` package. For example:

from memsql.common.query_builder import multi_insert
from memsql.common.database import connect

sql, params = multi_insert("my_table", { "foo": 1 }, { "foo": 2 })
# sql = 'INSERT INTO `my_table` (`foo`) VALUES (%(_QB_ROW_0)s), (%(_QB_ROW_1)s)'
# params = {'_QB_ROW_0': [1], '_QB_ROW_1': [2]}

with connect(...) as conn:
    conn.execute(sql, **params)

Note that `multi_insert` requires that each record define the same set of columns, since it translates them into a tuple-based insert for the query.
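Since every record passed to a single `multi_insert` call must share the same columns, heterogeneous records could first be grouped by their key set and inserted one group at a time. A sketch of the grouping step only (the helper name is illustrative; the `multi_insert` calls per group are shown as comments):

```python
from collections import defaultdict

def group_by_columns(records):
    """Group dict records by their set of keys, so that each group
    has a uniform column set suitable for one multi_insert call."""
    groups = defaultdict(list)
    for record in records:
        groups[frozenset(record)].append(record)
    return list(groups.values())

# Then, per group (sketch):
# for group in group_by_columns(records):
#     sql, params = multi_insert("my_table", *group)
#     conn.execute(sql, **params)
```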

Carl Sverre
  • OP: `Fieldnames also change which I use keys_to_sql to define them`. Your answer: `Note that multi_insert requires that each record has the same set of columns` so I don't think this will solve the question – oliverpool Oct 15 '15 at 07:15
  • but you have a good point: `multi_insert` seems to be the `bulk_insert` for memsql! Just need to check if it accepts `DEFAULT` values :-) – oliverpool Oct 15 '15 at 07:19
  • I get a JSON string which I convert to a list of dictionaries `[{ "foo": 1 }, { "foo": 2 }]` using `json.loads()`. So how should I take from that to input to this function `multi_insert()`? – Marlon Abeykoon Oct 16 '15 at 06:02
  • Probably with some `unpacking` like `multi_insert("my_table", *json_list)` – oliverpool Oct 16 '15 at 06:13
  • But [looking at the source](https://github.com/memsql/memsql-python/blob/master/memsql/common/query_builder.py#L45), it will only consider the columns listed in the first element – oliverpool Oct 16 '15 at 06:17
  • Thanks for the comments guys, all good points. With MemSQL and mysql, a single insert statement is optimized for a single set of columns. If you want to insert different subsets of your column, simply run multiple multi_insert calls for each different subset. Hope that makes sense. – Carl Sverre Oct 16 '15 at 20:49
  • Oh, and yes, DEFAULT values will be used for any missing columns with this method. – Carl Sverre Oct 16 '15 at 20:50
  • @CarlSverre Thanks for providing an answer using the memsql-python library, which I successfully implemented with a few changes and good performance. +1 for you. – Marlon Abeykoon Oct 17 '15 at 08:19
  • @oliverpool Thanks for your answer which I learnt from that and specially your comment on unpacking the Json. +1 for your answer as well. – Marlon Abeykoon Oct 17 '15 at 08:21
  • You're welcome. "Which I successfully implemented with a few changes" – maybe you could post your final code as an answer (for the people who might have the same issue as you ^^) – oliverpool Oct 17 '15 at 08:51

The following answer might help you: https://stackoverflow.com/a/8777776/3207406

  1. You can have a tuple containing all the columns that might be used at some point,
  2. create a list of tuples with the actual values you want to insert, using DEFAULT as the value for missing columns,
  3. and then bulk-insert them.
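A sketch of those three steps, assuming the server accepts the DEFAULT keyword inside a VALUES list (the column tuple and helper name here are illustrative, not from the linked answer):

```python
# Step 1: every column that might appear in some record (illustrative names).
ALL_COLUMNS = ('col1', 'col2', 'col3')

def build_default_insert(table, records):
    """Steps 2-3: build one multi-row INSERT in which columns missing
    from a record fall back to the SQL keyword DEFAULT."""
    rows_sql = []
    params = []
    for record in records:
        cells = []
        for col in ALL_COLUMNS:
            if col in record:
                cells.append('%s')       # real value -> bound placeholder
                params.append(record[col])
            else:
                cells.append('DEFAULT')  # missing -> server-side default
        rows_sql.append('(%s)' % ', '.join(cells))
    sql = 'INSERT INTO %s (%s) VALUES %s' % (
        table, ', '.join(ALL_COLUMNS), ', '.join(rows_sql))
    return sql, params
```

Only the present values are bound as parameters; absent columns are emitted literally as DEFAULT, so the rows in one statement may cover different column subsets.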
oliverpool