15

I need to efficiently insert about 500k (give or take 100k) rows of data into my PostgreSQL database. After a generous amount of googling, I've arrived at the solution below, which averages about 150 (wall-clock) seconds.

import psycopg2.extras

def db_insert_spectrum(curs, visual_data, recording_id):
    sql = """
        INSERT INTO spectrums (row, col, value, recording_id)
        VALUES %s
    """

    # Mass-insertion technique
    # visual_data is a 2D array (an nx63 matrix)
    values_list = []
    for rowIndex, rowData in enumerate(visual_data):
        for colIndex, colData in enumerate(rowData):  # colData is the value
            # Each parameter list holds one tuple, which psycopg2 adapts
            # to a row literal for the single %s placeholder
            value = [(rowIndex, colIndex, colData, recording_id)]
            values_list.append(value)

    psycopg2.extras.execute_batch(curs, sql, values_list, page_size=1000)

Is there a faster way?

forteddyt

2 Answers

24

Based on the answers given here, COPY is the fastest method; it reads from a file or file-like object.

Since memory I/O is many orders of magnitude faster than disk I/O, it is faster to write the data to a StringIO file-like object than to write to an actual file. The psycopg docs show an example of calling copy_from with a StringIO as input.

Therefore, you could use something like:

try:
    # Python 2
    from cStringIO import StringIO
except ImportError:
    # Python 3
    from io import StringIO

def db_insert_spectrum(curs, visual_data, recording_id):
    # Build the tab-separated, newline-terminated text that COPY expects
    # in an in-memory buffer rather than an on-disk file
    f = StringIO()
    # visual_data is a 2D array (an nx63 matrix)
    for rowIndex, rowData in enumerate(visual_data):
        items = []
        for colIndex, colData in enumerate(rowData):
            value = (rowIndex, colIndex, colData, recording_id)
            items.append('\t'.join(map(str, value)) + '\n')
        f.writelines(items)
    f.seek(0)  # rewind so copy_from reads from the start of the buffer
    curs.copy_from(f, 'spectrums', columns=('row', 'col', 'value', 'recording_id'))
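
A minimal calling sketch, assuming a psycopg2 connection named `conn` (the DSN and sample data below are placeholders):

import psycopg2

conn = psycopg2.connect("dbname=mydb user=myuser")  # placeholder DSN
visual_data = [[0] * 63 for _ in range(500)]        # placeholder nx63 matrix
with conn.cursor() as curs:
    db_insert_spectrum(curs, visual_data, recording_id=71)
conn.commit()  # COPY runs inside the transaction; nothing persists until commit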
unutbu
  • I was trying to do something similar before I asked this question. I got a different error when I tried implementing it. For your way, the error is `psycopg2.DataError: invalid input syntax for integer: "(0, 0, 0, 71)"`. The row, col, value, and recording_id columns are all ints. – forteddyt Nov 05 '17 at 01:47
  • An earlier version of my code used `value = [(rowIndex, colIndex, colData, recording_id)]`. Instead, remove the brackets, as above. (You might have to refresh your web page.) I think that should fix the error. – unutbu Nov 05 '17 at 01:55
  • Ahh yes, thanks for that. With that change I'm at the point I was when I tried implementing this. It's giving me the following error: `psycopg2.DataError: extra data after last expected column CONTEXT: COPY spectrums, line 63: "0 62 0 731 0 0 73"` – forteddyt Nov 05 '17 at 01:57
  • Are you using `f.writelines`? That too was a mistake on my part -- it lacks the `\n`s needed to separate lines. Again, make sure you've refreshed the web page and try `f.write('\n'.join(items))` (see above). – unutbu Nov 05 '17 at 02:00
  • Unfortunately, I am using what you have in your updated version :(. It seems like it's struggling when it tries to insert the next row (as it seemingly goes through the first cols 0-62 successfully) – forteddyt Nov 05 '17 at 02:03
  • Doh, my mistake again. `items.append('\t'.join(map(str, value))+'\n')` will include a `\n` after every line. See the edited code above; the expected buffer contents are sketched after these comments. – unutbu Nov 05 '17 at 02:07
  • 1
    Yes! Absolutely phenomenal! Thank you so much, it's only taking a 10th of the time now! Thank you so much for your help. I'd upvote the answer (in addition to selecting as answer) but I don't have enough rep... – forteddyt Nov 05 '17 at 02:11
  • I'd stress the need for `f.seek(0)` to move the buffer position back to the beginning. Otherwise, `copy_from` will not insert anything (writing leaves the position at the end). – Jan Sila Apr 19 '18 at 07:10
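
To see exactly what `copy_from` will read (handy when debugging errors like the ones in the comments above), you can dump the buffer before copying; with the final code each record is one tab-separated, newline-terminated line. A quick sketch, with illustrative values:

# Inspect the first few records before the copy (values are illustrative)
print(f.getvalue().splitlines()[:3])
# ['0\t0\t0\t71', '0\t1\t5\t71', '0\t2\t12\t71']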
3

I don't know whether `.execute_batch` can accept a generator, but you could try something like:

import psycopg2.extras

def db_insert_spectrum(curs, visual_data, recording_id):
    # execute_batch needs one placeholder per column: it mogrifies the
    # statement once per parameter tuple and sends them in pages
    sql = """
        INSERT INTO spectrums (row, col, value, recording_id)
        VALUES (%s, %s, %s, %s)
    """
    data_gen = ((rIdx, cIdx, value, recording_id)
                for rIdx, cData in enumerate(visual_data)
                for cIdx, value in enumerate(cData))
    psycopg2.extras.execute_batch(curs, sql, data_gen, page_size=1000)

It might be faster.
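
A related helper worth benchmarking is `psycopg2.extras.execute_values`, which keeps a single `%s` placeholder and expands it into a multi-row `VALUES` list, so each statement carries many rows. A minimal sketch, assuming psycopg2 2.7 or later (the function name `db_insert_spectrum_values` is just for illustration):

import psycopg2.extras

def db_insert_spectrum_values(curs, visual_data, recording_id):
    # Unlike execute_batch, execute_values keeps a single %s and expands
    # it into a multi-row VALUES list of page_size rows per statement
    sql = """
        INSERT INTO spectrums (row, col, value, recording_id)
        VALUES %s
    """
    data_gen = ((rIdx, cIdx, value, recording_id)
                for rIdx, cData in enumerate(visual_data)
                for cIdx, value in enumerate(cData))
    psycopg2.extras.execute_values(curs, sql, data_gen, page_size=1000)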

Marian
Nf4r
  • IS faster. And much more understandable to me. Easier to convert code written for executemany into this style. – pauljohn32 Sep 17 '20 at 13:38