
I am looping through different text files, loading the text into dictionaries of dictionaries, and afterwards connecting to a database and inserting those texts into a table.

I use dictionaries of dictionaries because every file contains multiple text blocks, and each block has different parts of text that I need, like name, type, content and so on. And because I need the meta information kept in order, I save the data in those two layers: file level and text block level.
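
Roughly, the structure of all_data looks like this (file and block names here are just placeholders; the inner keys are the ones I actually use):

all_data = {
    'file_one.txt': {
        'block_1': {'query_name': '...', 'type': '...', 'descr_de': '...', 'content': '...'},
        'block_2': {'query_name': '...', 'type': '...', 'descr_de': '...', 'content': '...'},
    },
    'file_two.txt': {
        'block_1': {'query_name': '...', 'type': '...', 'descr_de': '...', 'content': '...'},
    },
}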

This is my code - first a string to prepare the DB INSERT:

table_q = 'zz_querys'
query_id = 'query_name'
type_col = 'query_type'
descr_col = 'query_descr'
code_col = 'pure_code'

insertQuery = 'INSERT INTO [' + table_q + '] ([' + query_id + '], ' \
                                        '[' + type_col + '], ' \
                                        '[' + descr_col + '], ' \
                                        '[' + code_col + '])' \
                        '\nVALUES (?, ?, ?, ?)'
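
For reference, this concatenation produces the following parameterised statement:

INSERT INTO [zz_querys] ([query_name], [query_type], [query_descr], [pure_code])
VALUES (?, ?, ?, ?)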

and then the connection to the DB:

with conn:
    cursor = conn.cursor()
    for file, text_block in all_data.items():
        file = os.path.splitext(file)[0]
        for subtext in text_block:
            cat = text_block[subtext]
            qName = cat['query_name']
            qType = cat['type']
            qDescr = cat['descr_de']
            qCont = cat['content']
            cursor.execute(insertQuery, (qName, qType, qDescr, qCont))

Performance-wise I feel this is not efficient - I call the SQL INSERT every time I go through the loop (one execute per inserted row). But since I don't come from the technical side originally, I'm not sure whether there is a much more efficient way.

My alternative idea was to load my nested dictionary results into a pandas DataFrame (or save the results into a DataFrame from the beginning) and then insert the whole DataFrame into the database at once. I haven't done this with pandas before, but I assume it should be possible.

Would this make a huge difference performance-wise? Or what else would be the most efficient way to do this?

Matt444
  • Doing a bulk insert of some sort -- such as loading a data frame or loading from a file -- should be faster than constructing a separate insert for each row. – Gordon Linoff Apr 03 '20 at 11:40

1 Answer


Have you tried cursor.executemany(...) instead of cursor.execute(...)?

Also, you can simply set cursor.fast_executemany = True to boost the performance of executemany. Refer here for more.
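
A sketch of your loop rewritten for executemany, assuming pyodbc (fast_executemany is a pyodbc-specific attribute) and the same all_data and insertQuery from your question:

rows = []
for file, text_block in all_data.items():
    for subtext, cat in text_block.items():
        # same keys as in your current loop
        rows.append((cat['query_name'], cat['type'], cat['descr_de'], cat['content']))

with conn:
    cursor = conn.cursor()
    cursor.fast_executemany = True   # send all parameter sets in one batch
    cursor.executemany(insertQuery, rows)

This way the INSERT statement is prepared once and all rows are sent in a single executemany call instead of one round trip per row.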

OR

If you want to try it using pandas, try this:

import sqlalchemy
import pandas as pd
from pathlib import Path

# "<Dialect>" is a placeholder for your connection string / database URL
engine = sqlalchemy.create_engine("<Dialect>", isolation_level="AUTOCOMMIT")

results = []
for file in Path("/path/to/folder").iterdir():
    with open(file, mode='r') as file_open:
        results.append(file_open.read())

table = "zz_querys"  # target table from the question
df = pd.DataFrame(results)
df.to_sql(table, engine, index=False, if_exists="append")

For the available dialects and DBAPIs, see here.

For more info on performance improvements, click here.
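
If your data is already in the nested dictionary from your question, a sketch like this (reusing the engine from above and the key names from your loop) flattens it into one row per text block and writes everything in a single call:

records = []
for file, text_block in all_data.items():
    for subtext, cat in text_block.items():
        records.append({
            'query_name': cat['query_name'],
            'query_type': cat['type'],
            'query_descr': cat['descr_de'],
            'pure_code': cat['content'],
        })

df = pd.DataFrame.from_records(records)
df.to_sql('zz_querys', engine, index=False, if_exists='append')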

Roger