8

I'm taking my first steps with Python and SQL databases and am still not sure which package to use and how. I have a list of approximately 300k dictionaries, each with roughly 20 keys. These dicts need to be inserted into an SQL table.

In my opinion, the advantage of the list-of-dicts approach is that I explicitly name the columns into which I want to insert specific values. (It might be that this is not a good approach.)

Let me present a more specific example that captures the essentials of my problem. The table consists of three columns: ID (integer), Price (decimal), and Type (string). Type supports null values.

The keys of my dicts have the same names, and the list of dicts might look like this:

lst = [{'ID':1, 'Price': '9.95', 'Type': None}, 
       {'ID':2, 'Price': '7.95', 'Type': 'Sports'}, 
       {'ID':3, 'Price': '4.95', 'Type': 'Tools'}, ...]

So the questions that arise are the following:

  1. Is the approach using dicts the right one? (Note that I have 20 columns)
  2. Either way: how should one perform such a query efficiently?
  3. Is it necessary to convert the prices to Decimal before building the SQL statement, or can this be achieved 'on the fly'?
  4. Is the None value automatically converted to null, or is there extra work to be done?
Quickbeam2k1
  • What kind of project are you working on? If you're using Python and a database, implementing an MVC framework like Django might be helpful. https://en.wikipedia.org/wiki/Model%E2%80%93view%E2%80%93controller https://www.djangoproject.com/ – Adam Martin Nov 10 '15 at 17:45
  • I'm fetching data from a webscraping service (several json files) and need to extract information that I insert into an MSSQL-Server – Quickbeam2k1 Nov 10 '15 at 17:49
  • I'm assuming this is a one-off then, so implementing an entire framework is not a useful suggestion. Which flavour of SQL are you using? Many DBs support bulk inserts in various formats (meaning writing your webscraped data to a file may be more useful). E.g. https://msdn.microsoft.com/en-CA/library/ms188609.aspx (edit wrong link) – Adam Martin Nov 10 '15 at 17:52
  • I'm doing this on a weekly basis. The server runs on MSSQL. Each week I'm getting roughly 100 files, which I need to preprocess. Only some of the data, though the larger part, shall end up in that specific table. Furthermore, after the data are in the server, I need to do some entity resolution and string matching. Then a new, clean table needs to be created – Quickbeam2k1 Nov 10 '15 at 17:58

3 Answers

5

This assumes you are using a database driver compliant with the Python Database API specification.

Type conversions (questions 3 and 4) should be handled by the database driver out-of-the-box.
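For instance, a compliant driver adapts `Decimal` and `None` automatically when they are passed as query parameters. A minimal sketch (the cursor, table name, and pyformat placeholders are assumptions, matching the snippet below):

from decimal import Decimal

row = {'ID': 1, 'Price': Decimal('9.95'), 'Type': None}

# Decimal binds to the DECIMAL column and None becomes NULL;
# no manual conversion or quoting is needed.
cursor.execute("""
    INSERT INTO
        mytable
        (id, price, type)
    VALUES
        (%(ID)s, %(Price)s, %(Type)s)
""", row)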

As for question 2, there is `executemany()`:

# The %(name)s placeholders are looked up in each dict by key (case-sensitive),
# so they must match the keys in lst exactly.
cursor.executemany("""
    INSERT INTO 
        mytable
        (id, price, type)
    VALUES
        (%(ID)s, %(Price)s, %(Type)s)
""", lst)
alecxe
  • So, in your solution, you would not use a list of dictionaries, but rather a list of tuples. Therefore you need to maintain the order in every tuple, right? – Quickbeam2k1 Nov 10 '15 at 18:01
  • @Quickbeam2k1 nono, I'm leaving `lst` basically as is - a list of dictionaries - you just need to watch that the placeholders inside the query correspond to the keys in the dictionaries. – alecxe Nov 10 '15 at 18:02
  • hmm, so the fact that dictionaries are not ordered is not an issue here? (i.e. %(ID)s refers to the ID key in the dict, and so on?) Will this statement also be a prepared statement? As far as I understood, one needs to specify ? as placeholders in that case. Due to the amount of queries, a prepared statement is definitely preferred – Quickbeam2k1 Nov 10 '15 at 20:41
  • @Quickbeam2k1 yup, the order does not matter. And the query is parameterized - meaning you are safe in terms of security. – alecxe Nov 10 '15 at 21:48
  • but am I also "safe" in terms of speed? A parameterized query is not necessarily a prepared statement, is it? The amount of data I need to insert once per week might double or triple. In that case I want to stress the server as little as possible. However, thank you very much so far! – Quickbeam2k1 Nov 10 '15 at 21:51
  • @Quickbeam2k1 prepared statement = parameterized query, terms are used interchangeably. As for the speed, measure and see if this is really a bottleneck. Thanks! – alecxe Nov 10 '15 at 21:58
  • Sorry for the late reply: I experimented a little bit. Unfortunately, pyodbc and pymssql do not implement the DB-API as I thought they would. 1. pyodbc does not implement pyformat, thus your query does not work there. 2. pymssql supports pyformat but does not prepare parameterized statements. The SQL Server monitor shows that pymssql always sends single queries. Pyodbc uses prepared statements. I also tried SQLAlchemy to "combine" pyformat and prepared statements, but this runs into the same problems as with the packages mentioned above. Other ideas? – Quickbeam2k1 Nov 11 '15 at 19:46
  • @Quickbeam2k1 thanks for coming back with it. You may also try to make a bulk insert through an ORM, for instance `sqlalchemy`, and see what queries would actually be sent to the database. – alecxe Nov 11 '15 at 19:53
  • Hey, maybe you misread it: I tried using SQLAlchemy (core, I think), but there the commands depend on the package you use to communicate with the database, i.e. pyodbc or pymssql. As http://docs.sqlalchemy.org/en/rel_0_8/faq.html suggests, the ORM part of SQLAlchemy might be too slow. – Quickbeam2k1 Nov 11 '15 at 20:12
  • So I did some further research. For pymssql there is currently no working parameterized query; bulk queries simply perform several insert queries. For pyodbc, one should consider issue #62 on GitHub. This seems to resolve some of the performance issues currently appearing when pyodbc performs parameterized queries (each column is prepared again?), but I couldn't test it. My data set contains roughly 400k rows and 30 columns and yields the following results: pyodbc (3.0.10) -> ~900 secs, pyodbc (3.0.10) + sqlalchemy -> 730 secs, pymssql -> 270 secs. Surprisingly, pymssql wins – Quickbeam2k1 Nov 12 '15 at 21:48
  • @Quickbeam2k1 thanks for sharing! Btw, another optimization (probably a micro one) can be to switch to PyPy instead of regular CPython (not sure it would run as is, but give it a try). See also: http://stackoverflow.com/questions/29638136/how-to-speed-up-with-bulk-insert-to-ms-server-from-python-with-pyodbc-from-csv. – alecxe Nov 13 '15 at 21:34
  • @Quickbeam2k1 btw, I think you should consider creating a separate, performance-specific question - describe what you have observed so far and what you are aiming for. I think there is no point in continuing the discussion here in the comments; a separate question would be better, since more people would be there to help. Thanks! – alecxe Nov 13 '15 at 21:36
  • for PostgreSQL's psycopg2, executemany() = many inserts instead of a bulk insert. Check https://stackoverflow.com/questions/8134602/psycopg2-insert-multiple-rows-with-one-query and use psycopg2.extras.execute_values instead (see the sketch after this thread). – Ben Feb 17 '23 at 05:04
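For reference, the `execute_values` approach mentioned in the last comment looks roughly like this (the cursor and table name are assumptions):

from psycopg2.extras import execute_values

# execute_values folds all rows into one multi-row INSERT statement,
# whereas psycopg2's executemany issues a separate INSERT per row.
execute_values(
    cursor,
    "INSERT INTO mytable (id, price, type) VALUES %s",
    [(d['ID'], d['Price'], d['Type']) for d in lst])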
2
mydb = MySQLdb.connect(host='',    # your host name
                       user='',    # your username
                       passwd='',  # your password
                       db=''       # your database
                       )
cur = mydb.cursor()
# The column names refer to the table; the %(name)s placeholders refer to the dict keys.
insert_query = "INSERT INTO table_name (field_1, field_2, field_3) VALUES (%(ID)s, %(Price)s, %(Type)s);"
cur.executemany(insert_query, lst)
mydb.commit()  # commit() must be called; `mydb.commit` alone does nothing
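
Note: MySQLdb supports the `%(name)s` pyformat style with a sequence of dicts, so `executemany()` looks up each value by key. The column list must name your actual table columns, while the placeholder names must match the dict keys exactly.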
shreesh katti
0

Answers to your questions:

  • No problem with using a list of dictionaries.
  • Below is a full application that handles your case.
  • It is not necessary to convert the prices to Decimal: in this example the MySQL column is declared as DECIMAL, while in the list the price appears as a float, a string, and an integer, yet all are saved as decimals.
  • The None value is automatically converted to NULL.
from tkinter import *
from tkinter import messagebox
import mysql.connector as myConnector
from mysql.connector import Error

def insert(table, lst):
    # Convert the list of dictionaries into a list of tuples
    myList = listNestedDictForTblInsert(lst)
    print(myList)

    mySqlStr = f'INSERT INTO {table} (ID, Price, Type) VALUES (%s, %s, %s)'
    print(mySqlStr)
    myDb = myConnector.connect(host='localhost',
                               database='libraryDb2',
                               user='root',
                               password='dhso')
    myCursor = myDb.cursor()
    try:
        myCursor.executemany(mySqlStr, myList)
        myDb.commit()
        messagebox.showinfo("show info", "Data is saved successfully")
    except Error:
        messagebox.showinfo("show info", "Data is not saved")
    finally:
        myCursor.close()  # close the cursor before the connection
        myDb.close()


def listNestedDictForTblInsert(data):
    # Convert the list of dictionaries into a list of tuples;
    # this relies on each dict listing its values in column order.
    return [tuple(d.values()) for d in data]
# Output myList:
# [(1, 9.95, None), (2, '7', 'Sports'), (3, 4, 'Tools')]
           
root = Tk()

# Prices are deliberately a float, a string, and an integer to show
# that the driver converts them all to DECIMAL on insert.
lst = [{'ID': 1, 'Price': 9.95, 'Type': None},
       {'ID': 2, 'Price': '7', 'Type': 'Sports'},
       {'ID': 3, 'Price': 4, 'Type': 'Tools'}]
table = 'test1'
root.title("Testing Part")
btn = Button(root, text="Insert Dictionary in MYSQL Table", width=30,
             command=lambda: insert(table, lst))
btn.pack()
root.mainloop()  # mainloop() must be called for the window to appear

[Screenshot: the application window]

[Screenshot: MySQL table test1 with the stored data]