
I have a Python script that uses the pyodbc module to connect to SQL Server Express 2008. It selects the records in a SQL Server table (over 600,000 rows) and updates a column in that table, using a complicated regex-based function (parseAddress) to extract an address value from two of the table's fields.

When I test this on a TEMP table created from a subset of the data it runs fairly fast: 10,000 rows in 133.8 seconds (74 per second). On the 600,000+ row table, however, it slows down considerably. Interestingly, if I pull the 600,000 records into ArcGIS, export them to a file geodatabase and run parseAddress as a field calculator, it completes extremely quickly. I was trying to avoid the extra processing in Arc if possible, hoping it would be faster to work directly against the database.

Python's memory usage goes from 11 MB (10,000 rows) to about 145 MB when loading the larger table (I have 8 GB of RAM) and doesn't get any higher. I am using fetchall() and iterating through the result because I had trouble iterating the cursor directly: I kept getting a "Previous SQL Query was not a query" error when attempting to apply the SQL update via dbCursor, and uCursor was locked out of updating while dbCursor was live. I am also unable to access named columns (e.g. row.ProvID fails), so there might be an issue with my install, although I grabbed the latest pyodbc version (3.0.7) today.
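The chunked alternative I was trying to get working is roughly the following. It is sketched here with sqlite3 purely so the snippet runs standalone; pyodbc's cursor exposes the same DB-API 2.0 execute()/fetchmany() shape, and this assumes the cursor-conflict error above is resolved first (for example by doing the updates over a second connection):

```python
import sqlite3

# Page the SELECT with fetchmany() instead of fetchall(), so memory
# use stays bounded no matter how large the table gets.
# sqlite3 stands in for pyodbc so the sketch runs without a server.
cnxn = sqlite3.connect(':memory:')
cur = cnxn.cursor()
cur.execute("CREATE TABLE TEMP (ProvID TEXT, Addr1 TEXT, Addr2 TEXT)")
cur.executemany("INSERT INTO TEMP VALUES (?, ?, ?)",
                [(str(i), '123 MAIN ST', 'APT 1') for i in range(25)])

cur.execute("SELECT ProvID, Addr1, Addr2 FROM TEMP")
processed = 0
while True:
    batch = cur.fetchmany(10)  # pull 10 rows per round trip
    if not batch:
        break
    for providerId, addr1, addr2 in batch:
        processed += 1  # parseAddress(addr1, addr2) would go here
```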

I tried the list() approach from Python is slow when iterating over a large list (rows = list(dbCursor.execute(sqlSelect).fetchall())), which was 10% faster on my test table but seems to have a minimal effect on the big table processing.

What can I do to speed this up? Update: I let it run to completion: 667,041 records updated in 102,296.9 seconds (6.5 per second).

Alternatively, if there is some way I could write this in SQL directly, I would love to know. Ideally I would like something that can be run against both SQL Server and Teradata.

import pyodbc as p  #pyodbc must be installed
import re #RegEx library
import time

#function to parse the addresses
def parseAddress(string1, string2):
    regexList = r'\bAPT.([A-Z0-9]*)\b|\bAPT.#([A-Z0-9]*)\b|\bLOT\s([A-Z0-9]*)\b|\bUNIT\s([A-Z0-9]*)\b|\bSUITE\s([A-Z0-9]*)\b|\bSTE\s([A-Z0-9]*)\b|\bSTE([0-9]*)\b|\bPO BOX.([0-9]*)\b|\bP O BOX.([0-9]*)\b|\bPOB\s([0-9]*)|\bP\.O\. BOX.([0-9]*)|\bBOX\s([A-Z0-9]*)\b|\bBOX\b|\bRR\s([A-Z0-9]*)\b|\bRR([A-Z0-9]*)\b|#.([A-Z0-9]*)\b|\((.*)\)'
    intersectionList = r'\bAT\b|\bAND\b|\b@\b|\b&\b'
    searchList = ['FLOOR', 'FLR', 'C/']
    str1 = string1.strip().upper()
    str2 = string2.strip().upper()
    # strip leading "#" only (replace('#','') would remove every "#")
    if str1.startswith('#'):
        str1 = str1[1:]
    if str2.startswith('#'):
        str2 = str2[1:]
    str1 = str1.strip()
    str2 = str2.strip()
    if str1.startswith('ONE '):
        str1 = str1.replace('ONE ','1 ')
    if str2.startswith('ONE '):
        str2 = str2.replace('ONE ','1 ')
    #strip parts we don't want using regexList
    numRuns = regexList.count('|') #number of alternatives in regexList bounds the loop
    for x in range(0, numRuns):
        idxs = re.search(regexList, str1)
        if idxs:
            idx = idxs.span()
            #cut out only the matched span; str.replace would remove all occurrences
            str1 = (str1[:idx[0]] + str1[idx[1]:]).strip()
        idxs = re.search(regexList, str2)
        if idxs:
            idx = idxs.span()
            str2 = (str2[:idx[0]] + str2[idx[1]:]).strip()
    str1 = str1.replace(',','')
    str2 = str2.replace(',','')
    str1 = str1.replace("'","''") #escape single quotes for SQL Server insertion
    str2 = str2.replace("'","''") #escape single quotes for SQL Server insertion
    if str1[:1].isdigit() and len(str1)>6 and not any(word in str1 for word in searchList):
        return str1
    elif str2[:1].isdigit() and len(str2)>6 and not any(word in str2 for word in searchList):
        return str2
    elif re.search(intersectionList, str1) and len(str1)>7: #looks like an intersection
        return str1
    elif re.search(intersectionList, str2) and len(str2)>7:
        return str2
    else:
        if (len(str1) > 5) or (len(str2) > 5):
            return str1 + ' ' + str2
        else:
            return '' #string too short to be a valid address

startTime = time.clock()
#database connection
server = r'PC\SQLEXPRESS' #raw string so the backslash stays literal
database = 'WORKING'
table = 'TEMP'
#connection using windows authentication
cnxnStr = ( r'DRIVER={SQL Server Native Client 10.0};SERVER=' + server + ';DATABASE=' + database + ';' + 'Trusted_Connection=yes')
#connect
cnxn = p.connect(cnxnStr)
#create cursor
dbCursor = cnxn.cursor() #select cursor
uCursor = cnxn.cursor() #update cursor
#sql select statement
sqlUpdate = ''
sqlSelect = ("SELECT ProvID, Addr1, Addr2, AddrToMatch FROM " + table)
print "running: " + sqlSelect
count = 0
rows = list(dbCursor.execute(sqlSelect).fetchall()) #takes about 2 seconds
dbCursor.close() #close select cursor

try:
    for row in rows:
        providerId = row[0]
        outputString = parseAddress(row[1], row[2])
        sqlUpdate = ("UPDATE " + table + " SET AddrToMatch = '" + outputString + "' WHERE ProvID = '" + providerId + "'")
        print str(count) + ' ' + sqlUpdate
        uCursor.execute(sqlUpdate)
        count = count + 1
        cnxn.commit()

except:
    print "Error occurred on iteration " + str(count) + "\n" + sqlUpdate
    raise #continue the error
finally:
    del dbCursor
    del uCursor
    cnxn.close()
    endTime = time.clock()
    elapsedTime = endTime - startTime
    print str(count) + " records updated in " + str(elapsedTime) + " seconds"
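Following the batching suggestion in the comments, the update loop would look roughly like this: a parameterized statement with ? placeholders (the DB-API qmark style, which pyodbc accepts and which also removes the need for the single-quote escaping inside parseAddress), committing once per batch instead of per row. Sketched with sqlite3 so it runs standalone, with parseAddress stubbed out:

```python
import sqlite3

# Standalone sketch of the batched, parameterized update pattern.
# sqlite3 stands in for pyodbc; both use ? (qmark) placeholders.
cnxn = sqlite3.connect(':memory:')
cur = cnxn.cursor()
cur.execute("CREATE TABLE TEMP (ProvID TEXT, Addr1 TEXT, Addr2 TEXT, AddrToMatch TEXT)")
cur.executemany("INSERT INTO TEMP (ProvID, Addr1, Addr2) VALUES (?, ?, ?)",
                [(str(i), '123 MAIN ST', 'APT %d' % i) for i in range(1000)])
cnxn.commit()

BATCH = 500
sqlUpdate = "UPDATE TEMP SET AddrToMatch = ? WHERE ProvID = ?"  # constant SQL text
rows = cur.execute("SELECT ProvID, Addr1, Addr2 FROM TEMP").fetchall()
ucur = cnxn.cursor()
count = 0
for providerId, addr1, addr2 in rows:
    outputString = addr1  # parseAddress(addr1, addr2) would go here
    ucur.execute(sqlUpdate, (outputString, providerId))
    count += 1
    if count % BATCH == 0:
        cnxn.commit()  # commit once per batch instead of per row
cnxn.commit()  # flush any final partial batch
```

Because the SQL text never changes, the driver can reuse the prepared statement across all 600,000+ executes, and a batch size around 500 cuts the per-row commit overhead dramatically.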
    You commit every row, add a counter and only commit every 500 rows or so. Also use prepare statement so the cursor/handler can be cached. – MortenB Aug 28 '17 at 14:31
