
Is there a way to use threads to perform the SQL queries simultaneously so I can cut down on the processing time of my code below? Is there a better method to achieve the same result without using the pandas module? Given the size of the data sets I am working with, I cannot store the entire dataset in memory, and I have found that looping over the rows of a SELECT * FROM statement and comparing them against the list I am querying with adds to the processing time.

# DATABASE layout
#  _____________________________________________________________
# |     id      |         name       |        description       |
# |_____________|____________________|__________________________|
# |        1    |         John       |       Credit Analyst     |
# |        2    |         Jane       |          Doctor          |
# |      ...    |          ...       |            ...           |
# |  5000000    |       Mohammed     |         Dentist          |
# |_____________|____________________|__________________________|

import sqlite3


SEARCH_IDS = [x for x in range(15000)]
DATABASE_NAME = 'db.db'

def chunks(wholeList, chunkSize=999):
    """Yield successive chunkSize-sized chunks from wholeList.

    999 is SQLite's default limit on host parameters per statement
    (SQLITE_MAX_VARIABLE_NUMBER), hence the default chunk size.
    """
    for i in range(0, len(wholeList), chunkSize):
        yield wholeList[i:i + chunkSize]

def search_database_for_matches(listOfIdsToMatch):
    '''Takes a list of ids and returns the matching rows.'''
    conn = sqlite3.connect(DATABASE_NAME)
    cursor = conn.cursor()
    placeholders = ', '.join('?' for _ in listOfIdsToMatch)
    sql = "SELECT id, name, description FROM datatable WHERE id IN ({})".format(placeholders)
    cursor.execute(sql, tuple(listOfIdsToMatch))
    rows = cursor.fetchall()
    conn.close()
    return rows

def arrange(orderOnList, listToBeOrdered, defaultReturnValue='N/A'):
    '''Takes a list of ids in the desired order and a list of tuples whose
       first item is an id. The tuples are arranged into a new list matching
       the order of the source list; ids with no match keep defaultReturnValue.'''
    resultList = [defaultReturnValue for x in orderOnList]
    indexLookUp = {value: key for key, value in enumerate(orderOnList)}
    for item in listToBeOrdered:
        resultList[indexLookUp[item[0]]] = item
    return resultList


def main():
    results=[]
    for chunk in chunks(SEARCH_IDS,999):
        results += search_database_for_matches(chunk)
    results = arrange(SEARCH_IDS,results)
    print(results)


if __name__ == '__main__': main()

1 Answer


Some advice:

Instead of reading the records in chunks with an iterator, you ought to use pagination, for example keyset pagination, which resumes each query after the last id seen rather than re-scanning from the start, as sketched below.
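For example, a minimal keyset-pagination sketch (assuming the datatable schema from the question; the iter_rows_paginated name and page size are illustrative):

import sqlite3

def iter_rows_paginated(database_name, page_size=1000):
    """Yield rows in id order, one page per query, without loading the whole table."""
    conn = sqlite3.connect(database_name)
    cursor = conn.cursor()
    last_id = 0
    while True:
        cursor.execute(
            "SELECT id, name, description FROM datatable "
            "WHERE id > ? ORDER BY id LIMIT ?",
            (last_id, page_size))
        page = cursor.fetchall()
        if not page:
            break
        yield from page
        last_id = page[-1][0]  # resume after the last id seen
    conn.close()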

If you're using multithreading / multiprocessing, make sure your database can support it. See: SQLite And Multiple Threads.
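If you stick with threads rather than processes, note that a default sqlite3 connection may only be used by the thread that created it (check_same_thread=True). A common pattern, sketched here with the standard library's threading.local (the get_connection helper is illustrative), is to give each thread its own private connection:

import sqlite3
import threading

_local = threading.local()

def get_connection(database_name='db.db'):
    """Return this thread's private SQLite connection, creating it on first use."""
    if not hasattr(_local, 'conn'):
        _local.conn = sqlite3.connect(database_name)
    return _local.conn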

To implement what you want, you can use a pool of workers, each working on one chunk. See Using a pool of workers in the Python documentation.

Example:

import multiprocessing

with multiprocessing.Pool(processes=4) as pool:
    chunked_results = pool.map(search_database_for_matches, chunks(SEARCH_IDS, 999))

# pool.map returns one list of rows per chunk; flatten before reordering
results = [row for chunk_rows in chunked_results for row in chunk_rows]
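pool.map preserves the order of its input, so the flattened rows can be passed straight to the arrange helper from the question. Each worker calls search_database_for_matches, which opens its own SQLite connection, so no connection object is shared across processes.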
– Laurent LAPORTE
  • So I am not sure I understand how paging the database records helps in accomplishing the task at hand. The chunking function does not chunk the SQL records but rather the input list that is being used to query the database. Effectively, what I am trying to accomplish is to check the database to see whether a large list of keys exists and, if so, return the corresponding records in the same order as the large list of keys. If a particular key does not exist, I want to return N/A. – Self_Taught_by_Necessity Sep 03 '16 at 18:34
  • The solution you provided is roughly 3x faster! Thank you very much! – Self_Taught_by_Necessity Sep 04 '16 at 13:31