
I'm trying to make a Python program that matches foreign input data to records in my database, which is on another server in my organization.

The issue is that I get millions of records at a time, and the program takes about 4 hours per 1 million records matched, with close to no CPU usage.

I've set up timers on every part of the program to test it, and everything is very fast except the part that interacts with the database through ODBC. How can I use something like multiprocessing or multithreading to speed up the program, so that the loop starts the next iteration instead of stopping to wait for the database response, writing it to a file, and only then continuing?

Basically, when an iteration of the loop hits the wait for the database to respond, I need the next iteration to start immediately; once the response arrives, that iteration goes back and writes the fetched data to a file (or appends it to a list), while other iterations of the same loop are also waiting on their own DB responses.

Thanks in advance.

This is what the code looks like right now

import sys

import ibm_db

def run_query(DbConn, match_data):
    for record in match_data:
        # Prepare and bind for each input record, then fetch the first matching row.
        stmt = ibm_db.prepare(DbConn, "select trim(Name) || ' ' || trim(FTHFNAME) || ' ' || trim(FTHSNAME) || ' ' || trim(FTHTNAME) AS NAME, BIRTHDATE, IDNUMSTAT from PEOPLE where NAME = ? with ur", {ibm_db.SQL_ATTR_CURSOR_TYPE: 3})
        ibm_db.bind_param(stmt, 1, record['NAME'])
        try:
            ibm_db.execute(stmt)
        except Exception:
            print(ibm_db.stmt_errormsg(stmt))
            ibm_db.close(DbConn)
            sys.exit("Closed due to an error")
        row = ibm_db.fetch_assoc(stmt)
        if row:
            print("Matched")
            print_row_to_csv(row)
        else:
            print("Not matched")
            print_row_to_csv({"Not Matched": "Not found"})
        
  • https://stackoverflow.com/questions/25889268/running-same-function-for-multiple-files-in-parallel-in-python — this could probably help – venkata krishnan Aug 28 '23 at 04:48

1 Answer


Instead of executing the query in a loop for each individual record, make bundles of (let's say) 10,000 or more values and execute them together in a single query, then run that query once per bundle. That lets you process the data in large chunks instead of hitting the database thousands of times. Also, prepare the query once outside the loop if you can, and bind the parameters inside the loop.
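
A minimal sketch of the bundling idea, assuming the same PEOPLE table and ibm_db connection as in the question; the batch size of 10,000 is just the figure mentioned above, print_row_to_csv is the helper from the question, and none of this has been run against your actual schema:

import ibm_db

def run_query_batched(DbConn, match_data, batch_size=10000):
    names = [record['NAME'] for record in match_data]
    for start in range(0, len(names), batch_size):
        batch = names[start:start + batch_size]
        # One parameter marker per name in the bundle: "?, ?, ?, ..."
        placeholders = ", ".join("?" * len(batch))
        sql = ("select trim(Name) || ' ' || trim(FTHFNAME) || ' ' || trim(FTHSNAME) || ' ' || trim(FTHTNAME) AS NAME, "
               "BIRTHDATE, IDNUMSTAT from PEOPLE "
               "where NAME in (" + placeholders + ") with ur")
        stmt = ibm_db.prepare(DbConn, sql)
        ibm_db.execute(stmt, tuple(batch))
        row = ibm_db.fetch_assoc(stmt)
        while row:
            print_row_to_csv(row)  # matched rows only; unmatched names simply don't come back
            row = ibm_db.fetch_assoc(stmt)

Note that the IN form loses the per-input "Not matched" row: if you need that, compare the names returned for each bundle against the names you sent and write the missing ones yourself. Very large IN lists can also hit statement limits on the server, so you may need to tune the bundle size.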

  • It is a select statement with where name = {name}, not an insert statement, so I guess I can't bundle data for select statements. – Mahmoud Khaled Sayed Aug 28 '23 at 17:08
  • You can make bundles of the match data and then loop over those bundles. In the line "for record in match_data:", after bundling the match data, instead of executing the query for a single name with where NAME = ?, you could search for the whole bundle with where NAME in (name_bundle). You can also use greenlets and async functions so you don't stop to wait for a response. – Aliz Acharya Aug 29 '23 at 07:38
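
For the "don't stop waiting for a response" part, here is a rough thread-pool sketch as one alternative to the greenlets/async idea in the comment above. It assumes the same schema as the question; CONN_STR is a placeholder, the worker count of 8 is arbitrary, each worker gets its own connection and prepared statement rather than sharing one, and whether the waits actually overlap depends on the driver releasing the GIL during the network call:

import concurrent.futures
import threading

import ibm_db

CONN_STR = "DATABASE=...;HOSTNAME=...;PORT=...;UID=...;PWD=..."  # placeholder

SQL = ("select trim(Name) || ' ' || trim(FTHFNAME) || ' ' || trim(FTHSNAME) || ' ' || trim(FTHTNAME) AS NAME, "
       "BIRTHDATE, IDNUMSTAT from PEOPLE where NAME = ? with ur")

thread_local = threading.local()

def get_stmt():
    # One connection and one prepared statement per worker thread.
    if not hasattr(thread_local, "stmt"):
        conn = ibm_db.connect(CONN_STR, "", "")
        thread_local.stmt = ibm_db.prepare(conn, SQL)
    return thread_local.stmt

def match_one(record):
    stmt = get_stmt()
    ibm_db.execute(stmt, (record['NAME'],))
    row = ibm_db.fetch_assoc(stmt)
    return row if row else {"Not Matched": "Not found"}

def run_query_threaded(match_data, workers=8):
    with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as pool:
        # map() yields results in input order, so the output file stays aligned
        # with the input even though the queries run concurrently.
        for row in pool.map(match_one, match_data):
            print_row_to_csv(row)  # helper from the question

The two ideas combine: instead of handing each worker a single name, you could hand it a whole bundle and run the IN-style query from the previous sketch inside match_one.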