
I have 1000 tables, and here is what I need to do:

  1. Extract data from table x
  2. Do some Calculation based on the data
  3. Save results in one big table

Here is my code:

import pandas as pd

table_names = pd.read_sql("SELECT table_name FROM information_schema.tables WHERE table_schema = 'some_schema';", some_engine)
table_names = table_names.iloc[:, 0].to_numpy()

def count_something(database_r, database_w, tables):
    for table in tables:

        overall_count = pd.read_sql('''
            SELECT COUNT(DISTINCT some_value) AS countOverall
            FROM some_schema.{0} a
            WHERE some_conditions'''.format(table),
            con=database_r)

        second_overall_count = pd.read_sql('''
            SELECT COUNT(DISTINCT some_other_value) AS countOtherOverall
            FROM some_schema.{0} a
            WHERE some_other_conditions'''.format(table),
            con=database_r)

        if overall_count.empty:
            print('DataFrame is empty:', table)
            continue

        df = pd.concat([overall_count, second_overall_count], axis=1)

        df.to_sql(name='BigTable', con=database_w, if_exists='append')

count_something(database_read, data_write, table_names)

The code works correctly, but it is slow (one table per iteration). My question: is it possible to speed up the code by using parallelization? Any approaches and recommendations are welcome.

Daniel Yefimov
  • I would use the parallel map() function from the multiprocessing package: https://stackoverflow.com/questions/1704401/is-there-a-simple-process-based-parallel-map-for-python – Nick ODell Feb 17 '22 at 15:52
  • PS: I would also see if your two SQL queries can be rewritten as a single query, e.g. `SELECT COUNT(DISTINCT some_value) AS count1, COUNT(DISTINCT some_other_value) AS count2 FROM ...`. Doing it in a single pass would likely be faster. – Nick ODell Feb 17 '22 at 15:54

1 Answer


If you are looking to get just one value (a count) from each SQL query, then it likely makes more sense to let the SQL database do the hard work and bring the combined result back into pandas.

table_names = pd.read_sql("SELECT table_name FROM information_schema.tables WHERE table_schema = 'some_schema';", some_engine)
table_names = table_names.iloc[:, 0].to_numpy()

def count_something(database_r, database_w, tables):
    count_query = ""
    for table in tables:
        count_query += (
            f"SELECT 'CountOverall' AS metric, '{table}' AS table_name, "
            f"COUNT(DISTINCT some_value) AS cnt "
            f"FROM some_schema.{table} a "
            f"WHERE some_conditions "
            f"UNION "
            f"SELECT 'Count2Overall', '{table}', "
            f"COUNT(DISTINCT some_other_value) "
            f"FROM some_schema.{table} a "
            f"WHERE some_other_conditions "
            f"UNION "
        )

    count_query = count_query[:-len("UNION ")]  # Remove the final 'UNION' off the query

    df = pd.read_sql(count_query, con=database_r)

This way you build a single query that executes against all the tables at once in the database, and your results look like:

metric          table_name   cnt
CountOverall    table1         6
Count2Overall   table1        12
CountOverall    table2         5
Count2Overall   table2         4

...
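If one row per table is preferred before appending to `BigTable`, the long result above can be reshaped with a pivot. A minimal sketch, assuming the query aliases its three columns as `metric`, `table_name`, and `cnt` (the values below are illustrative):

```python
import pandas as pd

# Long-format result as returned by the UNION query (illustrative values)
df = pd.DataFrame({
    "metric": ["CountOverall", "Count2Overall", "CountOverall", "Count2Overall"],
    "table_name": ["table1", "table1", "table2", "table2"],
    "cnt": [6, 12, 5, 4],
})

# One row per table, one column per metric
wide = df.pivot(index="table_name", columns="metric", values="cnt").reset_index()
print(wide)
```

The `wide` frame can then be written out once with `wide.to_sql('BigTable', con=some_engine, if_exists='append')` instead of one `to_sql` call per table.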

MichaelD