I have 1000 tables, and for each table I need to:
- Extract data from table x
- Do some calculations based on that data
- Save the results in one big table

Here is my code:
import pandas as pd

# database_read / database_write are SQLAlchemy engines created elsewhere
table_names = pd.read_sql(
    "SELECT table_name FROM information_schema.tables WHERE table_schema = 'some_schema';",
    database_read)
table_names = table_names.iloc[:, 0].to_numpy()

def count_something(database_r, database_w, tables):
    for table in tables:
        # First count for this table
        overall_count = pd.read_sql('''
            SELECT COUNT(DISTINCT some_value) AS countOverall
            FROM some_schema.{0} a
            WHERE some_conditions'''.format(table), database_r)
        # Second count, aliased differently so the concatenated
        # frame does not end up with duplicate column names
        second_overall_count = pd.read_sql('''
            SELECT COUNT(DISTINCT some_other_value) AS countOverallSecond
            FROM some_schema.{0} a
            WHERE some_other_conditions'''.format(table), database_r)
        if overall_count.empty:
            print('DataFrame is empty:', table)
            continue
        df = pd.concat([overall_count, second_overall_count], axis=1)
        df.to_sql(name='BigTable', con=database_w, if_exists='append')

count_something(database_read, database_write, table_names)
The code works correctly, but it is slow (one table per iteration). My question: is it possible to speed this up with parallelization? Any approaches and recommendations are welcome.
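To make the question concrete, here is the kind of approach I am considering: a minimal, untested sketch that parallelizes the per-table reads with threads (the work is I/O-bound, waiting on the database, so the GIL should not be the bottleneck) and does one bulk write at the end to avoid concurrent appends to BigTable. The max_workers value and the reuse of the engines across threads are assumptions on my part (SQLAlchemy engines do hand out pooled connections to multiple threads).

import pandas as pd
from concurrent.futures import ThreadPoolExecutor

def process_one_table(table):
    # Same two count queries as above, for a single table;
    # each call checks a connection out of the engine's pool.
    overall_count = pd.read_sql('''
        SELECT COUNT(DISTINCT some_value) AS countOverall
        FROM some_schema.{0} a
        WHERE some_conditions'''.format(table), database_read)
    second_overall_count = pd.read_sql('''
        SELECT COUNT(DISTINCT some_other_value) AS countOverallSecond
        FROM some_schema.{0} a
        WHERE some_other_conditions'''.format(table), database_read)
    if overall_count.empty:
        return None
    return pd.concat([overall_count, second_overall_count], axis=1)

# max_workers = 8 is a guess; it should be tuned to what the
# database can handle.
with ThreadPoolExecutor(max_workers=8) as executor:
    results = [r for r in executor.map(process_one_table, table_names)
               if r is not None]

# One bulk insert instead of 1000 small appends.
if results:
    pd.concat(results, ignore_index=True).to_sql(
        name='BigTable', con=database_write, if_exists='append')

Is this a reasonable direction, or would multiprocessing (or pushing the whole aggregation into a single SQL query) be a better fit?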