I have a pandas.Series of business dates called s_dates
. I want to pass each of these dates (together with some other hyper-parameters) to a function called func_sql_to_df
which formats an SQL-query and then returns a pandas.DataFrame. Finally, all of the DataFrames should be concatenated (appended) into a single pandas.DataFrame called df_summary
where the business date is the identifier.
From here I need to do two things:
- export
df_summary
to an Excel sheet or csv-file. - group
df_summary
by the dates and then apply another function calledfunc_analysis
to each column.
My attempt is something like this
df_summary = pd.concat(list(
s_dates.apply(func_sql_to_df, args=hyper_param)
))
df_summary.groupby('dates').apply(func_analysis)
# Export data
...
However, the first statement where df_summary is defined takes quite long. There are a total of 250 dates where the first couple of iterations takes approximately 3 seconds but it increases to over 3 minutes after about 100 iterations (and continues to do so). All of the SQL-queries take more or less the same time to execute individually and the resulting dataframes all have the same number of observations.
I want to increase the performance of this setup, but I am already not using any loops (only apply-functions) and the SQL-query has already been optimized a lot. Any suggestions?
Update: If I am not mistaken then my attempt is actually the suggested solution as stated in the accepted answer to this post.
Update2: My SQL-query looks something like this. I do not know if all the dates can be passed at ones as the conditions specified in the WHERE-statement must hold for each passed value in dates.
select /*+ parallel(auto) */
MY_DATE as EOD_DATE -- These are all the elements in 'DATES' passed
, Var2
, Var3
, ColA
, ColB
, ...
, ColN
from Database1
where
Var2 in (select Var2 from Datebase2 where update_time < MY_DATE) -- Cond1
and Var3 in (select Var3 from DataBase3 where EOD_DATE = MY_DATE) -- Cond2
and cond3
and cond4
...
Running the query for any date in dates on its own seems to take around 2-8 seconds. However, as mentioned some of the iterations in the apply-function takes more than 3 minutes.