
I have a pandas.Series of business dates called s_dates. I want to pass each of these dates (together with some other hyper-parameters) to a function called func_sql_to_df which formats an SQL-query and then returns a pandas.DataFrame. Finally, all of the DataFrames should be concatenated (appended) into a single pandas.DataFrame called df_summary where the business date is the identifier.

From here I need to do two things:

  1. export df_summary to an Excel sheet or csv-file.
  2. group df_summary by the dates and then apply another function called func_analysis to each column.
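These two steps might be sketched as follows (a minimal sketch on made-up data; the column names and the body of `func_analysis` are placeholders, assuming the business date ends up in a column called `dates`):

```python
import pandas as pd

# Toy stand-in for df_summary, with the business date as the identifier column.
df_summary = pd.DataFrame({
    "dates": ["2023-01-02", "2023-01-02", "2023-01-03"],
    "ColA": [1.0, 2.0, 3.0],
    "ColB": [4.0, 5.0, 6.0],
})

# 1. Export to a csv-file (or df_summary.to_excel("df_summary.xlsx") for Excel).
df_summary.to_csv("df_summary.csv", index=False)

# 2. Group by the dates and apply a function to each column of each group.
def func_analysis(col):
    return col.mean()  # placeholder for the real analysis

result = df_summary.groupby("dates").agg(func_analysis)
print(result)
```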

My attempt is something like this

df_summary = pd.concat(list(
    s_dates.apply(func_sql_to_df, args=(hyper_param,))  # args must be a tuple of extra positional arguments
))
df_summary.groupby('dates').apply(func_analysis)

# Export data
...

However, the first statement, where df_summary is defined, takes quite long. There are 250 dates in total; the first couple of iterations take approximately 3 seconds each, but this increases to over 3 minutes per iteration after about 100 iterations (and keeps growing). All of the SQL queries take more or less the same time to execute individually, and the resulting DataFrames all have the same number of observations.

I want to increase the performance of this setup, but I am not using any explicit loops (only apply functions) and the SQL query has already been optimized a lot. Any suggestions?

Update: If I am not mistaken, my attempt is actually the solution suggested in the accepted answer to this post.

Update2: My SQL query looks something like this. I do not know whether all the dates can be passed at once, as the conditions specified in the WHERE clause must hold for each passed value in dates.

select /*+ parallel(auto) */ 
         MY_DATE as EOD_DATE -- These are all the elements in 'DATES' passed
       , Var2
       , Var3
       , ColA
       , ColB
       , ...
       , ColN
from Database1
where
    Var2 in (select Var2 from Database2 where update_time < MY_DATE) -- Cond1
and Var3 in (select Var3 from Database3 where EOD_DATE = MY_DATE) -- Cond2
and cond3
and cond4
...

Running the query for any single date in dates on its own seems to take around 2-8 seconds. However, as mentioned, some of the iterations in the apply call take more than 3 minutes.
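Based on the comments below (the queries run through `pyodbc` and `pandas.read_sql`), `func_sql_to_df` presumably formats the date into the SQL string. A hedged sketch of what a per-date call could look like with ODBC parameter binding instead of string formatting (the table and column names are placeholders, not the real query, and `con` is assumed to be an open pyodbc connection):

```python
import pandas as pd

# Placeholder template; "?" is the ODBC-style bind marker used by pyodbc.
SQL_TEMPLATE = """
select /*+ parallel(auto) */ ? as EOD_DATE, ColA, ColB
from Database1
where Var2 in (select Var2 from Database2 where update_time < ?)
"""

def func_sql_to_df(eod_date, con):
    # The same date is bound to every "?" placeholder that references it.
    return pd.read_sql(SQL_TEMPLATE, con=con, params=[eod_date, eod_date])
```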

  • Is `func_sql_to_df` actually executing an SQL query as opposed to just "formatting"? – jqurious Feb 14 '23 at 11:38
  • It is both formatting and executing the query. Unfortunately, it is not feasible to extract all the data from a single call/execution of the query. That's why I have to pass all of the dates "one by one". – Landscape Feb 14 '23 at 11:46
  • I see, so it sounds like you are trying to execute all of those queries "at the same time" / "in parallel"? – jqurious Feb 14 '23 at 12:01
  • Yes, that is correct. – Landscape Feb 14 '23 at 12:03
  • You'll probably have to share more details about `func_sql_to_df` e.g. What database are you using? How exactly are you connecting/running the queries? `pd.read_sql_query()`, etc. – jqurious Feb 14 '23 at 12:21
  • I don't really know much about the database itself other than the fact that it is an Oracle SQL database which I connect to using `con=pyodbc.connect(...)`, which I pass to `pandas.read_sql(sql, con=con)`. I cannot share much detail about the query nor the database (my company's policy), but the query starts with a select statement which has `/*+ parallel(auto) */ ColA, ColB, ..., ColN`. In terms of using `pandas.concat(...)` as in my code example - is this a reasonable approach or should I try something else? – Landscape Feb 14 '23 at 12:43
  • Actually there might exist a way of passing all of the dates at once to the query - I just do not know if it is possible. (This is update 2) – Landscape Feb 14 '23 at 13:13
  • Nice to see you figured it all out. – jqurious Feb 15 '23 at 10:08

1 Answer


It turns out that using pandas.concat(...) with a pandas.Series.apply(...) as the argument, as in my setup above, is really slow. I compared the results against a plain for-loop, which gives roughly 10x faster performance.

# ~10x faster
dfs = []
for d in s_dates:
    dfs.append(func_sql_to_df(d, hyper_param))
df_summary = pd.concat(dfs)  # It is very important that the concat is outside the for-loop
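The comment about keeping `pd.concat` outside the loop matters because concatenating inside the loop re-copies all previously accumulated rows on every iteration, which is quadratic in total. A small sketch of the two patterns on toy frames:

```python
import pandas as pd

frames = [pd.DataFrame({"x": [i]}) for i in range(5)]

# Anti-pattern: concat inside the loop copies the whole accumulated
# frame on every iteration (quadratic total work).
acc = frames[0]
for f in frames[1:]:
    acc = pd.concat([acc, f])

# Preferred: collect the pieces in a list and concatenate once.
df = pd.concat(frames, ignore_index=True)

print(len(df))  # 5 rows either way; only the amount of copying differs
```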

This can even be run in parallel for a further speed-up:

# ~10x * (n_jobs) times faster
from joblib import Parallel, delayed
df_summary = pd.concat(
    Parallel(n_jobs=-1)(delayed(func_sql_to_df)(d, hyper_param) for d in s_dates)
)
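Since the per-date work is dominated by waiting on the database (I/O-bound rather than CPU-bound), joblib's thread backend may also be worth trying: `prefer="threads"` avoids spawning worker processes and pickling the arguments and result frames. A self-contained sketch with a stand-in for `func_sql_to_df`:

```python
from joblib import Parallel, delayed
import pandas as pd

# Stand-in for the real func_sql_to_df, which executes an SQL query per date.
def func_sql_to_df(d, hyper_param):
    return pd.DataFrame({"dates": [d], "value": [hyper_param]})

dates = ["2023-01-02", "2023-01-03", "2023-01-04"]
hyper_param = 1

# prefer="threads" uses a thread pool instead of worker processes, which
# suits I/O-bound database calls and keeps the DataFrames in one process.
df_summary = pd.concat(
    Parallel(n_jobs=-1, prefer="threads")(
        delayed(func_sql_to_df)(d, hyper_param) for d in dates
    ),
    ignore_index=True,
)
print(len(df_summary))  # 3
```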