I am trying to convert a function that generates a Boolean index based on a date and a name so that it works with Numba, but I get an error.
My project starts with a DataFrame TS_Flujos structured as follows:
    Fund name   Date     Var Commitment   Cash flow
    Fund 1      Date 1   100              -20
    Fund 1      Date 2   10               -2
    Fund 1      Date 3   0                +10
    Fund 2      Date 3   100              0
    Fund 2      Date 4   0                -10
    Fund 3      Date 2   100              -20
    Fund 3      Date 3   20               30
Each line is a cash flow of a specific fund. For each line I need to calculate the cumulative commitment to date and subtract the amount funded to date, which gives the "unfunded" amount. To do that, I iterate over the DataFrame TS_Flujos, identify the fund and date of the current row, and use a Boolean index to identify the other "relevant rows" in the DataFrame, i.e. the rows of the same fund with dates on or before the current one, with the following function:
    def date_and_fund(date, fund, c_dates, c_funds):
        i1 = (c_dates <= date)
        i2 = (c_funds == fund)
        result = i1 & i2
        return result
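To make the mask concrete, here is a small sketch of what date_and_fund returns for the second row of the sample table. The integer dates (1 to 4 standing in for Date 1 to Date 4) and the array contents are illustrative assumptions, not the real data.

```python
import numpy as np


def date_and_fund(date, fund, c_dates, c_funds):
    i1 = c_dates <= date   # rows dated on or before the given date
    i2 = c_funds == fund   # rows belonging to the given fund
    return i1 & i2


# Toy columns mirroring the sample table (assumed values, integer dates)
n_funds = np.array(["Fund 1", "Fund 1", "Fund 1", "Fund 2", "Fund 2", "Fund 3", "Fund 3"])
n_dates = np.array([1, 2, 3, 3, 4, 2, 3])
n_varCommitment = np.array([100, 10, 0, 100, 0, 100, 20])

# Row 1 is (Fund 1, Date 2): only the first two rows are "relevant"
mask = date_and_fund(n_dates[1], n_funds[1], n_dates, n_funds)
print(mask.tolist())                 # [True, True, False, False, False, False, False]
print(n_varCommitment[mask].sum())   # 110
```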
And I run the following loop:
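    n_flujos = TS_Flujos.to_numpy()
    for index in range(len(n_flujos)):
        f = n_dates[index]   # date of the current row
        p = n_funds[index]   # fund of the current row
        # Boolean mask: same fund, date on or before the current row's date
        date_fund = date_and_fund(f, p, n_dates, n_funds)
        TS_Flujos['Cumulated commitment'].values[index] = n_varCommitment[date_fund].sum()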
This is a simplification: I also have to segregate the cash flows by type and calculate many other indicators for each row. For now I have 44,000 rows, but this number should increase a lot in the future, and this loop already takes 1 to 2 minutes depending on the computer. I am worried about the speed once the cash flow database grows tenfold, and this is only a small part of the total project. I have tried to understand how to use your previous answer to optimize it, but I can't find a way to vectorize it or use a list comprehension here.
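One possible way to avoid the row-by-row loop, sketched under assumptions: if all that is needed per row is the sum of Var Commitment for the same fund up to and including the row's date, it can be written as a per-fund running total. The column names follow the sample table, the integer dates are stand-ins for real dates, and per_day, running, totals and result are hypothetical names. Whether this extends to the other per-row indicators is not covered here.

```python
import pandas as pd

# Toy frame mirroring the sample table (assumed values, integer dates)
TS_Flujos = pd.DataFrame({
    "Fund name": ["Fund 1", "Fund 1", "Fund 1", "Fund 2", "Fund 2", "Fund 3", "Fund 3"],
    "Date": [1, 2, 3, 3, 4, 2, 3],
    "Var Commitment": [100, 10, 0, 100, 0, 100, 20],
    "Cash flow": [-20, -2, 10, 0, -10, -20, 30],
})

# Sum per (fund, date) first, so rows sharing a fund and a date all get the
# same "date <= current date" total, as in the loop version
per_day = TS_Flujos.groupby(["Fund name", "Date"], sort=True)["Var Commitment"].sum()

# Running total inside each fund, in ascending date order
running = per_day.groupby(level="Fund name").cumsum()

# Bring the totals back onto the original rows (a left merge keeps row order)
totals = running.rename("Cumulated commitment").reset_index()
result = TS_Flujos.merge(totals, on=["Fund name", "Date"], how="left")

print(result["Cumulated commitment"].tolist())
# [100, 110, 110, 100, 100, 100, 120]
```

Grouping by fund and date before the cumulative sum is a deliberate choice: a plain per-fund cumsum on the sorted rows would give different values than the `<=` mask when a fund has several rows on the same date.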
Because there is no dependency between the calculations for different rows, I tried to parallelize the code with Numba.
    from numba import njit, prange

    @njit(parallel=True)
    def cashflow(array_cashflows):
        for index in prange(len(array_cashflows)):
            f = n_dates[index]
            p = n_funds[index]
            date_fund = date_and_fund(f, p, n_dates, n_funds)
            TS_Flujos['Cumulated commitment'].values[index] = n_varCommitment[date_fund].sum()
        return

    cashflow(n_dates)
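But I get the following error (the traceback shows the original names from my script: flujos, date_and_pos, n_fecha and n_idpos correspond to cashflow, date_and_fund, n_dates and n_funds above):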
    Traceback (most recent call last):
      File "<input>", line 1, in <module>
      File "C:\Program Files\JetBrains\PyCharm 2020.1.3\plugins\python\helpers\pydev\_pydev_bundle\pydev_umd.py", line 197, in runfile
        pydev_imports.execfile(filename, global_vars, local_vars)  # execute the script
      File "C:\Program Files\JetBrains\PyCharm 2020.1.3\plugins\python\helpers\pydev\_pydev_imps\_pydev_execfile.py", line 18, in execfile
        exec(compile(contents+"\n", file, 'exec'), glob, loc)
      File "C:/Users/ferna/OneDrive/Python/Datalts/Dataltsweb.py", line 347, in <module>
        flujos(n_fecha)
      File "C:\Users\ferna\venv\lib\site-packages\numba\core\dispatcher.py", line 415, in _compile_for_args
        error_rewrite(e, 'typing')
      File "C:\Users\ferna\venv\lib\site-packages\numba\core\dispatcher.py", line 358, in error_rewrite
        reraise(type(e), e, None)
      File "C:\Users\ferna\venv\lib\site-packages\numba\core\utils.py", line 80, in reraise
        raise value.with_traceback(tb)
    numba.core.errors.TypingError: Failed in nopython mode pipeline (step: nopython frontend)
    Untyped global name 'date_and_pos': cannot determine Numba type of <class 'function'>
    File "Dataltsweb.py", line 324:
    def flujos(array_flujos):
        <source elided>
            p = n_idpos[index]
            fecha_pos = date_and_pos(f, p, n_fecha, n_idpos)
            ^
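The "Untyped global name" message indicates that the helper called inside the jitted function is a plain Python function, which nopython mode cannot type. A minimal sketch of one way around it, under assumptions: compile the helper with @njit as well, pass the arrays in as arguments, and return the results in a NumPy array instead of writing into the DataFrame (pandas objects are not supported in nopython mode). The name cumulated_commitment, the integer-encoded dates and fund codes, and the toy data are all hypothetical. The try/except fallback is only there so the sketch can be tried without Numba installed.

```python
import numpy as np

try:
    from numba import njit, prange
except ImportError:
    # Assumption for this sketch only: run as plain Python (no speed-up)
    prange = range

    def njit(*args, **kwargs):
        if len(args) == 1 and callable(args[0]) and not kwargs:
            return args[0]           # used as @njit
        return lambda func: func     # used as @njit(parallel=True)


@njit
def date_and_fund(date, fund, c_dates, c_funds):
    # Same mask as the original helper, now compiled so jitted code can call it
    return (c_dates <= date) & (c_funds == fund)


@njit(parallel=True)
def cumulated_commitment(dates, funds, var_commitment):
    # Arrays arrive as arguments; results are collected in a NumPy array
    out = np.empty(len(dates), dtype=np.float64)
    for index in prange(len(dates)):
        mask = date_and_fund(dates[index], funds[index], dates, funds)
        out[index] = var_commitment[mask].sum()
    return out


# Toy data mirroring the sample table (assumed values)
fund_names = np.array(["Fund 1", "Fund 1", "Fund 1", "Fund 2", "Fund 2", "Fund 3", "Fund 3"])
_, fund_codes = np.unique(fund_names, return_inverse=True)  # one integer code per fund
dates = np.array([1, 2, 3, 3, 4, 2, 3], dtype=np.int64)
var_commitment = np.array([100, 10, 0, 100, 0, 100, 20], dtype=np.float64)

result = cumulated_commitment(dates, fund_codes, var_commitment)
print(result.tolist())  # [100.0, 110.0, 110.0, 100.0, 100.0, 100.0, 120.0]
```

The returned array can then be assigned to the DataFrame column outside the jitted function. Note that this still compares every row against every other row, so the work grows quadratically with the number of rows even when spread across cores.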