
I am trying to convert a function that generates a Boolean index based on a date and a name so that it works with Numba, but I get an error.

My project starts with a DataFrame, TS_Flujos, structured as follows.

Fund name  Date    Var Commitment  Cash flow
Fund 1     Date 1  100             -20
Fund 1     Date 2  10              -2
Fund 1     Date 3  0               +10
Fund 2     Date 3  100             0
Fund 2     Date 4  0               -10
Fund 3     Date 2  100             -20
Fund 3     Date 3  20              30

Each line is a cash flow of a specific fund. For each line I need to calculate the cumulated commitment to date and subtract the amount funded to date, which gives the "unfunded" amount. To do that, I iterate over the DataFrame TS_Flujos, identify the fund and date of the current row, and use a Boolean index to select the other "relevant rows" in the DataFrame, i.e. the rows of the same fund with dates up to and including the current one, with the following function:

def date_and_fund(date, fund, c_dates, c_funds):
    i1 = (c_dates <= date)
    i2 = (c_funds == fund)
    result = i1 & i2
    return result

And I run the following loop:

# n_dates, n_funds, n_varCommitment: NumPy arrays taken from the matching columns (definitions not shown)
n_flujos = TS_Flujos.to_numpy()
for index in range(len(n_flujos)):
    f = n_dates[index]
    p = n_funds[index]
    date_fund = date_and_fund(f, p, n_dates, n_funds)

    TS_Flujos['Cumulated commitment'].values[index] = n_varCommitment[date_fund].sum()

This is a simplification: I also have to segregate the cash flows by type and calculate many other indicators for each row. For now I have 44,000 rows, but this number should increase a lot in the future, and this loop already takes 1 to 2 minutes depending on the computer. I am worried about the speed once the cash flow database grows tenfold, and this is only a small part of the total project. I have tried to understand how to use your previous answer to optimize it, but I can't find a way to vectorize it or use a list comprehension here.

Because there is no dependency between the calculations of different rows, I tried to parallelize the code with Numba.

from numba import njit, prange

@njit(parallel=True)
def cashflow(array_cashflows):

    for index in prange(len(array_cashflows)):

        f = n_dates[index]
        p = n_funds[index]
        date_fund = date_and_fund(f, p, n_dates, n_funds)

        TS_Flujos['Cumulated commitment'].values[index] = n_varCommitment[date_fund].sum()
    return

cashflow(n_dates)

But I get the following error:

Traceback (most recent call last):
  File "<input>", line 1, in <module>
  File "C:\Program Files\JetBrains\PyCharm 2020.1.3\plugins\python\helpers\pydev\_pydev_bundle\pydev_umd.py", line 197, in runfile
    pydev_imports.execfile(filename, global_vars, local_vars)  # execute the script
  File "C:\Program Files\JetBrains\PyCharm 2020.1.3\plugins\python\helpers\pydev\_pydev_imps\_pydev_execfile.py", line 18, in execfile
    exec(compile(contents+"\n", file, 'exec'), glob, loc)
  File "C:/Users/ferna/OneDrive/Python/Datalts/Dataltsweb.py", line 347, in <module>
    flujos(n_fecha)
  File "C:\Users\ferna\venv\lib\site-packages\numba\core\dispatcher.py", line 415, in _compile_for_args
    error_rewrite(e, 'typing')
  File "C:\Users\ferna\venv\lib\site-packages\numba\core\dispatcher.py", line 358, in error_rewrite
    reraise(type(e), e, None)
  File "C:\Users\ferna\venv\lib\site-packages\numba\core\utils.py", line 80, in reraise
    raise value.with_traceback(tb)
numba.core.errors.TypingError: Failed in nopython mode pipeline (step: nopython frontend)
Untyped global name 'date_and_pos': cannot determine Numba type of <class 'function'>
File "Dataltsweb.py", line 324:
def flujos(array_flujos):
    <source elided>
        p = n_idpos[index]
        fecha_pos = date_and_pos(f, p, n_fecha, n_idpos)
        ^
FernandoP

1 Answer


Given the way that you have structured your code, you won't gain any performance by using Numba. You're applying the decorator to a function that is already vectorized and will already run fast. What would make sense is to try to speed up the main loop, not just CapComp_MO.

As for the error, it seems to be related to types. Try adding explicit typing to see if it solves the issue; here are Numba's datatypes for datetime objects.
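Another possible reading of the message "Untyped global name ...: cannot determine Numba type of <class 'function'>" is that the jitted loop calls a plain Python function, which nopython mode cannot type. Below is a minimal sketch of a layout that avoids this, under several assumptions that are not in the question: the helper is decorated with @njit as well, the arrays are passed in as arguments instead of being read from globals or from the DataFrame, dates are encoded as integers, and funds are encoded as integer codes. The name cumulated_commitment and the sample values (taken from the table in the question) are only illustrative. The try/except fallback is there so the sketch still runs, unaccelerated, where Numba is not installed.

```python
import numpy as np

try:
    from numba import njit, prange
except ImportError:
    # Fallback (assumption): plain Python stand-ins when Numba is missing.
    def njit(*args, **kwargs):
        if args and callable(args[0]):
            return args[0]
        return lambda func: func
    prange = range


@njit
def date_and_fund(date, fund, c_dates, c_funds):
    # Same logic as in the question; jitted so the caller can type it.
    return (c_dates <= date) & (c_funds == fund)


@njit(parallel=True)
def cumulated_commitment(dates, funds, var_commitment):
    # Hypothetical name. Works only on NumPy arrays, not on the DataFrame.
    out = np.empty_like(var_commitment)
    for i in prange(len(dates)):
        mask = date_and_fund(dates[i], funds[i], dates, funds)
        out[i] = var_commitment[mask].sum()
    return out


# Sample data from the question, with dates and funds encoded as integers.
n_funds = np.array([1, 1, 1, 2, 2, 3, 3], dtype=np.int64)
n_dates = np.array([1, 2, 3, 3, 4, 2, 3], dtype=np.int64)
n_varCommitment = np.array([100.0, 10.0, 0.0, 100.0, 0.0, 100.0, 20.0])

result = cumulated_commitment(n_dates, n_funds, n_varCommitment)
```

The result array can then be assigned back to the DataFrame column outside the jitted function. Note that this is still quadratic in the number of rows, so it may not scale as well as an approach without the per-row mask.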

I'd also recommend avoiding .iterrows() for performance reasons; see this post for an explanation. As a side note, t1[:] takes a full slice and is the same as t1.

Also, if you add a minimal example (code and dataframes), it might help in improving your current approach. It looks like you're just indexing in each iteration, so you might not need to loop at all if you use NumPy.
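To illustrate the "no loop" idea on the sample table from the question, here is a rough sketch using a grouped cumulative sum in pandas. It is an assumption that this matches the real data: the dates below are made-up placeholders for "Date 1" to "Date 4", and the column names are taken from the table header. The last step is only needed if a fund can have several rows on the same date, so that each of them sees the total up to and including that date, as the `<=` comparison in date_and_fund does.

```python
import pandas as pd

# Sample table from the question; the actual dates are hypothetical.
TS_Flujos = pd.DataFrame({
    "Fund name": ["Fund 1", "Fund 1", "Fund 1", "Fund 2", "Fund 2", "Fund 3", "Fund 3"],
    "Date": pd.to_datetime([
        "2020-01-31", "2020-02-29", "2020-03-31", "2020-03-31",
        "2020-04-30", "2020-02-29", "2020-03-31",
    ]),
    "Var Commitment": [100, 10, 0, 100, 0, 100, 20],
    "Cash flow": [-20, -2, 10, 0, -10, -20, 30],
})

# Sort so that, within each fund, rows are in chronological order.
TS_Flujos = TS_Flujos.sort_values(["Fund name", "Date"], kind="stable")

# Running total of commitments within each fund.
TS_Flujos["Cumulated commitment"] = (
    TS_Flujos.groupby("Fund name")["Var Commitment"].cumsum()
)

# Rows sharing the same fund and date all take the end-of-date total.
TS_Flujos["Cumulated commitment"] = (
    TS_Flujos.groupby(["Fund name", "Date"])["Cumulated commitment"].transform("last")
)

cumulated = TS_Flujos["Cumulated commitment"].tolist()
```

The same pattern could be applied to the "Cash flow" column to get the amount funded to date, from which the unfunded amount follows.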

yatu
  • Thanks for the advice and comments. I will try to implement the optimization of the main loop and send a more relevant version of my code! – FernandoP Sep 10 '20 at 14:31
  • Dear Yatu, I have been working on the code but I just can't find a way to accelerate my main loop. I added some detail to my previous post, in case you can help me. – FernandoP Sep 15 '20 at 23:07
  • Do you have any feedback? – FernandoP Sep 21 '20 at 23:57