
I wrote a program for calculating certain polynomials in Python, and it's reasonably fast, but I ran cProfile on it and the results are disturbing. The specific run takes 296 seconds, which is fine, but the cumulative time spent in `abc.py`'s `__instancecheck__` is 43 seconds. This really points toward rewriting it in something other than Python, especially since there's a calculation I want to run that, with the current code, would take 50 days.

  ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    201/1    0.001    0.000  296.104  296.104 {built-in method builtins.exec}
        1    0.000    0.000  296.104  296.104 double_samuel_schubmult7.py:1(<module>)
        1   85.132   85.132  295.717  295.717 double_samuel_schubmult7.py:205(schubmult)
    72618   34.608    0.000   94.825    0.001 double_samuel_schubmult7.py:247(<listcomp>)
 46620756   19.717    0.000   60.217    0.000 double_samuel_schubmult7.py:171(elem_sym_func)
        1    0.002    0.002   54.962   54.962 parallel.py:1000(__call__)
        1    0.584    0.584   54.886   54.886 parallel.py:960(retrieve)
    12039    0.013    0.000   54.181    0.005 pool.py:767(get)
    12039    0.008    0.000   54.160    0.004 pool.py:764(wait)
    12054    0.018    0.000   54.156    0.004 threading.py:589(wait)
      768    0.010    0.000   54.119    0.070 threading.py:288(wait)
     3126   54.105    0.017   54.105    0.017 {method 'acquire' of '_thread.lock' objects}
 58143528    7.986    0.000   43.820    0.000 abc.py:117(__instancecheck__)
 58143528   12.915    0.000   35.834    0.000 {built-in method _abc._abc_instancecheck}
 8188300/3054264   30.769    0.000   30.769    0.000 double_samuel_schubmult7.py:148(elem_sym_poly)
58143557/58143542    7.913    0.000   22.918    0.000 abc.py:121(__subclasscheck__)
58143557/58143542   15.006    0.000   15.006    0.000 {built-in method _abc._abc_subclasscheck}

The threading portion is for a small part of the code at the end that is unproblematic; most of the code is single-threaded.

Does this 43 seconds of cumulative `__instancecheck__` time really mean that if I write it in, say, C, it will be at least 43 seconds faster? Is there a way to suppress it?
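For scale, here is a quick `timeit` micro-benchmark of what one ABC `isinstance` check costs versus a concrete-type check (`numbers.Integral` is only a stand-in ABC here, not necessarily whatever symengine or numpy is checking against):

```python
import timeit

N = 1_000_000
# numbers.Integral is an ABC, so isinstance() routes through the Python-level
# abc.__instancecheck__; a concrete type like int takes the interpreter's fast path.
abc_time = timeit.timeit("isinstance(x, Integral)",
                         setup="from numbers import Integral; x = 7", number=N)
concrete_time = timeit.timeit("isinstance(x, int)", setup="x = 7", number=N)

print(f"ABC check:      {abc_time / N * 1e9:.0f} ns/call")
print(f"concrete check: {concrete_time / N * 1e9:.0f} ns/call")
```

Multiplied by the roughly 58 million calls shown in the profile, even a per-call overhead of a few hundred nanoseconds accounts for tens of seconds.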

I should note that for the polynomial calculations I'm using symengine, which could be where this is happening, or numpy. Below is the main function (`schubmult`).
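One way to pinpoint exactly where the `__instancecheck__` calls are coming from is to load the profile into `pstats` and print the callers of that function. A self-contained sketch (the `hot_loop` below is a hypothetical stand-in for the real workload):

```python
import cProfile
import io
import pstats
from numbers import Number  # an ABC: isinstance() against it hits abc.__instancecheck__

def hot_loop():
    # hypothetical stand-in for the real workload
    return sum(i for i in range(1000) if isinstance(i, Number))

profiler = cProfile.Profile()
result = profiler.runcall(hot_loop)

buf = io.StringIO()
stats = pstats.Stats(profiler, stream=buf)
# List every function that called __instancecheck__, with call counts:
stats.print_callers("__instancecheck__")
print(buf.getvalue())
```

Walking the caller chain upward (e.g. following `print_callers` from `__instancecheck__` to `isinstance` to the real code) shows which part of the program is responsible.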

from symengine import *
import numpy as np
from joblib import Parallel, delayed

# ...200 lines of omitted code (helpers used below, such as inverse, theta,
# permtrim, kdown_perms, elem_sym_perms, elem_sym_func, and pairsum)...

def schubmult(perm_dict,v):
    vn1 = inverse(v)
    th = theta(vn1)
    if th[0]==0:
        return perm_dict        
    mu = permtrim(uncode(th))
    vmu = permtrim(mulperm(list(v),mu))
    inv_vmu = inv(vmu)
    inv_mu = inv(mu)
    ret_dict = {}
    vpaths = [([(vmu,0)],1)]
    while th[-1] == 0:
        th.pop()
    for i in range(len(th)):
        k = i+1
        vpaths2 = []
        for path,s in vpaths:
            last_perm = path[-1][0]
            newperms = kdown_perms(last_perm,th[i],k)
            for new_perm,s2,vdiff in newperms:
                new_perm2 = permtrim(new_perm)
                if i == len(th)-1 and (len(new_perm2) != 2 or new_perm2[0]!=1):
                    continue
                path2 = [*path,(new_perm2,vdiff)]
                vpaths2 += [(path2,s*s2)]
        vpaths = vpaths2
    arr0 = [0 for vpath in vpaths]
    for u,val in perm_dict.items():
        inv_u = inv(u)
        vpathsums = {u: val*np.array([vpath[1] for vpath in vpaths])}
        for index in range(len(th)):            
            newpathsums = {}
            for up, arr in vpathsums.items():
                inv_up = inv(up)
                newperms = elem_sym_perms(up,min(th[index],(inv_mu-(inv_up-inv_u))-inv_vmu),th[index])
                for up2, udiff in newperms:
                    newpathsums[up2] = newpathsums.get(up2,np.array(arr0))+arr*[elem_sym_func(th[index],index+1,up,up2,vpaths[i][0][index][0],vpaths[i][0][index+1][0],udiff,vpaths[i][0][index+1][1],var2,var3) for i in range(len(vpaths))]
            vpathsums = newpathsums
        if len(vpaths)<300:
            ret_dict = add_perm_dict({ep: np.sum(arr) for ep,arr in vpathsums.items()},ret_dict)
        else:
            ret_dict = add_perm_dict(dict(Parallel(n_jobs=-1,require='sharedmem')(delayed(pairsum)(ep,arr) for ep,arr in vpathsums.items())),ret_dict)            
    return ret_dict
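For reference, here is a self-contained sketch of the kind of micro-optimization available on the hot `newpathsums` line: the comprehension re-dereferences `vpaths[i][0][index]` and recomputes `index + 1` on every call, and those lookups can be hoisted out once per `index`. The `elem_sym_func` below is a placeholder stub and the toy `vpaths` only mimics the real shape, so this demonstrates the transformation rather than the real computation:

```python
import numpy as np

# Placeholder stub; the real elem_sym_func computes elementary symmetric polynomials.
def elem_sym_func(tk, k, up, up2, a, b, udiff, vd, var2, var3):
    return tk + k + a + b + udiff + vd

# Toy data mimicking the real shape: each vpath is ([(perm, vdiff), ...], s).
vpaths = [([(i + j, j % 2) for j in range(4)], 1) for i in range(5)]
th = [3, 2, 1]
index, up, up2, udiff, var2, var3 = 1, 0, 0, 2, None, None
arr = np.ones(len(vpaths))

# Original style: every element re-indexes vpaths[i][0][...] and recomputes index + 1.
orig = arr * [elem_sym_func(th[index], index + 1, up, up2,
                            vpaths[i][0][index][0], vpaths[i][0][index + 1][0],
                            udiff, vpaths[i][0][index + 1][1], var2, var3)
              for i in range(len(vpaths))]

# Hoisted: pull the invariant lookups out once per index.
k, tk = index + 1, th[index]
triples = [(vp[0][index][0], vp[0][k][0], vp[0][k][1]) for vp in vpaths]
hoisted = arr * [elem_sym_func(tk, k, up, up2, a, b, udiff, vd, var2, var3)
                 for a, b, vd in triples]

assert np.array_equal(orig, hoisted)
```

The `triples` list only depends on `index`, so it can even be built once per iteration of the `for index in range(len(th))` loop and reused across all `up2` values.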
Matt Samuel
  • The cumulative time includes time where that function/method was calling other functions/methods. We would need to see more of your code to be able to suggest why it may be spending so much time in that method in particular / how to potentially improve it. – dskrypa Jan 03 '23 at 01:52
  • @dskrypa I'm happy to post code, I just wish there were some way to pinpoint exactly where that call is happening so I can post that part. The whole program (minus the arg parsing, etc.) is 255 lines, but schubmult (the part where it spends almost all of its time) is 50 lines. I'll try posting that. – Matt Samuel Jan 03 '23 at 02:11
  • Before seeing your code, it looks like you may be using multithreading instead of multiprocessing for a cpu-intensive task. Each Python process can only use 1 (100%) CPU, so multi-threading is more useful for IO-bound tasks, while multi-processing is more useful for cpu-bound tasks. – dskrypa Jan 03 '23 at 02:14
  • @dskrypa Yes, it's using threading. Multiprocessing didn't work out because it uses too much memory, but because it's using numpy using threading saved a lot of time, even though it's still only using one CPU. – Matt Samuel Jan 03 '23 at 02:16
  • There are many minor optimizations that could be made, though they would likely not result in an improvement that crosses an order of magnitude in terms of run time. They could still make a difference though. Things like converting the `[elem_sym_func(...) for i in range(len(vpaths))]` to use a generator to reduce repeated dereferencing of `vpaths[i][0][index]` and redundant calculations for `index + 1` there, for example. – dskrypa Jan 03 '23 at 02:40
  • The 3 biggest pain points based on the profiling that stand out the most to me appear to be the list comp where `elem_sym_func` is called, so the optimizations related to de-referencing there could help a bit. `elem_sym_func` itself seems to be another pain point that would be good to focus on. A large amount of time seems to be spent waiting for thread locks too. I'm not familiar with the way Parallel is working here, so that may be something where adjusting your multi-threading/processing strategy may help as well. – dskrypa Jan 03 '23 at 02:43
  • I would strongly suggest splitting out the functionality for major blocks into separate functions, with fewer operations per line of code. It will make it easier for you to read and maintain it. – dskrypa Jan 03 '23 at 02:45
  • `np.array([vpath[1] for vpath in vpaths])` appears to have a static value, but it is repeated on each iteration of the `for u, val in perm_dict.items():` loop. Separate from that, is there some part of the inner-most loop that calls `elem_sym_func` but looks at many non-changing values in `vpaths` that could be extracted to do some of that work once, then re-use the values for each iteration, rather than repeating the work on each iteration of the inner-most loop? – dskrypa Jan 03 '23 at 02:52
  • @dskrypa Thanks for your reply, I will think on it and consider it (tomorrow, it's bedtime). elem_sym_func is computing the elementary symmetric polynomials, which you would think could be done once and cached, problem is it's with different variables every time, and computing the polynomial once, saving it, then substituting the variables is actually slower. It's also recursive, but making it iterative saves literally no time, because leaving it recursive allows for some more optimizations. – Matt Samuel Jan 03 '23 at 02:58
  • @dskrypa Yes that's true about the vpath array being static, but the reason I didn't focus on that is because there's almost never more than one iteration of the outer loop, so at least in this case it's no optimization at all. – Matt Samuel Jan 03 '23 at 03:02
  • @dskrypa Anecdotally (on one non-profiled run), `newpathsums[up2] = newpathsums.get(up2,np.array(arr0))+arr*[elem_sym_func(th[index],index+1,up,up2,vpath[0][index][0],vpath[0][index+1][0],udiff,vpath[0][index+1][1],var2,var3) for vpath in vpaths]` is slower, if that's what you meant. Having the generator on i seems to be faster. – Matt Samuel Jan 03 '23 at 13:00
  • Almost. Honestly, I think there is too much here to cover effectively in the SO format. If you continue to reduce duplicate actions, and focus on profiling smaller blocks of code that may be possible to profile via the easier [timeit](https://stackoverflow.com/q/74495814) instead, then you will get there. – dskrypa Jan 03 '23 at 13:09
  • @dskrypa I agree, the comment section is complaining. I'll try timeit. And it seems like it's the same speed, the processor just had to warm up, which complicates things. – Matt Samuel Jan 03 '23 at 13:11
  • @dskrypa In case you're curious, I found an algorithmic optimization that increased the speed by a factor of 12. This case went from 4 minutes to 20 seconds. The algorithm above is highly inefficient, there's accumulation that can occur intermediately so that the size of the array in the last iteration of the loop becomes 1 (so there's no need to sum everything up at the end, as it is happening along the way). – Matt Samuel Jan 06 '23 at 18:06
  • Glad you were able to find a solution! – dskrypa Jan 06 '23 at 21:22

0 Answers