
I am doing a calculation on permutations of things from a generator created by `itertools`. I have a piece of code in this form (this is a dummy example):

import itertools
import pandas as pd

combos = itertools.permutations('abcdefghi',2)
results = []
i=0

for combo in combos:
    i+=1 #this line is actually other stuff that's expensive
    results.append([combo[0]+'-'+combo[1],i])

rdf = pd.DataFrame(results, columns=['combo','value'])

Except in the real code,

  • there are several hundred thousand permutations
  • instead of `i+=1` I am opening files and getting results of `clf.predict`, where `clf` is a classifier trained in scikit-learn
  • in place of `i` I'm storing a value from that prediction

I think the `combo[0]+'-'+combo[1]` part is trivial though.

This takes too long. What should I do to make it faster? Such as:

1) writing better code (maybe I should initialize `results` with the proper length instead of using `append`, but how much would that help? And what's the best way to do that when I don't know the length before iterating through `combos`?)

2) initializing a pandas DataFrame instead of a list and using `apply`?

3) using Cython with pandas? Total newbie to this.

4) parallelizing? I think I probably need to do this, but again, total newbie, and I don't know whether it's better to do it within a list or a pandas DataFrame. I understand I would need to iterate over the generator and initialize some kind of container before parallelizing.

Which combination of these options would be best and how can I put it together?

andbeonetraveler
  • `rdf = pd.DataFrame(itertools.permutations('abcdefghi',2), columns=['combo','value'])` best I can think of. – roganjosh Nov 07 '18 at 21:18
  • Actually, that doesn't join the strings. I suspect that's the bottleneck. `i` is superfluous, you could use `enumerate` – roganjosh Nov 07 '18 at 21:19
  • No that's the thing, the `i+=1` is just a dummy example. In reality, what's there is very expensive. In place of `i+=1`, I'm opening files from the names in the permutation, reading them in as dataframes, running `clf.predict` from a classifier trained in scikit-learn, and storing a value from the prediction. That can't really change, so I need to make everything else faster and/or parallelize – andbeonetraveler Nov 07 '18 at 21:26
  • @roganjosh I edited the question to try to make this clearer – andbeonetraveler Nov 07 '18 at 21:29
  • "maybe I should initialize results with the proper length instead of using append but how much will that help?" No help at all. – juanpa.arrivillaga Nov 07 '18 at 21:38
  • " I am opening files and getting results of clf.predict" that is almost certainly the bottleneck. How many classifiers are you working with? A dozen? Hundreds? Thousands? I am guessing they are pickled in some way? – juanpa.arrivillaga Nov 07 '18 at 21:42
  • @juanpa.arrivillaga I have one classifier (I train once) and then make several hundred thousand predictions, one for each permutation. – andbeonetraveler Nov 07 '18 at 21:43
  • @andbeonetraveler then `multiprocessing` might be a viable approach. There will be some overhead, though, of all inter-process communication – juanpa.arrivillaga Nov 07 '18 at 22:06
  • @juanpa.arrivillaga Yeah I need to parallelize this. I had taken a look at the multiprocessing module but I'm rather lost. Can you point me to a tutorial? What kind of container would be best to initialize--a list, pandas dataframe, or something else? – andbeonetraveler Nov 12 '18 at 18:08

1 Answer


The append operation in pandas and the explicit for loop are slow. This code avoids using them.

import itertools
import pandas as pd

combos = itertools.permutations('abcdefghi',2)
combo_values = [('-'.join(x[1]), x[0]) for x in enumerate(combos, 1)]

rdf = pd.DataFrame({'combos': [x[0] for x in combo_values],
                    'value': [x[1] for x in combo_values]})

You can do this for each file and dataframe that you have, then use `pd.concat` to quickly combine the results afterward. You can also add the enumeration of the permutations afterward if you want.
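A tiny sketch of that `pd.concat` step, using two hypothetical per-file frames in place of real prediction results:

```python
import pandas as pd

# Hypothetical per-file results; in the real code each frame would hold
# the value(s) extracted from one file's prediction.
frames = [
    pd.DataFrame({'combos': ['a-b'], 'value': [1]}),
    pd.DataFrame({'combos': ['b-a'], 'value': [2]}),
]

# One concat at the end is far cheaper than growing a DataFrame row by row.
rdf = pd.concat(frames, ignore_index=True)
```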

kevins_1
  • They don't `append` to a dataframe.... The dataframe constructor is called after the list is complete. – roganjosh Nov 07 '18 at 21:38
  • They don't use `.append` on the data-frame. You are still using a "for loop", it's just a list comprehension now. While this might be marginally faster, it isn't really a solution, since almost certainly, the bottleneck is not actually building the list. – juanpa.arrivillaga Nov 07 '18 at 21:40
  • The minimal working code does not need `append`. This solution works without the append operation. – kevins_1 Nov 07 '18 at 21:40
  • @kevins_1 yes, but it offers no significant advantage. If anything, yours is slower because of the way you are initializing your data-frame, but all of this is pretty irrelevant. – juanpa.arrivillaga Nov 07 '18 at 21:40
  • The advantage is significant and shows that the operation listed above for 1) is a good start. It is also faster than 'apply': https://tomaugspurger.github.io/modern-4-performance – kevins_1 Nov 07 '18 at 21:43
  • The advantage is not significant at all. In any event, you haven't even bothered to profile it. What does this have to do with `.apply`? You seem to be thoroughly confused. What in that link do you think is relevant to your answer? – juanpa.arrivillaga Nov 07 '18 at 21:46
  • Where does the OP call _apply_? Whatever you've taken away from that article, I'm pretty sure you've got some concepts confused. – roganjosh Nov 07 '18 at 21:47
  • As the OP's question states, _apply_ is mentioned in 2) above. Here is a reference for how a list comprehension is faster than append. I'd expect Cython and parallelizing are also viable. https://stackoverflow.com/questions/22108488/are-list-comprehensions-and-functional-functions-faster-than-for-loops – kevins_1 Nov 07 '18 at 21:54
  • @kevins_1 list comprehensions are *marginally* faster than append, and in this case, would make no appreciable difference. From your link: "A list comprehension is usually a tiny bit faster than the precisely equivalent for loop " – juanpa.arrivillaga Nov 07 '18 at 22:00