I am doing a calculation on permutations of things from a generator created by itertools. I have a piece of code in this form (this is a dummy example):
import itertools
import pandas as pd
combos = itertools.permutations('abcdefghi',2)
results = []
i=0
for combo in combos:
    i+=1 #this line is actually other stuff that's expensive
    results.append([combo[0]+'-'+combo[1],i])
rdf = pd.DataFrame(results, columns=['combo','value'])
Except in the real code,
- there are several hundred thousand permutations
- instead of i+=1 I am opening files and getting results of clf.predict, where clf is a classifier trained in scikit-learn
- in place of i I'm storing a value from that prediction
I think the combo[0]+'-'+combo[1] is trivial though.
This takes too long. What should I do to make it faster? Such as:
1) writing better code (maybe I should initialize results with the proper length instead of using append, but how much will that help? And what's the best way to do that when I don't know the length before iterating through combos?)
2) initializing a pandas dataframe instead of a list and using apply?
3) using cython in pandas? Total newbie to this.
4) parallelizing? I think I probably need to do this, but again, total newbie, and I don't know whether it's better to do it within a list or a pandas dataframe. I understand I would need to iterate over the generator and initialize some kind of container before parallelizing.
Which combination of these options would be best and how can I put it together?
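To make options 1 and 4 concrete, here is a rough sketch of the shape they could take, not a benchmarked solution. The names expensive, score, run_serial and run_parallel are hypothetical, and expensive is only a placeholder for the file reading and clf.predict call. It relies on the fact that the number of r-length permutations of n items is n!/(n-r)!, so the length is known before iterating (math.perm needs Python 3.8+).

```python
import itertools
import math
from concurrent.futures import ProcessPoolExecutor


def expensive(combo):
    """Hypothetical placeholder for opening files and calling clf.predict."""
    return ord(combo[0]) * 1000 + ord(combo[1])


def score(combo):
    """Build one finished row: the label plus the computed value."""
    return combo[0] + '-' + combo[1], expensive(combo)


def run_serial(items, r=2):
    # Option 1: the row count is n! / (n - r)!, so the list can be
    # allocated up front instead of grown with append.
    n_rows = math.perm(len(items), r)
    results = [None] * n_rows
    for k, combo in enumerate(itertools.permutations(items, r)):
        results[k] = score(combo)
    return results


def run_parallel(items, r=2, workers=4, chunksize=1000):
    # Option 4: Executor.map consumes the generator, hands the work to
    # worker processes in chunks, and yields results in input order.
    # score must be a module-level function so it can be pickled.
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(score, itertools.permutations(items, r),
                             chunksize=chunksize))
```

Either function returns a list of (combo, value) rows that can be passed to pd.DataFrame(rows, columns=['combo','value']) as in the code above. run_parallel would need to be called from under an if __name__ == '__main__': guard, and the workers and chunksize defaults are guesses rather than tuned values.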