I have this code:

import pandas as pd
from itertools import product

for a, b, c, d, e in product(range(x), range(y), range(z), range(t), range(m)):
    factor = foo(a, b, c, d, e)
    result_df.loc[len(result_df.index)] = [a, b, c, d, e, factor]

Here I use itertools.product to generate five variables, pass them to a foo function, and then append the results to a dataframe.

The foo function is fully optimized and uses vectorization and numpy in every calculation.

Is there any way to make this code run faster?

Edit: So apparently using df.loc to append is very slow. What do you suggest? How can I store every iteration's a, b, c, d, e and factor, and then make a dataframe out of them?

Edit 2: As others mentioned, I switched to appending to a list instead of to the dataframe, then built a dataframe from the list at the end, using the code below:

import pandas as pd
from itertools import product

res_list=[]

for a, b, c, d, e in product(range(x), range(y), range(z), range(t), range(m)):
    factor = foo(a, b, c, d, e)
    res_list.append([a, b, c, d, e, factor])

col_list = ['x', 'y', 'z', 't', 'm', 'factor']
res_df = pd.DataFrame(res_list, columns=col_list)

It's faster now, but not by much: roughly 10% quicker. Any other tips?

  • 7
    `result_df.loc[len(result_df.index)] = ...` is very slow. Appending one row to a Pandas dataframe requires copying the entire dataframe. It's better to store rows in a list, then convert into a dataframe at the end. – Nick ODell Aug 15 '23 at 21:10
  • 1
    What dataframe library are you using? If Pandas, this is probably a duplicate of [Create a Pandas Dataframe by appending one row at a time](/q/10715965/4518341). See also [Error "'DataFrame' object has no attribute 'append'"](/q/75956209/4518341). (They removed `append` precisely because it's so slow. What you're using is a workaround at best.) BTW, welcome to Stack Overflow! Check out the [tour], and [ask] for tips like including all relevant tags, providing a [mre], and how to write a good title. – wjandrea Aug 15 '23 at 21:17
  • using pandas. Thanks! – Mohsen Shahhosseini Aug 15 '23 at 22:19

1 Answer

As mentioned in the comments, don't grow a dataframe incrementally: appending one row requires copying the entire dataframe. Build a list instead. You can do something like:

df = pd.DataFrame(
    [
        [*tup, foo(*tup)]
        for tup in product(range(x), range(y), range(z), range(t), range(m))
    ]
)

Then you can give the appropriate names to the columns (you could do that upfront as well).
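As a minimal sketch of naming the columns upfront, the `columns` argument of `pd.DataFrame` can be passed along with the rows. The `foo` body and the range sizes below are placeholders for illustration only, not the asker's real function or values:

```python
import pandas as pd
from itertools import product


def foo(a, b, c, d, e):
    # Hypothetical stand-in for the real (expensive) foo.
    return a + b + c + d + e


# Placeholder range sizes, chosen small so the example runs instantly.
x, y, z, t, m = 2, 2, 2, 2, 2

col_list = ['x', 'y', 'z', 't', 'm', 'factor']
df = pd.DataFrame(
    [
        [*tup, foo(*tup)]
        for tup in product(range(x), range(y), range(z), range(t), range(m))
    ],
    columns=col_list,  # name the columns at construction time
)
```

Assigning `df.columns = col_list` after construction should give the same result.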

Note, the list comprehension here just does a basic loop with append. It is the same as:

result = []
for tup in product(range(x), range(y), range(z), range(t), range(m)):
    result.append([*tup, foo(*tup)])
df = pd.DataFrame(result)

There is nothing wrong with this approach, but note that the list is now still referenced in whatever scope you run this in, which can keep it in memory unnecessarily. This is why you should do this sort of thing in a function.
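A rough sketch of the function version follows. The name `build_df` and the `foo` body are hypothetical, introduced only for illustration. The intermediate list is local to the function, so nothing references it once the function returns:

```python
import pandas as pd
from itertools import product


def foo(a, b, c, d, e):
    # Hypothetical stand-in for the real foo.
    return a * b * c * d * e


def build_df(x, y, z, t, m):
    # `rows` is a local name; the list can be freed after the function returns.
    rows = []
    for tup in product(range(x), range(y), range(z), range(t), range(m)):
        rows.append([*tup, foo(*tup)])
    return pd.DataFrame(rows, columns=['x', 'y', 'z', 't', 'm', 'factor'])


# Small placeholder sizes so the example runs quickly.
res_df = build_df(2, 3, 2, 1, 2)
```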

Also note that for large values of x, y, z, t, and m, this will always be slow because of combinatorial explosion.
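To make the explosion concrete: the loop calls foo once per combination, so the number of calls is the product x * y * z * t * m. The sizes below are made-up values for illustration, not the asker's:

```python
from math import prod


def total_calls(*sizes):
    # One call to foo per element of the Cartesian product.
    return prod(sizes)


# Hypothetical sizes: five ranges of 10 values each.
small = total_calls(10, 10, 10, 10, 10)   # 100_000 calls
# Doubling every range multiplies the work by 2**5 = 32.
large = total_calls(20, 20, 20, 20, 20)   # 3_200_000 calls
ratio = large // small                    # 32
```

Under this reading, changing how rows are stored leaves the number of foo calls unchanged.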

juanpa.arrivillaga
  • Yes, I know it will be slow. I'm running the code on a cloud server and it takes around 10 days to finish. I just wanted to know if I could reduce it to 6-7 days. – Mohsen Shahhosseini Aug 15 '23 at 21:50