
We use a pandas DataFrame in our project, and we realized that our program is very slow due to the DataFrame's slow row-by-row calculations. I have shared our code below.

    df_item_in_desc = pd.DataFrame(columns=df.columns)  # to hold all satisfied results

    for index in range(df.shape[0]):
        s1 = set(df.iloc[index]['desc_words_short'])

        if item_number in s1:
            df_item_in_desc = df_item_in_desc.append(df.iloc[index])

We check whether the item name appears in the column desc_words_short, and if it does, we append that row to another DataFrame (df_item_in_desc). The logic is simple, but to find such rows we have to iterate over the whole DataFrame and check the condition for every row. Our DataFrame is fairly large, so running this code takes a long time. How can we speed this process up? Can we use CPU parallelization for this task, or something else?

Note: We actually tried CPU parallelization but weren't successful.
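For reference, a minimal reproducible version of the setup looks like the following (the data is made up for illustration; the real frame has ~100,000 rows). Since DataFrame.append was removed in pandas 2.0, this sketch collects matching rows in a list and concatenates once at the end, which is also cheaper than repeated appends:

```python
import pandas as pd

# Made-up sample data; 'widget' and the word lists are illustrative only
item_number = 'widget'
df = pd.DataFrame({'desc_words_short': [['widget', 'blue'], ['gadget'], ['widget']]})

# Same row-by-row logic as above, but collecting matches and concatenating once
matching_rows = []
for index in range(df.shape[0]):
    if item_number in set(df.iloc[index]['desc_words_short']):
        matching_rows.append(df.iloc[[index]])  # double brackets keep it a DataFrame

df_item_in_desc = pd.concat(matching_rows)
print(df_item_in_desc)
```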

Ali
  • What is the memory of the machine you are running this on, and what is the size of your largest dataset? – รยקคгรђשค Jul 07 '21 at 03:27
  • Ram is 8 GB and there are about 100000 samples in our dataset. Actually the size of our dataset is less than 100 MB, not so large. – Ali Jul 07 '21 at 03:31
  • Can you give an example of your data? How diverse are the items? You could maybe try to first build a set of the items per column to avoid having to loop over all rows at each iteration. Note however that efficiency of various strategies to achieve your goal will differ greatly depending on the nature of the data. – mozway Jul 07 '21 at 03:36
  • @mozway thanks for the response. Actually the variable `item_number` holds the item's name and the column `desc_words_short` holds a list of strings, and we wanted to check whether the item name is in the column `desc_words_short`. – Ali Jul 07 '21 at 03:51
  • OK, but this doesn't address the question on the type of data. Depending on whether you have a lot of different items, many similar ones, duplicates, etc. will change how to try solving your issue. As your goal is to optimize speed, I think you cannot expect a decent response without providing a dataset. – mozway Jul 07 '21 at 03:56

2 Answers

1

So it looks like you're looping through each row and checking the value of the desc_words_short column. For each value (presumably a list), if that value contains item_number, you want to add that row to df_item_in_desc.

If that is the goal, you may be able to speed it up like this:

import pandas as pd

item_number = 'a'
df = pd.DataFrame({'desc_words_short':[['a','a','b'],['b','d'],['c','c']]})

print(df)

  desc_words_short
0        [a, a, b]
1           [b, d]
2           [c, c]

mask = df['desc_words_short'].apply(lambda x: item_number in x)
df_item_in_desc = df.loc[mask]

print(df_item_in_desc)

  desc_words_short
0        [a, a, b]

I'm not sure what the point of set is, as item_number would be found in either the full list or the set, so building the set is a pointless additional computation.
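As a further fully vectorized variant (a sketch of my own, not something the asker's code uses): the list column can be flattened with `Series.explode`, compared element-wise, and then reduced back to one boolean per original row:

```python
import pandas as pd

item_number = 'a'
df = pd.DataFrame({'desc_words_short': [['a', 'a', 'b'], ['b', 'd'], ['c', 'c']]})

# explode() flattens each list into one row per element, keeping the original index;
# groupby(level=0).any() then collapses the per-element comparisons back per row
mask = df['desc_words_short'].explode().eq(item_number).groupby(level=0).any()
df_item_in_desc = df.loc[mask]
print(df_item_in_desc)
```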

Derek Eden
1

You can also use a list comprehension. We should avoid df.apply and keep it as a last resort.

On larger datasets, the list comprehension will be faster. Benchmarks are in this answer: https://stackoverflow.com/a/54432584/6741053. The answer itself is a gem of wisdom.

Quoting the benchmark:

%timeit df[df.apply(lambda x: x['Name'].lower() in x['Title'].lower(), axis=1)]
%timeit df[[y.lower() in x.lower() for x, y in zip(df['Title'], df['Name'])]]

2.85 ms ± 38.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
788 µs ± 16.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

import pandas as pd

item_number = 'a'
df = pd.DataFrame({'desc_words_short':[['a','a','b'],['b','d'],['c','c']]})
df[[item_number in x for x in df['desc_words_short']]]

Dataframe:

  desc_words_short
0        [a, a, b]
1           [b, d]
2           [c, c]

Output:

  desc_words_short
0        [a, a, b]

Reference: https://stackoverflow.com/a/54432584/6741053

รยקคгรђשค
  • I think this is a really good point. I will also try this approach. Thanks a lot for the answer @รยקคгรђשค – Ali Jul 07 '21 at 08:21
  • @Ali Happy to help. It would be good if you could compare both approaches and provide benchmarks in a separate answer; that would be helpful to future visitors. – รยקคгรђשค Jul 07 '21 at 10:00