0

I have a pandas data frame, in which basically only two columns are important. The column 'Name' and the other one 'Cost'.

I have different categories for my costs. For each I have list of keywords. Based on these keywords I find its related rows in the dataframe:

 a = df[df['Name'].str.contains('|'.join(keywords),case=False)] 

and then I calculate the sum of Cost values in those rows to get that category cost:

sum_ = 0
for index, row in a.iterrows():
    cost= float(row['Cost'])
    sum_ += cost

The problem is with this approach, I never know if a certain row has been considered multiple times or if at the end a row is missed and wasn't allocated to any category.

My question is first how to get indexes of the filtered/chosen rows when using str.contain and then how to check if the rows has been previously used in another category.

Thank you so much.

Niel_Eenterm
  • 128
  • 8
  • 2
    Try running just the inner portion of your boolean indexing, `df['Name'].str.contains('|'.join(keywords),case=False)` and you'll see that you get the mask itself, a series of `True/False` by index. You can construct each mask this way, then you can compare indices before applying the mask(s) back to the dataframe – G. Anderson Apr 23 '20 at 20:06
  • Hey Niel, Welcome to SO. Please share expected output. – Mayank Porwal Apr 23 '20 at 20:07
  • 1
    IIUC the proper tool here would be to use `np.select` that way you can assign each row to only one category, giving a priority to the conditions you specify first. This is similar to `if-elif-else` logic: https://stackoverflow.com/questions/19913659/pandas-conditional-creation-of-a-series-dataframe-column – ALollz Apr 23 '20 at 20:12
  • @MayankPorwal, actually I have no idea how to handle this problem and about a proper output. an I know it makes it hard to answer as well. I'm looking for ideas rather than a specific line of code. – Niel_Eenterm Apr 23 '20 at 20:16
  • @G.Anderson Thank you I will try it. – Niel_Eenterm Apr 23 '20 at 20:16
  • @ALollz this is an interesting point of view. The problem is with this approach the conditions should be written beforehand. In my case the number of categories and keywords are not defined in the code. They are defined in a config.json file. So the number of categories are not always the same. – Niel_Eenterm Apr 23 '20 at 20:39
  • _My question is first how to get indexes of the filtered/chosen rows when using str.contain_ You already have that information when you're doing `df[df['Name'].str.contains('|'.join(keywords),case=False)] `. _I never know if a certain row has been considered multiple times_ .... _check if the rows has been previously used in another category_ What do you mean? – AMC Apr 23 '20 at 21:32
  • @AMC to elaborate more: I want to allocate my transactions to different categories (based on keyword lists for different categories). At the end no transcation should be unassigned and also no two categories should overlap each other. The way I am doing it is using a for loop to iterate over all keyword lists. If it is not clear let me know. – Niel_Eenterm Apr 24 '20 at 09:36

0 Answers0