
I have a pandas dataframe that looks like the following:

Type        Keywords 
----        --------
Animal      [Pigeon, Bird, Raccoon, Dog, Cat]
Pet         [Dog, Cat, Hamster]
Pest        [Rat, Mouse, Raccoon, Pigeon]
Farm        [Chicken, Horse, Cow, Sheep]
Predator    [Wolf, Fox, Raccoon]

Let's say that I have the following string:

input = "There is a dead rat and raccoon in my pool"

Given that I tokenize the string and remove stop-words so that it becomes

input = [Dead, Rat, Raccoon, Pool]
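
(A rough sketch of that preprocessing step, with an illustrative stop-word list rather than a proper one:)

import re

stop_words = {'there', 'is', 'a', 'and', 'in', 'my'}  # illustrative only

def tokenize(text):
    # lowercase, split into words, drop stop-words, then title-case
    # to match the capitalization used in the Keywords column
    words = re.findall(r'[a-z]+', text.lower())
    return [w.title() for w in words if w not in stop_words]

tokens = tokenize("There is a dead rat and raccoon in my pool")
# ['Dead', 'Rat', 'Raccoon', 'Pool']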

I need to go through each row and find the rows that have the highest number of keyword matches. With the given example, the results would look like the following:

Type        Keywords                            Matches
----        --------                            -------
Animal      [Pigeon, Bird, Raccoon, Dog, Cat]   1
Pet         [Dog, Cat, Hamster]                 0
Pest        [Rat, Mouse, Raccoon, Pigeon]       2
Farm        [Chicken, Horse, Cow, Sheep]        0
Predator    [Wolf, Fox, Raccoon]                1

The output would be the top three Type names that have the highest number of matches.

In the above case, "Pest" has the most matches, so it comes first; Animal and Predator follow with one match each. The output, in order, would thus be:

output = [Pest, Animal, Predator]

Doing this task with nested for loops is easy, but since I have thousands of these rows, I'm looking for a better solution. (For some reason I have also encountered a lot of bugs when using non-built-in functions with pandas; perhaps it's because of vectorization?)
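
For reference, the nested-loop approach I mean looks roughly like this (simplified; assume the dataframe is df and the tokenized input is tokens):

matches = []
for keywords in df['Keywords']:
    count = 0
    for token in tokens:
        if token in keywords:
            count += 1
    matches.append(count)
df['Matches'] = matches

# top three Types by match count
top_three = df.sort_values('Matches', ascending=False).head(3)['Type'].tolist()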

I looked at the groupby and isin functions built into pandas, but as far as I could tell they would not get me to the output I want (I would not be surprised at all if I am wrong about that).

I next looked into using sets and hashmaps with pandas, but unfortunately my current coding ability is not yet good enough to craft a solid solution. This StackOverflow link in particular got me much closer to what I wanted, though it did not return the names of the top three matching rows.

I would greatly appreciate any help or advice.

Albert Lee

2 Answers


You may check isin: build a DataFrame from the keyword lists, flag the entries that appear in the token set s, and count the matches per row.

# expand the lists into columns, flag values present in s, count per row
df['Matches'] = pd.DataFrame(df.Keywords.values.tolist()).isin(s).sum(1)


# Types with at least one match, in original row order
df.loc[df['Matches'] > 0, 'Type'].values.tolist()
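
Note this keeps every Type with a match rather than only the top three; if you want just three, DataFrame.nlargest could be combined with it, e.g.:

df.nlargest(3, 'Matches')['Type'].tolist()
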
BENY

It won't be very efficient to store and operate on lists in a DataFrame; that being said, we can use set intersection here:

Setup

s = {'Dead', 'Rat', 'Raccoon', 'Pool'}

Now using a list comprehension (faster than apply):

# for each row, count the keywords that also appear in the token set s
out = df.assign(Matches=[len(set(el) & s) for el in df.Keywords])


       Type                           Keywords  Matches
0    Animal  [Pigeon, Bird, Raccoon, Dog, Cat]        1
1       Pet                [Dog, Cat, Hamster]        0
2      Pest      [Rat, Mouse, Raccoon, Pigeon]        2
3      Farm       [Chicken, Horse, Cow, Sheep]        0
4  Predator               [Wolf, Fox, Raccoon]        1

To find the three rows with the most matches:

out.loc[out.Matches.nlargest(3).index].Type.tolist()

['Pest', 'Animal', 'Predator']
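
If ties at the cutoff matter, Series.nlargest also takes a keep argument; keep='all' would include every row tied with the third-highest count:

out.loc[out.Matches.nlargest(3, keep='all').index].Type.tolist()
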
user3483203