0

I would like to extract a target word from each sentence, like this:

a dog ==> dog
some dogs ==> dog
dogmatic ==> None

There is a similar question: Extract substring from text in a pandas DataFrame as new column

But it does not fulfill my requirements.

From this dataframe:

df = pd.DataFrame({'comment': ['A likes cat', 'B likes Cats',
                               'C likes cats.', 'D likes cat!', 
                               'E is educated',
                              'F is catholic',
                              'G likes cat, he has three of them.',
                              'H likes cat; he has four of them.',
                              'I adore !!cats!!',
                              'x is dogmatic',
                              'x is eating hotdogs.',
                              'x likes dogs, he has three of them.',
                              'x likes dogs; he has four of them.',
                              'x adores **dogs**'
                              ]})

How can I get the correct output?

                            comment      label EXTRACT
0                           A likes cat   cat     cat
1                          B likes Cats   cat     cat
2                         C likes cats.   cat     cat
3                          D likes cat!   cat     cat
4                         E is educated  None     cat
5                         F is catholic  None     cat
6    G likes cat, he has three of them.   cat     cat
7     H likes cat; he has four of them.   cat     cat
8                      I adore !!cats!!   cat     cat
9                         x is dogmatic  None     dog
10                 x is eating hotdogs.  None     dog
11  x likes dogs, he has three of them.   dog     dog
12   x likes dogs; he has four of them.   dog     dog
13                    x adores **dogs**   dog     dog

NOTE: The EXTRACT column gives the wrong answer; I need output like the label column.


BhishanPoudel

6 Answers

4

We can use str.extract with a negative lookahead, (?!...), to reject a match that is followed by two or more letters. This rules out words such as dogmatic while still accepting the plural dogs.

After that we use np.where with a positive lookbehind, (?<=...). The logic is as follows:

Every row in which "dog" or "cat" is preceded by a word character (as in hotdogs) has its label replaced by NaN.

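import numpy as np

words = ['cat', 'dog']

# Match a word unless it is followed by two or more letters (case-insensitive)
df['label'] = df['comment'].str.extract(r'(?i)(' + '|'.join(words) + r')(?![A-Za-z]{2,})', expand=False)
# Blank the label when "dog" or "cat" is preceded by a word character, e.g. "hotdogs"
df['label'] = np.where(df['comment'].str.contains(r'(?<=\wdog)|(?<=\wcat)'), np.nan, df['label'])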

Output

                                comment label
0                           A likes cat   cat
1                          B likes Cats   Cat
2                         C likes cats.   cat
3                          D likes cat!   cat
4                         E is educated   NaN
5                         F is catholic   NaN
6    G likes cat, he has three of them.   cat
7     H likes cat; he has four of them.   cat
8                      I adore !!cats!!   cat
9                         x is dogmatic   NaN
10                 x is eating hotdogs.   NaN
11  x likes dogs, he has three of them.   dog
12   x likes dogs; he has four of them.   dog
13                    x adores **dogs**   dog
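
As a variation on the two steps above (not the answer's exact method), the lookbehind and the lookahead can be folded into a single pattern. The sketch below uses only the standard re module; the names WORDS, PATTERN and extract_label are hypothetical. Unlike the np.where step, it rejects only the embedded occurrence rather than the whole row, and it lowercases the result so that "Cats" yields "cat".

```python
import re

WORDS = ["cat", "dog"]  # assumed label vocabulary, taken from the answer above

# (?<!\w)          : not preceded by a word character (rejects "hotdogs")
# (?![A-Za-z]{2,}) : not followed by two or more letters (rejects "dogmatic")
PATTERN = re.compile(
    r"(?<!\w)(" + "|".join(WORDS) + r")(?![A-Za-z]{2,})",
    re.IGNORECASE,
)


def extract_label(comment):
    """Return the lowercased label found in comment, or None if there is none."""
    match = PATTERN.search(comment)
    return match.group(1).lower() if match else None


for text in ["B likes Cats", "x is dogmatic", "x is eating hotdogs."]:
    print(text, "->", extract_label(text))
```

If this fits your data, it should be possible to apply it to the DataFrame with df['comment'].map(extract_label).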
Erfan
  • `hotdogs` is supposed to fail. – MonkeyZeus Aug 27 '19 at 15:08
  • So I should downvote? OP supplied a good set of test data unlike many other questions I find on here. It would be understandable if you failed to match unspecified edge-cases but the least you can do is match what OP presented. – MonkeyZeus Aug 27 '19 at 15:10
  • There you go @MonkeyZeus – Erfan Aug 27 '19 at 15:12
  • @Erfan I truly appreciate your input and upvoted, but still I have about million rows and `hotdog` is just one example of edge case. Is there a way to remove these larger words containing small words? – BhishanPoudel Aug 27 '19 at 15:14
  • There we go, edited. Plus explained the logic. This should work @MilkyWay007 – Erfan Aug 27 '19 at 15:17
  • I have to say that this is a nice approach but it is a bit like reinventing the stemmer ;-) – user2672299 Aug 27 '19 at 15:20
  • @user2672299 You can post your solution using `nltk stemmer`, I would gladly upvote and check as answer. – BhishanPoudel Aug 27 '19 at 15:23
  • @Erfan, This almost solves the problem, I added one more NaN case for cat like dogs and solved the question. Thanks a million. Appreciate your help. `df['label'] = np.where(df['comment'].str.contains('(?<=\wcat)'), np.NaN, df['label'])` – BhishanPoudel Aug 27 '19 at 15:24
  • @MilkyWay007 see my edit with the `|` or operator which is more elegant – Erfan Aug 27 '19 at 15:27
  • 1
    Nice, now I feel comfortable supporting this answer. – MonkeyZeus Aug 27 '19 at 15:30
2

What you are trying to achieve is extracting a label from each sentence. That is a natural language processing problem, not a programming problem.

Approaches:

  1. Use a stemmer/lemmatizer. You could match the output of the stemmer against your stemmed class name list. This will most likely not give you a high enough accuracy.
  2. Train a machine learning classifier on your topics/labels.

Lemmatizer solution (I reused some preprocessing code from another answer to this question):

import nltk
import pandas as pd
from nltk.stem import WordNetLemmatizer

nltk.download('wordnet')  # fetch the WordNet data the lemmatizer relies on
lemma = WordNetLemmatizer()


df = pd.DataFrame({'comment': ['A likes cat', 'B likes Cats',
                           'C likes cats.', 'D likes cat!', 
                           'E is educated',
                          'F is catholic',
                          'G likes cat, he has three of them.',
                          'H likes cat; he has four of them.',
                          'I adore !!cats!!',
                          'x is dogmatic',
                          'x is eating hotdogs.',
                          'x likes dogs, he has three of them.',
                          'x likes dogs; he has four of them.',
                          'x adores **dogs**'
                          ]})

word_list = ["cat",  "dog"]    # words (and all variations) that you wish to check for
word_list = list(map(lemma.lemmatize, word_list))


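# regex=True is required in recent pandas, where str.replace defaults to literal matching
df["label"] = df["comment"].str.lower().str.replace('[^a-zA-Z]', ' ', regex=True).apply(lambda x: [lemma.lemmatize(word) for word in x.split()])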
df["label"] = df["label"].apply(lambda x: [i for i in word_list if i in x])

df["label"] = df["label"].apply(lambda x: None if not x else x)
print(df)
user2672299
  • 1
    Thanks for suggestion. Actually I am going to build the classifer after I get the labels. But first I need to get the labels. Do you have any links how to create labels from sentences like this? – BhishanPoudel Aug 27 '19 at 14:59
  • Could you mine twitter API for sentences with hash tags that you need? – user2672299 Aug 27 '19 at 15:03
  • You will need to do a web search. I am not aware of all the possibilities for generating labeled data for your specific classification problem. Sorry for not being more helpful. – user2672299 Aug 27 '19 at 15:06
  • 1
    Agree with this answer. Unless you have more specific description of what you want to do, the problem seems too general. Even handling plurals in English is not that easy (dwarf -> dwarves, fox -> foxes, but not does -> doe) – justhalf Aug 27 '19 at 15:14
  • I think this is the better answer because it takes most exceptions into account. – user2672299 Aug 27 '19 at 17:11
  • @justhalf I am working on a binary classification problem where I am only interested in whether it is dog or cat and want to label them `cat`, `dog` or `None`. No other animals are included in the classification. – BhishanPoudel Aug 28 '19 at 14:32
  • 1
    Then your question should specify that you want to recognize "dog, dogs, cat, cats, but nothing else". That's a much clearer question, and much easier to answer. – justhalf Aug 28 '19 at 16:36
  • I agree @justhalf – user2672299 Aug 28 '19 at 17:32
2
df = pd.DataFrame({'comment': ['A likes cat', 'B likes Cats',
                           'C likes cats.', 'D likes cat!', 
                           'E is educated',
                          'F is catholic',
                          'G likes cat, he has three of them.',
                          'H likes cat; he has four of them.',
                          'I adore !!cats!!',
                          'x is dogmatic',
                          'x is eating hotdogs.',
                          'x likes dogs, he has three of them.',
                          'x likes dogs; he has four of them.',
                          'x adores **dogs**'
                          ]})

word_list = ["cat", "cats", "dog", "dogs"]    # words (and all variations) that you wish to check for

# Lowercase, strip punctuation, and split each comment into words
tokens = df["comment"].str.lower().str.replace(r'[^\w\s]', '', regex=True).str.split()
# Keep the first listed word that appears as a whole token, dropping a trailing "s"
df["label"] = tokens.apply(lambda x: next((w.rstrip("s") for w in word_list if w in x), None))

Then that gives you:

df
    comment                             label
0   A likes cat                         cat
1   B likes Cats                        cat
2   C likes cats.                       cat
3   D likes cat!                        cat
4   E is educated                       None
5   F is catholic                       None
6   G likes cat, he has three of them.  cat
7   H likes cat; he has four of them.   cat
8   I adore !!cats!!                    cat
9   x is dogmatic                       None
10  x is eating hotdogs.                None
11  x likes dogs, he has three of them. dog
12  x likes dogs; he has four of them.  dog
13  x adores **dogs**                   dog
Ted
1

Something like this?

/^(.*?[^a-z\r\n])?((cat|dog)s?)([^a-z\r\n].*?)?$/gmi

\2 will contain one of: cat, dog, cats, dogs

https://regex101.com/r/Tt3MiZ/3
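
A minimal sketch of how this regex might be used from Python, assuming the /.../gmi delimiters are dropped and the m and i flags are mapped to re.MULTILINE and re.IGNORECASE. The helper name label_from_regex is hypothetical. Group 3 holds the bare cat or dog, while group 2 may include the plural s.

```python
import re

# The regex from this answer, without the /.../gmi wrapper.
# "g" has no counterpart here because search() returns a single match.
PATTERN = re.compile(
    r"^(.*?[^a-z\r\n])?((cat|dog)s?)([^a-z\r\n].*?)?$",
    re.MULTILINE | re.IGNORECASE,
)


def label_from_regex(comment):
    """Return 'cat' or 'dog' when the whole-word pattern matches, else None."""
    match = PATTERN.search(comment)
    return match.group(3).lower() if match else None


for text in ["D likes cat!", "F is catholic", "x adores **dogs**"]:
    print(text, "->", label_from_regex(text))
```

In pandas, something like df['comment'].map(label_from_regex) should then fill a label column; that call is a suggestion and not part of the original answer.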

MonkeyZeus
  • df.comment.str.extract(regexp, expand=False) gives all NaNs. How to implement this in pandas? – BhishanPoudel Aug 27 '19 at 15:05
  • @MilkyWay007 I have no idea to be honest, I've never done Python. In the regex101 link you can click "Code Generator" to get an example of how the site achieved its results so you can conform that to your needs. I am sure there are plenty of Pandas question on SO for you to reference. – MonkeyZeus Aug 27 '19 at 15:07
-1

In this case I imagine you don't even need a regex. Just use the equality operator == to test for an exact match, since you're looking for "dog", "dogs", "cat" or "cats" as an entire word. For example:

string = "he likes hotdogs"
for word in string.split():  # iterate over words, not characters
    if word == "dogs":
        print("Yes")
    else:
        print("No")

For the string "he likes hotdogs", the loop above prints "No" for every word, because "hotdogs" is not equal to "dogs".
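
One caveat with plain equality is that tokens such as "dogs," or "!!cats!!" carry punctuation and would not compare equal. A rough sketch of the same idea that strips punctuation first is shown here; the TARGETS mapping and the label_by_equality helper are hypothetical names, and the mapping assumes only the four word forms from the question matter.

```python
import string

# Assumed mapping from each accepted word form to its label
TARGETS = {"cat": "cat", "cats": "cat", "dog": "dog", "dogs": "dog"}


def label_by_equality(comment):
    """Return the label of the first whole word equal to a target, else None."""
    for token in comment.lower().split():
        word = token.strip(string.punctuation)  # drop leading/trailing punctuation
        if word in TARGETS:
            return TARGETS[word]
    return None


for text in ["x likes dogs, he has three of them.", "he likes hotdogs"]:
    print(text, "->", label_by_equality(text))
```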

-2

You can compile regex for cat, cats, dog and dogs.

import re
# \b word boundaries keep "catholic", "dogmatic" and "hotdogs" from matching
regex = re.compile(r'\b(?:cat|dog)s?\b', re.I)
data = ['A likes cat', 'B likes Cats',
                           'C likes cats.', 'D likes cat!', 
                           'E is educated',
                          'F is catholic',
                          'G likes cat, he has three of them.',
                          'H likes cat; he has four of them.',
                          'I adore !!cats!!',
                          'x is dogmatic',
                          'x is eating hotdogs.',
                          'x likes dogs, he has three of them.',
                          'x likes dogs; he has four of them.',
                          'x adores **dogs**'
                          ]
for i in data:
    t = regex.search(i)
    print(t)
P K