2

I have a dataframe with a column that contains words or characters, and I'm trying to categorize each row by searching for keywords in the corresponding cell.

Example:

  words             |   category
-----------------------------------
im a test email     |  email
here is my handout  |  handout

Here is what I have:

import numpy as np

# Each condition is a boolean Series; np.select uses the first one that is True.
conditions = [
    df['words'].str.contains('flyer', case=False, regex=True),
    df['words'].str.contains('report', case=False, regex=True),
    df['words'].str.contains('form', case=False, regex=True),
    df['words'].str.contains('scotia', case=False, regex=True),
    df['words'].str.contains('news', case=False, regex=True),
    df['words'].str.contains(r'questions.*\.pdf', case=False, regex=True),
    # ...
]
# choices[i] is the category assigned when conditions[i] matches.
choices = [
    'open house flyer',
    'report',
    'form',
    'report',
    'news',
    'question',
    # ...
]
df['category'] = np.select(conditions, choices, default='others')

This works fine, but I have a lot of keywords (probably over 120), so keeping these two parallel lists in sync is very difficult. Is there a better way to do this? I'm using Python 3.

Note: I'm looking for an easier way to manage a large list of keywords, which is different from simply a way to find keywords.

ikel
  • Does this answer your question? [How to test if a string contains one of the substrings in a list, in pandas?](https://stackoverflow.com/questions/26577516/how-to-test-if-a-string-contains-one-of-the-substrings-in-a-list-in-pandas) or [pandas dataframe str.contains() AND operation](https://stackoverflow.com/questions/37011734/pandas-dataframe-str-contains-and-operation?rq=1) – Trenton McKinney Nov 07 '19 at 04:43
  • No, that is only suitable for a small number of keywords; I'm looking for an easier method for a large list of keywords. – ikel Nov 08 '19 at 02:25

3 Answers

1

You could join all your keywords into a single regex and use str.findall (which returns every match, in case a line contains several keywords), then map the result through a dict of keyword to category:

import pandas as pd

df = pd.DataFrame({"words":["im a test email",
                            "here is my handout",
                            "This is a flyer"]})

choices = {"flyer":"open house flyer",
           "email":"email from someone",
           "handout":"some handout"}

df["category"] = df["words"].str.findall("|".join(choices.keys())).str.join(",").map(choices)

print(df)

# Output:
                words            category
0     im a test email  email from someone
1  here is my handout        some handout
2     This is a flyer    open house flyer
Henry Yik
  • Could this be done with a dictionary? I'm having trouble keeping the keywords and their corresponding categories in the same order when putting them together, because there are too many of them. – ikel Nov 07 '19 at 05:27
  • I will test it today and get back here, thank you @Henry Yik – ikel Nov 07 '19 at 13:24
  • Some words are embedded, for example "todayIgotAemailReport"; this does not return the "email" category. I guess this method does not apply regex. Is there any way to do it? – ikel Nov 08 '19 at 03:17
  • why doesn't it return email category? – Henry Yik Nov 08 '19 at 03:21
  • I guess todayIgotAemailReport is one word, which does not match the word 'email'. My original method has regex enabled, so this is not a problem there. – ikel Nov 08 '19 at 03:54
  • I take that you didn't even test it? [`str.findall`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.findall.html#pandas-series-str-findall) is equivalent to `re.findall`. – Henry Yik Nov 08 '19 at 03:57
  • I did test it on all my keywords, otherwise how could I know the result? I actually compared it with the result from my original method, and that's how I found it somehow doesn't do regex. Maybe something is wrong somewhere. – ikel Nov 08 '19 at 04:07
  • It is not possible to fail extracting `email` from `todayIgotAemailReport` using `str.findall`, unless your key or messages contains capital letters, which can be solved by adding `flags=re.IGNORECASE` just like regular regex. – Henry Yik Nov 08 '19 at 04:11
  • OK, I just tested your example with a small change: `df = pd.DataFrame({"words":["im a test email", "here is my handout", "This is a emailflyer"]})`, `choices = {"flyer":"open house flyer", "email":"email from someone", "handout":"some handout"}`, `df["category"] = df["words"].str.findall("|".join(choices.keys()),re.I).str.join(",").map(choices)` – ikel Nov 08 '19 at 04:14
  • The result is the following: "This is a emailflyer | NaN" – ikel Nov 08 '19 at 04:15
  • Because there is no correct mapping for `email,flyer` in your `dict`. – Henry Yik Nov 08 '19 at 04:18
  • Oh, I see what you meant now. Thanks for clarifying, answer marked. – ikel Nov 08 '19 at 04:25
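The NaN for "This is a emailflyer" discussed in the comments above comes from mapping the joined string "email,flyer", which is not a key in the dict. One possible variation, shown here only as a sketch, is to keep just the first keyword found in each row by using `str.extract` instead of `str.findall`. It assumes the keys are literal, lowercase keywords (hence `re.escape` and `.str.lower()`); regex-style keys such as `questions.*\.pdf` would not map back to the dict this way. The fourth sample row and the "others" fallback are added for illustration.

```python
import re
import pandas as pd

df = pd.DataFrame({"words": ["im a test email",
                             "here is my handout",
                             "This is a emailflyer",
                             "nothing relevant"]})

# keyword -> category, maintained in one place
choices = {"flyer": "open house flyer",
           "email": "email from someone",
           "handout": "some handout"}

# Single capture group containing every keyword as a literal alternative
pattern = "(" + "|".join(re.escape(k) for k in choices) + ")"

# str.extract returns the first (leftmost) match in each row, or NaN if none
first_hit = df["words"].str.extract(pattern, flags=re.IGNORECASE, expand=False)

# Normalize case so the matched text lines up with the lowercase dict keys
df["category"] = first_hit.str.lower().map(choices).fillna("others")

print(df)
```

Note that "first" here means leftmost in the text, not first in the dict, so this differs from the priority order that `np.select` applies in the question.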
1

You can use flashtext:

 import pandas as pd
 from flashtext import KeywordProcessor

 keyword_dict = {
 'programming': ['python', 'pandas','java','java_football'],
 'sport': ['cricket','football','baseball']
 } 

 kp = KeywordProcessor()
 kp.add_keywords_from_dict(keyword_dict)
 df = pd.DataFrame(['i love working in python','pandas is very popular library','i love playing football'],columns= ['text'])

 df['category'] = df['text'].apply(lambda x: kp.extract_keywords(x, span_info = True))


As for words like 'todayIgotAemailReport', you can refer to "How to split text without spaces into list of words?". I think it might help you split any kind of unknown joined word:

import wordninja
' '.join(wordninja.split('todayIgotAemailReport'))

# This breaks the string into its component words, which makes searching easier.
# Output:
'today I got A email Report' 
qaiser
  • I tried this method, but java_football does not get recognized. – ikel Nov 08 '19 at 02:14
  • One more thing: this method only finds the first match, right? – ikel Nov 08 '19 at 03:02
  • Also, I noticed that flashtext seems faster, but is there any way to do it with regex? Right now, as I tested, "todayIgotAemailReport" does not get recognized as the email category. – ikel Nov 08 '19 at 03:34
  • No, it will give you all matches. kp.extract_keywords(x) returns a list, and I had selected the item at index zero; that is why it was throwing an error when no keywords were found, because the list is empty. – qaiser Nov 08 '19 at 04:59
  • @ikel I have modified the code to include span_info = True, so that you can get the position of each word found. – qaiser Nov 08 '19 at 05:28
0

You could create the conditions list dynamically. If you have a list of keywords, say key_words, you could loop over it and append a condition like (df['words'].str.contains(key_word, False, regex=True)) to the list conditions for each key_word.
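A minimal sketch of that loop, assuming the keywords and categories are kept together in one dict so the two lists cannot drift out of sync. The name `rules`, the sample rows, and the particular keyword subset are illustrative, not taken from the question's full list.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"words": ["im a test email",
                             "here is my handout",
                             "weekly news report",
                             "questions_final.pdf",
                             "unrelated text"]})

# regex pattern -> category; dict order is the priority order (earlier wins)
rules = {
    "flyer": "open house flyer",
    "report": "report",
    "news": "news",
    r"questions.*\.pdf": "question",
    "email": "email",
    "handout": "handout",
}

# Build one boolean condition per pattern, in the same order as the dict
conditions = [df["words"].str.contains(pat, case=False, regex=True)
              for pat in rules]

# np.select assigns the category of the first condition that is True
df["category"] = np.select(conditions, list(rules.values()), default="others")

print(df)
```

Because Python dicts preserve insertion order, the conditions and choices stay aligned, and adding a keyword is a one-line change. The dict could also be loaded from a CSV or JSON file if that is easier to maintain.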

Anant Mittal
  • In that case, I still have to match the order of the "choices" list, and that list is supposed to be a category list. I'm hoping for some way to use a dict to replace the 'choices' and 'conditions' lists. – ikel Nov 07 '19 at 05:05