text pattern recognition in pandas dataframe

Question

I am trying to get Python to match a text pattern in a pandas DataFrame.

What I am doing is:

list = ['sarcasm','irony','humor']
pattern = '|'.join(list)
pattern2 = str("( " + pattern.strip().lstrip().rstrip() + " )").strip().lstrip().rstrip()

frame = pd.DataFrame(docs_list, columns=['words'])
# docs_list is the list containing the snippets

#Skipping the inbetween steps for the simplicity of viewing
cp2 = frame.words.str.extract(pattern2)
c2 = cp2.to_frame().fillna("No Matching Word Found")

Which gives an output like this

Snips                                     pattern_found    matching_Word
A different type of humor                    True             humor
A different type of sarcasm                  True             sarcasm 
A different type of humor and irony          True             humor
A different type of reason                   False            NA
A type of humor and sarcasm                  True             humor
A type of comedy                             False            NA

So, python checks for the pattern and gives the corresponding output.

Now, here is my problem. As I understand it, Python scans each snippet until it encounters a word from the pattern. As soon as it finds one, it returns that word and ignores any further matches in the snippet.

How do I make Python look for every matching word rather than just the first one, so that the output looks like this?

Snips                                     pattern_found    matching_Word
A different type of humor                    True             humor
A different type of sarcasm                  True             sarcasm 
A different type of humor and irony          True             humor
A different type of humor and irony          True             irony
A different type of reason                   False            NA
A type of humor and sarcasm                  True             humor
A type of humor and sarcasm                  True             sarcasm
A type of comedy                             False            NA

A simple solution would obviously be to put the words in a list and loop over it, checking for every word in every snippet. But time is a constraint, especially because the data set I am dealing with is huge and the snippets are fairly long.

Did you check [`extractall`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.str.extractall.html#pandas.Series.str.extractall)? — Wiktor Stribiżew, May 02 '17 at 07:12
BTW, do you know the space in your `pattern2` is meaningful? You need to remove the spaces from `"( "` and `" )"`. The way you define a regex, you might as well use `pattern = r'({})'.format('|'.join(list))`. However, since the alternation is not anchored, you need to sort the items by length descending. — Wiktor Stribiżew, May 02 '17 at 07:44
No, i had not tried extractall before. But now i did, combined with the other answer provided, and it works. — M PAUL, May 02 '17 at 10:19

jezrael · Accepted Answer · 2017-05-02T08:04:41.590

For me, `extractall` works, followed by `reset_index` to remove the inner level of the MultiIndex, and finally a `join` back to the original DataFrame.

L = ['sarcasm','irony','humo', 'humor', 'hum']
#sorting by http://stackoverflow.com/a/4659539/2901002
L.sort()
L.sort(key = len, reverse=True)
print (L)
['sarcasm', 'humor', 'irony', 'humo', 'hum']

pattern2 = r'(?P<COL>{})'.format('|'.join(L))
print (pattern2)
(?P<COL>sarcasm|humor|irony|humo|hum)

cp2 = frame.words.str.extractall(pattern2).reset_index(level=1, drop=True)
print (cp2)
       COL
0    humor
1  sarcasm
2    humor
2    irony
4    humor
4  sarcasm

frame = frame.join(cp2['COL']).reset_index(drop=True)
print (frame)
                                 words pattern_found matching_Word      COL
0            A different type of humor          True         humor    humor
1          A different type of sarcasm          True       sarcasm  sarcasm
2  A different type of humor and irony          True         humor    humor
3  A different type of humor and irony          True         humor    irony
4           A different type of reason         False           NaN      NaN
5          A type of humor and sarcasm          True         humor    humor
6          A type of humor and sarcasm          True         humor  sarcasm
7                     A type of comedy         False           NaN      NaN
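
The steps above can be put together into one self-contained script. This is only a sketch: `docs_list` here is made-up sample data standing in for the snippets from the question, and the `pattern_found` column is rebuilt from `COL` as an assumption about how it was originally computed.

```python
import pandas as pd

# Hypothetical sample data standing in for docs_list from the question.
docs_list = [
    "A different type of humor",
    "A different type of sarcasm",
    "A different type of humor and irony",
    "A different type of reason",
    "A type of humor and sarcasm",
    "A type of comedy",
]
frame = pd.DataFrame(docs_list, columns=["words"])

words = ["sarcasm", "irony", "humo", "humor", "hum"]
# Longest alternatives first, so "humor" is tried before "humo" and "hum".
words = sorted(words, key=len, reverse=True)
pattern = r"(?P<COL>{})".format("|".join(words))

# extractall returns one row per match, indexed by (original index, match number).
# Dropping the match level lets the result align with frame's index.
matches = frame["words"].str.extractall(pattern).reset_index(level=1, drop=True)

# The left join keeps snippets with no match (COL is NaN there)
# and repeats snippets that contain several matches.
result = frame.join(matches["COL"]).reset_index(drop=True)
result["pattern_found"] = result["COL"].notna()
print(result)
```

Because the loop over snippets happens inside `extractall`, there is no explicit Python `for` loop over words and snippets.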

And if your input contains `L = ['sarcasm','irony','humo', 'humor', 'hum']`? It won't work any longer. — Wiktor Stribiżew, May 02 '17 at 07:49
@WiktorStribiżew - You are right, unfortunately. I am not regex expert, so now I have NO solution for it. — jezrael, May 02 '17 at 07:55
Well, I already shared what needs to be done in my second comment to the question. Just sort the `L` list by length in descending order, then join with `|`. — Wiktor Stribiżew, May 02 '17 at 07:57
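
To illustrate the point made in these comments, here is a small sketch using the standard `re` module with a made-up word list. An unanchored alternation tries its alternatives left to right, so a shorter word listed first wins over a longer word that starts with it.

```python
import re

# Hypothetical word list where each item is a prefix of the next.
words = ["hum", "humo", "humor"]

unsorted_pattern = "|".join(words)
# Sorting by length, longest first, lets the longest alternative match first.
sorted_pattern = "|".join(sorted(words, key=len, reverse=True))

text = "A different type of humor"
first_unsorted = re.search(unsorted_pattern, text).group()
first_sorted = re.search(sorted_pattern, text).group()

print(first_unsorted)  # hum
print(first_sorted)    # humor
```

If the words may contain regex metacharacters, passing each one through `re.escape` before joining is a reasonable extra precaution.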
