Bigram Finder for Pandas Dataframe

Question

I have a list of bigrams and a pandas dataframe containing a row for each document in my corpus. I want to put the bigrams from my list that match each document into a new column in my dataframe. What is the best way to accomplish this? I have been searching Stack Overflow but haven't found an answer specific to this. I need the new column to contain every bigram found from my bigram list.

Any help would be appreciated!

The output below is what I am looking for, although on my real data I have used stop words, so exact bigrams aren't found as they are in the output below. Is there a way to do this with some sort of string contains, maybe?

import pandas as pd 
data = [['help me with my python pandas please'], ['machine learning is fun using svd with sklearn']] 
# Create the pandas DataFrame 
df = pd.DataFrame(data, columns = ['Message']) 
import numpy as np
bigrams =[('python', 'pandas'),
 ('function', 'input'),
 ('help', 'jupyter'),
 ('sklearn', 'svd')]
def matcher(x):
    for i in bigrams:
        if i.lower() in x.lower():
            return i
    else:
        return np.nan

df['Match'] = df['Message'].apply(matcher)
df

while this is an interesting problem, you should still include some sample data and expected output for that data. — Quang Hoang, Jun 14 '19 at 18:43
This is a sample code, I was asking for a sample data though. — Quang Hoang, Jun 14 '19 at 19:39
the only data I am working for this goal is the bigram list and my dataframe with a sentence on each line I want to iterate through. In this code it would be df['documents']. Ex: Each row contains a document with a sentence like, "help me with my python," or, "machine learning is fun." Does that make sense? — codingInMyBasement, Jun 14 '19 at 19:45
Certainly, even those two sentences would work. And you should specify what you want out of that sample data. — Quang Hoang, Jun 14 '19 at 19:46
The code in this question does not run for me, it gives: AttributeError: 'tuple' object has no attribute 'lower'. — zabop, Aug 20 '20 at 13:55
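
The AttributeError reported above comes from calling .lower() on a tuple. A minimal sketch of a corrected matcher follows, assuming each bigram should match as its two words joined by a single space, and returning every match rather than only the first. Note that this is plain substring matching, so it neither skips stop words nor respects word boundaries.

```python
import pandas as pd

df = pd.DataFrame(
    {"Message": ["help me with my python pandas please",
                 "machine learning is fun using svd with sklearn"]}
)
bigrams = [("python", "pandas"), ("function", "input"),
           ("help", "jupyter"), ("sklearn", "svd")]

def matcher(text):
    """Return every bigram whose words appear side by side in text."""
    text = text.lower()
    # Join each tuple into a phrase before calling string methods on it.
    return [bg for bg in bigrams if " ".join(bg).lower() in text]

df["Match"] = df["Message"].apply(matcher)
```

With this data only the first row gets a match, because "svd with sklearn" does not contain the phrase "sklearn svd".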

Quang Hoang · Answer 1 · 2020-08-20T14:07:41.253

3

This is what I would do:

import pandas as pd

# a sample, which you should've given
df = pd.DataFrame({'sentences': ['I like python pandas', 
                                 'find all function input from help jupyter',
                                 'this has no bigrams']})


# the bigrams
bigrams = [('python', 'pandas'),
 ('function', 'input'),
 ('help', 'jupyter'),
 ('sklearn', 'svd')]

# create one big regex pattern:
pat = '|'.join(" ".join(x) for x in bigrams)

new_df = df.sentences.str.findall(pat)

gives you

0                   [python pandas]
1    [function input, help jupyter]
2                                []
Name: sentences, dtype: object

Then you can choose to unnest the list in each row.
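
A minimal sketch of the unnesting step, assuming pandas 0.25 or later so that Series.explode is available. The name unnested is only illustrative.

```python
import pandas as pd

df = pd.DataFrame({"sentences": ["I like python pandas",
                                 "find all function input from help jupyter",
                                 "this has no bigrams"]})
bigrams = [("python", "pandas"), ("function", "input"),
           ("help", "jupyter"), ("sklearn", "svd")]

pat = "|".join(" ".join(x) for x in bigrams)
new_df = df.sentences.str.findall(pat)

# One row per matched bigram; the original row index is repeated.
# Rows whose list is empty become NaN, so they are dropped here.
unnested = new_df.explode().dropna()
```

The repeated index makes it possible to join the result back to the original dataframe.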

Or you can use get_dummies:

new_df.str.join(',').str.get_dummies(sep=',')

which gives you:

   function input  help jupyter  python pandas
0               0             0              1
1               1             1              0
2               0             0              0

edited Aug 20 '20 at 14:07

answered Jun 14 '19 at 19:56

Quang Hoang


because of the stop words I included this won't work on my original data. I need something that searches for each bigram individually. More like a str.contains, but for each bigram, and then gives me back each bigram that is contained. – codingInMyBasement Jun 14 '19 at 20:09
Yeah, this might not work if your bigrams overlap, like `(a,b)` and `(b,c)`. – Quang Hoang Jun 14 '19 at 20:11
Any other ideas? – codingInMyBasement Jun 14 '19 at 20:14
Actually I can make a duplicate column and then add the stop words to that one, then run your code over that column instead maybe! Going to give it a try. – codingInMyBasement Jun 14 '19 at 20:32
What is new_df? – zabop Aug 20 '20 at 13:54
@zabop The edit messed up my post somehow. Updated, hopefully it's clearer. – Quang Hoang Aug 20 '20 at 14:08
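
The overlap caveat in the comments arises because findall returns non-overlapping matches. One possible workaround, sketched here as an assumption rather than as part of the answer, is to test each bigram on its own with a word-bounded pattern. The names patterns and find_bigrams are hypothetical.

```python
import re
import pandas as pd

df = pd.DataFrame({"sentences": ["a b c", "b c d", "x y"]})
bigrams = [("a", "b"), ("b", "c")]

# One compiled pattern per bigram, so a match for one bigram
# cannot consume text needed by another.
patterns = {
    bg: re.compile(r"\b" + r"\s+".join(map(re.escape, bg)) + r"\b")
    for bg in bigrams
}

def find_bigrams(text):
    """Return every bigram whose pattern occurs somewhere in text."""
    return [bg for bg, p in patterns.items() if p.search(text)]

matches = df["sentences"].apply(find_bigrams)
```

For "a b c" the single combined pattern finds only "a b", while the per-bigram search reports both ("a", "b") and ("b", "c").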

score 1 · Answer 2 · answered Jun 15 '19 at 10:51

Well, here's my solution featuring bigram terms detection in cleaned utterances (sentences).

It can easily be generalized to n-grams as well. It also takes into account stop words.

You can tune:

target_depth (default 2 for bigrams), if you want to look for other types of n-grams.
the default separator (space) used to tokenize the words in a sentence.
your set of stop_words (using nltk's common English stop words here).

Please note that this implementation is recursive.

import pandas as pd 
import re
from nltk.corpus import stopwords  # needs a one-time nltk.download('stopwords')

data = [
    ['help me with my python pandas please'],
    ['machine learning is fun using svd with sklearn'],
    ['please use |svd| with sklearn, get help on JupyteR!']
]
# Create the pandas DataFrame 
df = pd.DataFrame(data, columns = ['Message']) 

bigrams =[
    ('python', 'pandas'),
    ('function', 'input'),
    ('help', 'jupyter'),
    ('svd', 'sklearn')
]

stop_words = set(stopwords.words('english'))
sep = ' '

def _cleanup_token(w):
    """ Cleanup a token by stripping special chars """
    return re.sub('[^A-Za-z0-9]+', '', w)

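def _preprocessed_tokens(x):
    """ Lowercase a sentence, split it on sep and clean up each token. """
    return [_cleanup_token(w) for w in x.lower().split(sep)]

def _match_bg_term_in_sentence(bg, x, depth, target_depth=2):
    """ Recursively match bg[depth:] against sentence x; after the first
    term, only stop words may precede the next term. """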
    if depth == target_depth:
        return True # the whole bigram was matched

    term = bg[depth]
    term = term.lower()
    pp_tokens = _preprocessed_tokens(x)

    if term in pp_tokens:
        bg_idx = pp_tokens.index(term)
        if depth > 0 and any([token not in stop_words for token in pp_tokens[0:bg_idx]]):
            return False # no bigram detected
        x = sep.join(pp_tokens[bg_idx+1:])
        return _match_bg_term_in_sentence(bg, x, depth+1, target_depth=target_depth)
    else:
        return False

def matcher(x):
    """ Return list of bigrams matched in sentence x """
    depth = 0 # current depth
    matches = []
    for bg in bigrams:
        if _match_bg_term_in_sentence(bg, x, depth, target_depth=2):
            matches.append(bg)
    return matches

df['Match'] = df['Message'].apply(matcher)
print(df.head())

We actually obtain these results:

                               Match  
0                 [(python, pandas)]  
1                   [(svd, sklearn)]  
2  [(help, jupyter), (svd, sklearn)]

Hope this helps!
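
For comparison, a non-recursive variant of the same idea can be sketched by dropping stop words first and then looking for the n-gram as a contiguous run of tokens. This is an assumption-laden sketch, not the answer's method: STOP_WORDS is a tiny hypothetical stand-in for nltk's English list, the function names are illustrative, and unlike the recursive version it checks every occurrence of a term rather than only the first.

```python
import re

# Hypothetical stand-in for nltk's English stop-word list.
STOP_WORDS = {"me", "my", "with", "is", "on"}

def tokens_without_stops(text):
    """Lowercase, strip special characters, and drop stop words."""
    words = (re.sub(r"[^a-z0-9]+", "", w) for w in text.lower().split())
    return [w for w in words if w and w not in STOP_WORDS]

def find_ngrams(text, ngrams):
    """Return the n-grams that occur as adjacent tokens once stop words are removed."""
    toks = tokens_without_stops(text)
    found = []
    for ng in ngrams:
        n = len(ng)
        target = [t.lower() for t in ng]
        # Slide a window of length n over the filtered tokens.
        if any(toks[i:i + n] == target for i in range(len(toks) - n + 1)):
            found.append(ng)
    return found

bigrams = [("python", "pandas"), ("function", "input"),
           ("help", "jupyter"), ("svd", "sklearn")]
messages = ["help me with my python pandas please",
            "machine learning is fun using svd with sklearn",
            "please use |svd| with sklearn, get help on JupyteR!"]
results = [find_ngrams(m, bigrams) for m in messages]
```

Because the window length comes from len(ng), the same function handles trigrams or longer n-grams without a depth parameter.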

score 1 · Answer 3 · answered Jun 15 '19 at 12:16

flashtext can also be used to solve this problem:

import pandas as pd
from flashtext import KeywordProcessor
from nltk.corpus import stopwords

stop = stopwords.words('english')
bigram_token = ['python pandas','function input', 'help jupyter','svd sklearn']

data = [['help me with my python pandas please'],
        ['machine learning is fun using svd with sklearn']]

# Create the pandas DataFrame 
df = pd.DataFrame(data, columns = ['Message']) 

kp = KeywordProcessor()
kp.add_keywords_from_list(bigram_token)

def bigram_finder(x, stop, kp):
    # drop stop words, then let flashtext look for the bigram phrases
    tokens = x.split()
    sent = ' '.join([t for t in tokens if t not in stop])
    return kp.extract_keywords(sent)

df['bigram_token'] = df['Message'].apply(lambda x : bigram_finder(x, stop, kp))
# output of df['bigram_token']:
# 0    [python pandas]
# 1      [svd sklearn]
# Name: bigram_token, dtype: object
