How can I look for specific bigrams in text example - python?

Question

I am interested in finding how often (as a percentage) a set of words, given in n_grams, appears in a sentence.

import re

example_txt = ["order intake is strong for Q4"]

def find_ngrams(text):
    text = re.findall('[A-z]+', text)
    content = [w for w in text if w.lower() in n_grams] # you can calculate %stopwords using "in"
    return round(float(len(content)) / float(len(text)), 5)

# The goal is for the above procedure to work on a pandas DataFrame, but for now let's use 'text' as an example.
#full_MD['n_grams'] = [find_ngrams(x) for x in list(full_MD.loc[:,'text_no_stopwords'])]

Below you see two examples. The first one works, the last doesn't.

n_grams= ['order']
res = [find_ngrams(x) for x in list(example_txt)]
print(res)
Output:
[0.16667]

n_grams= ['order intake']
res = [find_ngrams(x) for x in list(example_txt)]
print(res)
Output:
[0.0]

How can I make the find_ngrams() function process bigrams, so the last example from above works?

Edit: Any other ideas?

That is the word (or words) I am interested in: I want the percentage of how often it is mentioned in example_txt. – doomdaam Apr 08 '20 at 10:10

score 2 · Accepted Answer · answered Apr 10 '20 at 14:50

You can use SpaCy Matcher:

import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)
# Add match ID "orderintake" with one pattern and no callback
pattern = [{"LOWER": "order"}, {"LOWER": "intake"}]
# spaCy 3.x takes a list of patterns; spaCy 2.x used matcher.add("orderintake", None, pattern)
matcher.add("orderintake", [pattern])

doc = nlp("order intake is strong for Q4")
matches = matcher(doc)
print(len(matches))  # number of times the bigram appears in the text

Matheus Torquato · Answer 2 · 2020-04-08T10:26:52.123

0

The line

re.findall('[A-z]+', text)

returns

['order', 'intake', 'is', 'strong', 'for', 'Q'].

For this reason, the string 'order intake' will never match in your list comprehension here, because each w is a single word:

content = [w for w in text if w.lower() in n_grams]

If you want it to match, you'll need to build a single string from each bigram, i.e. join each pair of adjacent words.

Instead, you should probably use this to find Bigrams.

For N-grams, have a look at this answer.
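Joining adjacent tokens into one string per n-gram could be sketched roughly as below. This is only an illustration under assumptions: the names tokenize and ngram_share are made up here, tokens are taken to be runs of ASCII letters or digits, and the percentage is defined as matching n-grams divided by all n-grams of that length in the sentence (other definitions are possible).

```python
import re

def tokenize(text):
    # Assumption: a token is a run of ASCII letters or digits, lower-cased.
    # [A-Za-z0-9] is used instead of [A-z], which also matches [ \ ] ^ _ and `.
    return [t.lower() for t in re.findall(r"[A-Za-z0-9]+", text)]

def ngram_share(text, targets, n):
    """Share of the n-grams in `text` that appear in `targets`."""
    tokens = tokenize(text)
    # Join every run of n adjacent tokens into one string, e.g. "order intake".
    grams = [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not grams:
        return 0.0
    # Normalise the targets with the same tokenizer so both sides compare equal.
    wanted = {" ".join(tokenize(t)) for t in targets}
    hits = sum(1 for g in grams if g in wanted)
    return round(hits / len(grams), 5)

example_txt = ["order intake is strong for Q4"]
print([ngram_share(x, ["order"], n=1) for x in example_txt])         # [0.16667]
print([ngram_share(x, ["order intake"], n=2) for x in example_txt])  # [0.2]
```

With n=2 the sentence has five bigrams, one of which is "order intake", hence 0.2. If you would rather divide by the number of words, only the denominator needs to change.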

edited Apr 08 '20 at 10:26

answered Apr 08 '20 at 10:21


score 0 · Answer 3 · answered Apr 08 '20 at 10:22


Maybe you have already explored this option, but why not use a simple .count combined with len:

(example_txt[0].count(n_grams[0]) * len(n_grams[0])) / len(example_txt[0])

Or, if you do not want spaces to count towards the total length, you can use the following:

(example_txt[0].count(n_grams[0])* len(n_grams[0])) / len(example_txt[0].replace(' ',''))

Of course, you can use either expression in a list comprehension; this was just for demonstration purposes.


emiljoj


Interesting approach. I guess it has to be modified to look for whole sentences, as right now it looks at characters. – doomdaam Apr 08 '20 at 11:08
Right now it's finding substrings within a given string. The [0] indexes are only there because the sentence and string examples you gave are placed in a list. – emiljoj Apr 08 '20 at 12:45
I understand. The result of your example code is 50%, which only makes sense if we count characters. How can we make it look for words? – doomdaam Apr 08 '20 at 12:54
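One way to count words rather than characters, while keeping the substring idea, might be a regular expression with word boundaries. The sketch below is an assumption-laden illustration: count_phrase and phrase_word_share are hypothetical names, words are whatever str.split() returns, and the phrase is assumed to start and end with a letter or digit so that \b behaves as expected.

```python
import re

def count_phrase(text, phrase):
    """Count whole-word, case-insensitive occurrences of `phrase` in `text`."""
    # The \b anchors stop 'order' from matching inside 'reorder' or 'orders'.
    pattern = r"\b" + re.escape(phrase) + r"\b"
    # re.findall returns non-overlapping matches only.
    return len(re.findall(pattern, text, flags=re.IGNORECASE))

def phrase_word_share(text, phrase):
    """Words belonging to matches of `phrase`, divided by all words in `text`."""
    total_words = len(text.split())
    if total_words == 0:
        return 0.0
    matched_words = count_phrase(text, phrase) * len(phrase.split())
    return matched_words / total_words

example_txt = ["order intake is strong for Q4"]
print(count_phrase(example_txt[0], "order intake"))                  # 1
print(round(phrase_word_share(example_txt[0], "order intake"), 5))   # 0.33333
```

Here the two words of "order intake" out of six words in the sentence give 0.33333, instead of the 50% that the character-based version reports.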
