Best way to understand the input text before applying ngram

Question

Currently I am reading text from an Excel file and applying bigrams to it. finalList, used in the sample code below, holds the list of input words read from the Excel file.

I removed the stopwords from the input with the help of the following library:

from nltk.corpus import stopwords

Bigram logic applied to the list of input words:

from nltk.util import ngrams

bigram = ngrams(finalList, 2)

Input text: I completed my end-to-end process.

Current output: completed end, end end, end process.

Desired output: completed end-to-end, end-to-end process.

That is, some groups of words, such as end-to-end, should be treated as a single word.

Use a proper tokenizer: http://nlp.cogcomp.org/ – Daniel, Oct 09 '17 at 15:57
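
As a rough illustration of what a hyphen-aware tokenizer does, here is a minimal sketch that uses only Python's standard `re` module rather than the toolkit linked in the comment. The helper name `tokenize_keep_hyphens` and the pattern are assumptions made for illustration; a real tokenizer covers many more cases (apostrophes, numbers, URLs).

```python
import re

# Hypothetical helper: a token is a run of word characters, optionally
# followed by further runs joined by single hyphens (e.g. "end-to-end").
TOKEN_PATTERN = re.compile(r"\w+(?:-\w+)*")

def tokenize_keep_hyphens(text):
    """Return word tokens, keeping hyphenated compounds as one token."""
    return TOKEN_PATTERN.findall(text)

print(tokenize_keep_hyphens("I completed my end-to-end process."))
# ['I', 'completed', 'my', 'end-to-end', 'process']
```

Because the compound survives tokenization as one token, stopword removal never sees the "to" inside it, so it cannot be split into "end end".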

score 1 · Accepted Answer · answered Oct 12 '17 at 22:43

To solve your problem, clean the unwanted punctuation with a regex and leave the hyphen in place, so that end-to-end stays a single token. See this example:

 import re
 text = 'I completed my end-to-end process..:?'
 # Remove runs of '.', ':' and '?'. The hyphen is deliberately left out of the character class.
 pattern = re.compile(r"[.:?]+")
 new_text = pattern.sub('', text)
 print(new_text)
 # I completed my end-to-end process


 # Now you can generate bigrams manually.
 # 1. Tokenize the new text
 tok = new_text.split()
 print(tok)  # If the token list is huge, print only the first five, e.g. print(tok[:5])
 # ['I', 'completed', 'my', 'end-to-end', 'process']

 # 2. Loop over the list and generate bigrams, store them in a var called bigrams
 bigrams = []
 for i in range(len(tok) - 1):  # -1 to avoid index error
     bigram = tok[i] + ' ' + tok[i + 1]  
     bigrams.append(bigram)


 # 3. Print your bigrams
 for bi in bigrams:
     print(bi, end = ', ')

I completed, completed my, my end-to-end, end-to-end process,
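
The output above still contains the stopwords "I" and "my", whereas the desired output in the question does not. One possible way to close that gap, sketched below, is to drop stopwords after tokenizing and then pair each token with its neighbour using `zip`. The small `STOPWORDS` set is a stand-in assumption for nltk's `stopwords.words('english')`, and the helper name `bigrams_without_stopwords` is hypothetical.

```python
# Stand-in for nltk's English stopword list (assumption: a tiny subset
# is enough for this example; swap in the real list in practice).
STOPWORDS = {"i", "my", "to", "the", "a", "an"}

def bigrams_without_stopwords(text):
    """Split on whitespace, strip edge punctuation, drop stopwords, pair neighbours."""
    tokens = [word.strip(".,:;?!") for word in text.split()]
    kept = [t for t in tokens if t and t.lower() not in STOPWORDS]
    # zip(kept, kept[1:]) yields each token together with its right-hand neighbour.
    return [f"{left} {right}" for left, right in zip(kept, kept[1:])]

print(bigrams_without_stopwords("I completed my end-to-end process..:?"))
# ['completed end-to-end', 'end-to-end process']
```

Splitting on whitespace means "end-to-end" is never broken up, so the "to" inside it is not compared against the stopword list.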

I hope this helps!
