I have to perform Stemming on a text. The questions are as follows :
- Tokenize all the words given in
tc
. The word should contain alphabets or numbers or underscore. Store the tokenized list of words intw
- Convert all the words into lowercase. Store the result into the variable
tw
- Remove all the stop words from the unique set of
tw
. Store the result into the variablefw
- Stem each word present in
fw
with PorterStemmer, and store the result in the listpsw
Below is my code :
import re
import nltk
from nltk import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer,LancasterStemmer
pattern = r'\w+';
tw= nltk.regexp_tokenize(tc,pattern);
tw= [word.lower() for word in tw];
stop_word = set(stopwords.words('english'));
fw= [w for w in tw if not w in stop_word];
#print(sorted(filteredwords));
porter = PorterStemmer();
psw = [porter.stem(word) for word in fw];
print(sorted(psw));
My code works perfectly with all the provided testcases in hand-on but it fails only for the below test case where
tc = "I inadvertently went to See's Candy last week (I was in the mall looking for phone repair), and as it turns out, See's Candy now charges a dollar -- a full dollar -- for even the simplest of their wee confection offerings. I bought two chocolate lollipops and two chocolate-caramel-almond things. The total cost was four-something. I mean, the candies were tasty and all, but let's be real: A Snickers bar is fifty cents. After this dollar-per-candy revelation, I may not find myself wandering dreamily back into a See's Candy any time soon."
My Output is :
['almond', 'back', 'bar', 'bought', 'candi', 'candi', 'caramel', 'cent', 'charg', 'chocol', 'confect', 'cost', 'dollar', 'dreamili', 'even', 'fifti', 'find', 'four', 'full', 'inadvert', 'last', 'let', 'lollipop', 'look', 'mall', 'may', 'mean', 'offer', 'per', 'phone', 'real', 'repair', 'revel', 'see', 'simplest', 'snicker', 'someth', 'soon', 'tasti', 'thing', 'time', 'total', 'turn', 'two', 'wander', 'wee', 'week', 'went']
Expected Output is :
['almond', 'back', 'bar', 'bought', 'candi', 'candi', 'candi', 'caramel', 'cent', 'charg', 'chocol', 'confect', 'cost', 'dollar', 'dreamili', 'even', 'fifti', 'find', 'four', 'full', 'inadvert', 'last', 'let', 'lollipop', 'look', 'mall', 'may', 'mean', 'offer', 'per', 'phone', 'real', 'repair', 'revel', 'see', 'simplest', 'snicker', 'someth', 'soon', 'tasti', 'thing', 'time', 'total', 'turn', 'two', 'wander', 'wee', 'week', 'went']
The difference is the occurrence of 'Candi'
Looking help to troubleshoot the issue.