A better question would have been: what are those characters ("()", "'", ",") in the ngrams output?
>>> from nltk import ngrams
>>> from nltk import word_tokenize
# Split a sentence into a list of "words"
>>> word_tokenize("This is a foo bar sentence")
['This', 'is', 'a', 'foo', 'bar', 'sentence']
>>> type(word_tokenize("This is a foo bar sentence"))
<class 'list'>
# Extract bigrams.
>>> list(ngrams(word_tokenize("This is a foo bar sentence"), 2))
[('This', 'is'), ('is', 'a'), ('a', 'foo'), ('foo', 'bar'), ('bar', 'sentence')]
# Okay, so the output is a list, no surprise.
>>> type(list(ngrams(word_tokenize("This is a foo bar sentence"), 2)))
<class 'list'>
But what type is ('This', 'is')?
>>> list(ngrams(word_tokenize("This is a foo bar sentence"), 2))[0]
('This', 'is')
>>> first_thing_in_output = list(ngrams(word_tokenize("This is a foo bar sentence"), 2))[0]
>>> type(first_thing_in_output)
<class 'tuple'>
Ah, it's a tuple, see https://realpython.com/python-lists-tuples/
What happens when you print a tuple?
>>> print(first_thing_in_output)
('This', 'is')
What happens if you convert it into a str()?
>>> print(str(first_thing_in_output))
('This', 'is')
But I want the output This is instead of ('This', 'is'), so I will use the str.join() function, see https://www.geeksforgeeks.org/join-function-python/:
>>> print(' '.join((first_thing_in_output)))
This is
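The same join can be applied to every bigram at once. Below is a minimal sketch that hard-codes the bigram list from the earlier output so it runs without NLTK; the variable names are only for illustration.

```python
# Bigrams copied from the ngrams() output shown earlier.
bigrams = [('This', 'is'), ('is', 'a'), ('a', 'foo'), ('foo', 'bar'), ('bar', 'sentence')]

# str.join() accepts any iterable of strings, so each tuple works directly.
bigram_strings = [' '.join(bigram) for bigram in bigrams]

print(bigram_strings)
```

Each tuple becomes one plain string, which is usually what you want before printing or writing to a file.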
Now this is a good point to really go through the tutorial of basic Python types to understand what is happening. Additionally, it'll be good to understand how "container" types work too, e.g. https://github.com/usaarhat/pywarmups/blob/master/session2.md
Going through the original post, there are quite a few issues with the code.
I guess the goal of the code is to:
- Tokenize the text and remove stopwords
- Extract ngrams (without stopwords)
- Print out their string forms and their counts
The tricky part is that stopwords.words('english') does not contain punctuation, so you'll end up with strange ngrams that contain punctuation:
from nltk import word_tokenize
from nltk.util import ngrams
from nltk.corpus import stopwords
text = '''The pure amnesia of her face,
newborn. I looked so far into her that, for a while, looked so far into her that, for a while looked so far into her that, for a while looked so far into her that, for a while the visual
held no memory. Little by little, I returned to myself, waking to nurse the visual held no memory. Little by little, I returned to myself, waking to nurse
'''
stoplist = set(stopwords.words('english'))
tokens = [token for token in word_tokenize(text) if token not in stoplist]
list(ngrams(tokens, 2))
[out]:
[('The', 'pure'),
('pure', 'amnesia'),
('amnesia', 'face'),
('face', ','),
(',', 'newborn'),
('newborn', '.'),
('.', 'I'),
('I', 'looked'),
('looked', 'far'),
('far', ','),
(',', ','), ...]
Perhaps you would like to extend the stoplist with punctuation, e.g.
from string import punctuation
from nltk import word_tokenize
from nltk.util import ngrams
from nltk.corpus import stopwords
text = '''The pure amnesia of her face,
newborn. I looked so far into her that, for a while, looked so far into her that, for a while looked so far into her that, for a while looked so far into her that, for a while the visual
held no memory. Little by little, I returned to myself, waking to nurse the visual held no memory. Little by little, I returned to myself, waking to nurse
'''
stoplist = set(stopwords.words('english') + list(punctuation))
tokens = [token for token in word_tokenize(text) if token not in stoplist]
list(ngrams(tokens, 2))
[out]:
[('The', 'pure'),
('pure', 'amnesia'),
('amnesia', 'face'),
('face', 'newborn'),
('newborn', 'I'),
('I', 'looked'),
('looked', 'far'),
('far', 'looked'),
('looked', 'far'), ...]
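One caveat: list(punctuation) only contains single characters, so a token made of several punctuation characters would not match the stoplist. If your tokenizer emits such tokens, a check like the one sketched below may help. The helper name is_punctuation and the sample tokens are hypothetical, written by hand for illustration.

```python
from string import punctuation


def is_punctuation(token):
    """Return True if the token is non-empty and made only of ASCII punctuation."""
    return bool(token) and all(ch in punctuation for ch in token)


# Hand-written sample tokens (an assumption, not real tokenizer output).
sample_tokens = ['face', ',', 'newborn', '.', '``', '...', "n't"]

# Keep only tokens that contain at least one non-punctuation character.
kept = [tok for tok in sample_tokens if not is_punctuation(tok)]

print(kept)
```

Tokens such as n't survive because they contain a letter; whether that is desirable depends on your task.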
Then you realize that tokens like I should be stopwords but still exist in your list of ngrams. That's because the entries in stopwords.words('english') are lowercased, e.g.
>>> stopwords.words('english')
[out]:
['i',
'me',
'my',
'myself',
'we',
'our',
'ours',
'ourselves',
'you',
"you're", ...]
So when you're checking whether a token is in the stoplist, you should also lowercase the token. (Avoid lowercasing the sentence before word_tokenize because word_tokenize may take cues from capitalization.) Thus:
from string import punctuation
from nltk import word_tokenize
from nltk.util import ngrams
from nltk.corpus import stopwords
text = '''The pure amnesia of her face,
newborn. I looked so far into her that, for a while, looked so far into her that, for a while looked so far into her that, for a while looked so far into her that, for a while the visual
held no memory. Little by little, I returned to myself, waking to nurse the visual held no memory. Little by little, I returned to myself, waking to nurse
'''
stoplist = set(stopwords.words('english') + list(punctuation))
tokens = [token for token in word_tokenize(text) if token.lower() not in stoplist]
list(ngrams(tokens, 2))
[out]:
[('pure', 'amnesia'),
('amnesia', 'face'),
('face', 'newborn'),
('newborn', 'looked'),
('looked', 'far'),
('far', 'looked'),
('looked', 'far'),
('far', 'looked'),
('looked', 'far'),
('far', 'looked'), ...]
Now the ngrams look like they're achieving the objectives:
- Tokenize the text and remove stopwords
- Extract ngrams (without stopwords)
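To see the effect of token.lower() in isolation, here is a small sketch that runs without NLTK. The mini_stoplist and the hand-tokenized fragment are assumptions for illustration, not NLTK's actual stopword list or tokenizer output.

```python
from string import punctuation

# Tiny hand-written stoplist for illustration only; NOT NLTK's full list.
mini_stoplist = {'i', 'the', 'of', 'her', 'so', 'into'} | set(punctuation)

# Hand-tokenized fragment of the example text (an assumption).
tokens = ['The', 'pure', 'amnesia', 'of', 'her', 'face', ',', 'newborn', '.',
          'I', 'looked', 'so', 'far', 'into', 'her']

# Case-sensitive check: capitalized stopwords such as 'The' and 'I' slip through.
case_sensitive = [t for t in tokens if t not in mini_stoplist]

# Lowercasing only for the membership test keeps the original casing of kept tokens.
case_insensitive = [t for t in tokens if t.lower() not in mini_stoplist]

print(case_sensitive)
print(case_insensitive)
```

Note that only the comparison is lowercased; the tokens that remain keep their original capitalization.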
Then for the last part, where you want to print out the ngrams to a file in sorted order, you could use FreqDist.most_common(), which lists them in descending order of count, e.g.
from string import punctuation
from nltk import word_tokenize
from nltk.util import ngrams
from nltk.corpus import stopwords
from nltk import FreqDist
text = '''The pure amnesia of her face,
newborn. I looked so far into her that, for a while, looked so far into her that, for a while looked so far into her that, for a while looked so far into her that, for a while the visual
held no memory. Little by little, I returned to myself, waking to nurse the visual held no memory. Little by little, I returned to myself, waking to nurse
'''
stoplist = set(stopwords.words('english') + list(punctuation))
tokens = [token for token in word_tokenize(text) if token.lower() not in stoplist]
FreqDist(ngrams(tokens, 2)).most_common()
[out]:
[(('looked', 'far'), 4),
(('far', 'looked'), 3),
(('visual', 'held'), 2),
(('held', 'memory'), 2),
(('memory', 'Little'), 2),
(('Little', 'little'), 2),
(('little', 'returned'), 2),
(('returned', 'waking'), 2),
(('waking', 'nurse'), 2),
(('pure', 'amnesia'), 1),
(('amnesia', 'face'), 1),
(('face', 'newborn'), 1),
(('newborn', 'looked'), 1),
(('far', 'visual'), 1),
(('nurse', 'visual'), 1)]
(See also: Difference between Python's collections.Counter and nltk.probability.FreqDist)
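As far as I know, FreqDist is built on top of collections.Counter, so the same counting can be sketched with the standard library alone. The token list below is a short hand-written fragment (an assumption), and zip(tokens, tokens[1:]) stands in for ngrams(tokens, 2).

```python
from collections import Counter

# Already-filtered tokens, hand-written for illustration.
tokens = ['looked', 'far', 'looked', 'far', 'looked', 'far', 'visual']

# Pair each token with its successor to get the bigrams.
bigram_counts = Counter(zip(tokens, tokens[1:]))

# most_common() returns (item, count) pairs in descending order of count.
print(bigram_counts.most_common())
```

FreqDist adds NLP-oriented extras on top of this, so which one to use is mostly a matter of what else you need.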
Finally, when printing it out to a file, you should really use a context manager, see http://eigenhombre.com/introduction-to-context-managers-in-python.html
with open('bigrams-list.tsv', 'w') as fout:
    for bg, count in FreqDist(ngrams(tokens, 2)).most_common():
        print('\t'.join([' '.join(bg), str(count)]), end='\n', file=fout)
[bigrams-list.tsv]:
looked far 4
far looked 3
visual held 2
held memory 2
memory Little 2
Little little 2
little returned 2
returned waking 2
waking nurse 2
pure amnesia 1
amnesia face 1
face newborn 1
newborn looked 1
far visual 1
nurse visual 1
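An alternative to building each line by hand is the standard csv module with a tab delimiter. This is only a sketch: the counts list copies the first two rows above, and an in-memory io.StringIO buffer stands in for the real file so the example runs anywhere.

```python
import csv
import io

# Counts copied from the first two rows of the output above.
counts = [(('looked', 'far'), 4), (('far', 'looked'), 3)]

# StringIO stands in for open('bigrams-list.tsv', 'w', newline='').
buffer = io.StringIO()
writer = csv.writer(buffer, delimiter='\t', lineterminator='\n')

for bigram, count in counts:
    # One row per bigram: the space-joined bigram, then its count.
    writer.writerow([' '.join(bigram), count])

print(buffer.getvalue(), end='')
```

The csv writer would also quote a field if it ever contained a tab, which the manual join does not do.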
Food for thought
Now you see this strange bigram Little little, does it make sense?
It's a by-product of removing by from Little by little.
So now, depending on the ultimate task for the ngrams you've extracted, you might not really want to remove stopwords from the list.
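One possible alternative, if it suits your task, is to extract the ngrams first and then drop any ngram that contains a stopword, so that non-adjacent words are never glued together. This is a sketch of that idea and not what the code above does; mini_stoplist and the token fragment are hand-written assumptions.

```python
# Tiny hand-written stoplist for illustration only; NOT NLTK's list.
mini_stoplist = {'no', 'by'}

# Hand-tokenized fragment with punctuation already removed (an assumption).
tokens = ['held', 'no', 'memory', 'Little', 'by', 'little']

# Extract bigrams from the unfiltered tokens first.
all_bigrams = list(zip(tokens, tokens[1:]))

# Then discard every bigram in which any token is a stopword.
kept = [bg for bg in all_bigrams
        if not any(tok.lower() in mini_stoplist for tok in bg)]

print(kept)
```

With this ordering the artificial pair ('Little', 'little') never appears, at the cost of keeping fewer bigrams overall.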