
I have written code that calculates bigram / trigram frequencies from a text input, using NLTK. The problem I am facing is that since the output is a Python list of tuples, my printed output contains list-specific characters, i.e. ("()", "'", ","). I plan to export this into a CSV file, and thus I would like to remove these special characters at the code level itself. How can I make that edit?

Input Code:

import nltk
from nltk import word_tokenize, pos_tag
from nltk.collocations import *
from itertools import *
from nltk.util import ngrams
from nltk.corpus import stopwords

corpus = '''The pure amnesia of her face,
newborn. I looked so far into her that, for a while, looked so far into her that, for a while  looked so far into her that, for a while looked so far into her that, for a while the visual 
held no memory. Little by little, I returned to myself, waking to nurse the visual held no  memory. Little by little, I returned to myself, waking to nurse
'''
s_corpus = corpus.lower()

stop_words = set(stopwords.words('english'))

tokens = nltk.word_tokenize(s_corpus)
tokens = [word for word in tokens if word not in stop_words]

c_tokens = [''.join(e for e in string if e.isalnum()) for string in tokens]
c_tokens = [x for x in c_tokens if x]

bgs_2 = nltk.bigrams(c_tokens)
bgs_3 = nltk.trigrams(c_tokens)

fdist = nltk.FreqDist(bgs_3)

tmp = list()
for k,v in fdist.items():
    tmp.append((v,k))
tmp = sorted(tmp, reverse=True)

for kk,vv in tmp[:]:
    print (vv,kk)

Current Output:

('looked', 'far', 'looked') 3
('far', 'looked', 'far') 3
('visual', 'held', 'memory') 2
('returned', 'waking', 'nurse') 2

Expected Output:

looked far looked, 3
far looked far, 3
visual held memory, 2
returned waking nurse, 2

Thanks for your help in advance.

Ayush Saxena
    Those special characters are not actually part of the list. They are just formatted that way when you use the `print()` command. The values contained in the list are just the words you want (no `(` or `,` or `'` in them) – Karl Aug 30 '18 at 14:24
  • " I plan to export this into a csv file," - just do that and it'll be fine – Thomas Weller Aug 30 '18 at 14:27
  • Even when I exported them into CSV format, the special characters still remained and I had to remove them manually in Excel; is there any way to avoid having them? – Ayush Saxena Aug 30 '18 at 14:33
  • @AyushSaxena if that is the case you should show how you are exporting the values to CSV, or create a question related to that, because the error is probably occurring in the way you are saving it to the CSV file. – Karl Aug 30 '18 at 15:11

2 Answers


A better question would have been: what are those ("()", "'", ",") characters in the ngrams output?

>>> from nltk import ngrams
>>> from nltk import word_tokenize

# Split a sentence into a list of "words"
>>> word_tokenize("This is a foo bar sentence")
['This', 'is', 'a', 'foo', 'bar', 'sentence']
>>> type(word_tokenize("This is a foo bar sentence"))
<class 'list'>

# Extract bigrams.
>>> list(ngrams(word_tokenize("This is a foo bar sentence"), 2))
[('This', 'is'), ('is', 'a'), ('a', 'foo'), ('foo', 'bar'), ('bar', 'sentence')]

# Okay, so the output is a list, no surprise.
>>> type(list(ngrams(word_tokenize("This is a foo bar sentence"), 2)))
<class 'list'>

But what type is ('This', 'is')?

>>> list(ngrams(word_tokenize("This is a foo bar sentence"), 2))[0]
('This', 'is')
>>> first_thing_in_output = list(ngrams(word_tokenize("This is a foo bar sentence"), 2))[0]
>>> type(first_thing_in_output)
<class 'tuple'>

Ah, it's a tuple, see https://realpython.com/python-lists-tuples/

What happens when you print a tuple?

>>> print(first_thing_in_output)
('This', 'is')

What happens if you convert them into a str()?

>>> print(str(first_thing_in_output))
('This', 'is')

But I want the output This is instead of ('This', 'is'), so I will use the str.join() function, see https://www.geeksforgeeks.org/join-function-python/:

>>> print(' '.join(first_thing_in_output))
This is

Now this is a good point to really go through the tutorial of basic Python types to understand what is happening. Additionally, it'll be good to understand how "container" types work too, e.g. https://github.com/usaarhat/pywarmups/blob/master/session2.md


Going through the original post, there are quite some issues with the code.

I guess the goal of the code is to:

  • Tokenize the text and remove stopwords
  • Extract ngrams (without stopwords)
  • Print out their string forms and their counts

The tricky part is that stopwords.words('english') does not contain punctuation, so you'll end up with strange ngrams that contain punctuation:

from nltk import word_tokenize
from nltk.util import ngrams
from nltk.corpus import stopwords

text = '''The pure amnesia of her face,
newborn. I looked so far into her that, for a while, looked so far into her that, for a while  looked so far into her that, for a while looked so far into her that, for a while the visual 
held no memory. Little by little, I returned to myself, waking to nurse the visual held no  memory. Little by little, I returned to myself, waking to nurse
'''

stoplist = set(stopwords.words('english'))

tokens = [token for token in word_tokenize(text) if token not in stoplist]

list(ngrams(tokens, 2))

[out]:

[('The', 'pure'),
 ('pure', 'amnesia'),
 ('amnesia', 'face'),
 ('face', ','),
 (',', 'newborn'),
 ('newborn', '.'),
 ('.', 'I'),
 ('I', 'looked'),
 ('looked', 'far'),
 ('far', ','),
 (',', ','), ...]

Perhaps you would like to extend the stoplist with punctuation, e.g.

from string import punctuation
from nltk import word_tokenize
from nltk.util import ngrams
from nltk.corpus import stopwords

text = '''The pure amnesia of her face,
newborn. I looked so far into her that, for a while, looked so far into her that, for a while  looked so far into her that, for a while looked so far into her that, for a while the visual 
held no memory. Little by little, I returned to myself, waking to nurse the visual held no  memory. Little by little, I returned to myself, waking to nurse
'''

stoplist = set(stopwords.words('english') + list(punctuation))

tokens = [token for token in word_tokenize(text) if token not in stoplist]

list(ngrams(tokens, 2))

[out]:

[('The', 'pure'),
 ('pure', 'amnesia'),
 ('amnesia', 'face'),
 ('face', 'newborn'),
 ('newborn', 'I'),
 ('I', 'looked'),
 ('looked', 'far'),
 ('far', 'looked'),
 ('looked', 'far'), ...]

Then you realize that a token like I should be a stopword but still exists in your list of ngrams. That's because the words in the list from stopwords.words('english') are lowercased, e.g.

>>> stopwords.words('english')

[out]:

['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're", ...]

So when you're checking whether a token is in the stoplist, you should also lowercase the token. (Avoid lowercasing the sentence before word_tokenize because word_tokenize may take cues from capitalization). Thus:

from string import punctuation
from nltk import word_tokenize
from nltk.util import ngrams
from nltk.corpus import stopwords

text = '''The pure amnesia of her face,
newborn. I looked so far into her that, for a while, looked so far into her that, for a while  looked so far into her that, for a while looked so far into her that, for a while the visual 
held no memory. Little by little, I returned to myself, waking to nurse the visual held no  memory. Little by little, I returned to myself, waking to nurse
'''

stoplist = set(stopwords.words('english') + list(punctuation))

tokens = [token for token in word_tokenize(text) if token.lower() not in stoplist]

list(ngrams(tokens, 2))

[out]:

[('pure', 'amnesia'),
 ('amnesia', 'face'),
 ('face', 'newborn'),
 ('newborn', 'looked'),
 ('looked', 'far'),
 ('far', 'looked'),
 ('looked', 'far'),
 ('far', 'looked'),
 ('looked', 'far'),
 ('far', 'looked'), ...]

Now the ngram extraction looks like it's achieving the objectives:

  • Tokenize the text and remove stopwords
  • Extract ngrams (without stopwords)

Then for the last part, where you want to print the ngrams out to a file in sorted order, you could use FreqDist.most_common(), which lists entries in descending order of count, e.g.

from string import punctuation
from nltk import word_tokenize
from nltk.util import ngrams
from nltk.corpus import stopwords
from nltk import FreqDist

text = '''The pure amnesia of her face,
newborn. I looked so far into her that, for a while, looked so far into her that, for a while  looked so far into her that, for a while looked so far into her that, for a while the visual 
held no memory. Little by little, I returned to myself, waking to nurse the visual held no  memory. Little by little, I returned to myself, waking to nurse
'''

stoplist = set(stopwords.words('english') + list(punctuation))

tokens = [token for token in word_tokenize(text) if token.lower() not in stoplist]

FreqDist(ngrams(tokens, 2)).most_common()

[out]:

[(('looked', 'far'), 4),
 (('far', 'looked'), 3),
 (('visual', 'held'), 2),
 (('held', 'memory'), 2),
 (('memory', 'Little'), 2),
 (('Little', 'little'), 2),
 (('little', 'returned'), 2),
 (('returned', 'waking'), 2),
 (('waking', 'nurse'), 2),
 (('pure', 'amnesia'), 1),
 (('amnesia', 'face'), 1),
 (('face', 'newborn'), 1),
 (('newborn', 'looked'), 1),
 (('far', 'visual'), 1),
 (('nurse', 'visual'), 1)]

(See also: Difference between Python's collections.Counter and nltk.probability.FreqDist)
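As a quick aside on that difference: FreqDist is implemented as a subclass of collections.Counter, so the counting behaviour shown above carries over. A minimal sketch (the pairs list is made-up sample data):

```python
from collections import Counter

from nltk import FreqDist

# FreqDist subclasses Counter, so Counter's counting API carries over.
pairs = [('looked', 'far'), ('far', 'looked'), ('looked', 'far')]

print(issubclass(FreqDist, Counter))   # True
print(Counter(pairs).most_common(1))   # [(('looked', 'far'), 2)]
print(FreqDist(pairs).most_common(1))  # same counts
```

What FreqDist adds on top are the probability-oriented helpers (e.g. freq(), N()) that a plain Counter doesn't have.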

Finally, when printing it out to a file, you should really use a context manager, see http://eigenhombre.com/introduction-to-context-managers-in-python.html

with open('bigrams-list.tsv', 'w') as fout:
    for bg, count in FreqDist(ngrams(tokens, 2)).most_common():
        print('\t'.join([' '.join(bg), str(count)]), end='\n', file=fout)

[bigrams-list.tsv]:

looked far  4
far looked  3
visual held 2
held memory 2
memory Little   2
Little little   2
little returned 2
returned waking 2
waking nurse    2
pure amnesia    1
amnesia face    1
face newborn    1
newborn looked  1
far visual  1
nurse visual    1

Food for thought

Now you see this strange bigram Little little, does it make sense?

It's a by-product of removing by from

Little by little

So now, depending on what's the ultimate task for the ngrams you've extracted, you might not really want to remove stopwords from the list.

alvas
  • Excellent response! A lot of the questions I had in my mind were cleared up by this one detailed response. Thanks! – Casey Jun 19 '23 at 22:47

So just to "fix" your output: Use this to print your data:

for kk, vv in tmp:
    # kk is the count, vv is the ngram tuple
    print("%s, %d" % (" ".join(vv), kk))

BUT if you are going to write this to a CSV file, you should collect your output in a different format.

Currently you are creating a list of tuples, each containing a tuple and a number. Try to collect your data as a list of lists containing each value instead. That way you can write it directly into a CSV file.

Take a look here: Create a .csv file with values from a Python list
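For instance, a minimal sketch using the stdlib csv module, assuming tmp holds the (count, ngram) pairs built in the question (the sample data and the filename trigrams.csv are made up for illustration):

```python
import csv

# Sample data in the shape the question builds: (count, ngram-tuple) pairs.
tmp = [(3, ('looked', 'far', 'looked')),
       (3, ('far', 'looked', 'far')),
       (2, ('visual', 'held', 'memory'))]

with open('trigrams.csv', 'w', newline='') as fout:
    writer = csv.writer(fout)
    for count, ngram in tmp:
        # One row per ngram: joined words in one column, the count in another.
        writer.writerow([' '.join(ngram), count])
```

csv.writer takes care of delimiters and quoting for you, so no tuple brackets or quote characters end up in the file.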

Phillip