
From file_test.txt I need to count how many times each word appears in the file using the nltk.FreqDist() function. When I count a word's frequency I need to check whether that word is in pos_dict.txt, and if it is, multiply the word's frequency by the number associated with that same word in pos_dict.txt.

file_test.txt looks like this:

  abandon, abandon, calm, clear

pos_dict.txt looks like this for these words:

"abandon":2,"calm":2,"clear":1,...

My code is:

from urllib.request import urlopen
from bs4 import BeautifulSoup
import re
import nltk

f_input_pos=open('file_test.txt','r').read()

def features_pos(dat):
    tokens = nltk.word_tokenize(dat)
    fdist=nltk.FreqDist(tokens)

    f_pos_dict=open('pos_dict.txt','r').read()
    f=f_pos_dict.split(',') 

    for part in f:
        b=part.split(':')
        c=b[-1]   #to catch the number
        T2 = eval(str(c).replace("'","")) # convert number from string to int

        for word in fdist:
            if word in f_pos_dict:
               d=fdist[word]
               print(word,'->',d*T2)


features_pos(f_input_pos)

So my output needs to be like this:

abandon->4
calm->2
clear->1

But my output duplicates every line and obviously multiplies wrong. I'm a bit stuck and I don't know where the error is; probably I'm using the for loops wrong. If somebody can help, I would appreciate it :)

D.Fig
  • What does your input file look like? Can you post a link or a sample of the `file_test.txt` and `pos_dict.txt`? – alvas May 24 '16 at 01:32
  • My input file `file_test.txt` looks the same as I wrote in my question, in `pos_dict.txt` other words are included but are not important for understanding. – D.Fig May 24 '16 at 11:15

1 Answer


First, here's a quick way to read your pos_dict.txt file by reading it as a string representation of a dictionary:

alvas@ubi:~$ echo '"abandon":2,"calm":2,"clear":1' > pos_dict.txt
alvas@ubi:~$ python
Python 2.7.11+ (default, Apr 17 2016, 14:00:29) 
[GCC 5.3.1 20160413] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import io
>>> with io.open('pos_dict.txt', 'r') as fin:
...     pos_dict = eval("{" + fin.read() + "}")
... 
>>>
>>> pos_dict['abandon']
2
>>> pos_dict['clear']
1
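(An aside, not part of the original answer: since `pos_dict.txt` contains only literal key-value pairs, `ast.literal_eval` is a safer drop-in for `eval` here, because it parses literals only and cannot execute code. A minimal sketch, assuming the same file format as above:)

```python
import ast

# Inline sample standing in for the contents of pos_dict.txt (assumed format)
raw = '"abandon":2,"calm":2,"clear":1'

# literal_eval parses Python literals only, so unlike eval it cannot
# execute arbitrary code that might be hidden in the file
pos_dict = ast.literal_eval("{" + raw + "}")
print(pos_dict["abandon"])  # -> 2
print(pos_dict["clear"])    # -> 1
```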

Next, to read your file_test.txt, we have to read the file, strip the leading and trailing whitespace, then split the words up by `, ` (comma followed by a space).

Then using the collections.Counter object we can get the token counts easily (also see Difference between Python's collections.Counter and nltk.probability.FreqDist):

alvas@ubi:~$ echo 'abandon, abandon, calm, clear' > file_test.txt
alvas@ubi:~$ python
Python 2.7.11+ (default, Apr 17 2016, 14:00:29) 
[GCC 5.3.1 20160413] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import io
>>> from collections import Counter
>>> with io.open('file_test.txt', 'r') as fin:
...     tokens = fin.read().strip().split(', ')
... 
>>> Counter(tokens)
Counter({u'abandon': 2, u'clear': 1, u'calm': 1})

To access the token counts from file_test.txt and multiply them by the values from pos_dict.txt, we iterate through the Counter object using its .items() function (the same way we access a dictionary's key-value pairs):

>>> import io
>>> from collections import Counter
>>> with io.open('file_test.txt', 'r') as fin:
...     tokens = fin.read().strip().split(', ')
... 
>>> 
>>> word_counts = Counter(tokens)
>>> with io.open('pos_dict.txt', 'r') as fin:
...     pos_dict = eval("{" + fin.read() + "}")
... 
>>>
>>> token_times_posdict = {word:freq*pos_dict[word] for word, freq in word_counts.items()}
>>> token_times_posdict
{u'abandon': 4, u'clear': 1, u'calm': 2}
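(Note, added to the original answer: the comprehension above raises a KeyError if a token from file_test.txt is missing from pos_dict.txt. A sketch of one way to guard against that with `dict.get`, using made-up data where "breeze" is deliberately absent from the dictionary:)

```python
from collections import Counter

# Hypothetical data mirroring the two files; "breeze" is not in pos_dict
tokens = ["abandon", "abandon", "calm", "clear", "breeze"]
pos_dict = {"abandon": 2, "calm": 2, "clear": 1}

# pos_dict.get(word, 0) scores unknown words as 0 instead of raising KeyError
token_times_posdict = {word: freq * pos_dict.get(word, 0)
                       for word, freq in Counter(tokens).items()}
print(token_times_posdict)
# abandon -> 4, calm -> 2, clear -> 1, breeze -> 0
```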

And then to print them out:

>>> for word, value in token_times_posdict.items():
...     print "{} -> {}".format(word, value)
... 
abandon -> 4
clear -> 1
calm -> 2
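(If you also want the output ordered by score, which a plain dict's .items() does not guarantee, one way is to wrap the result in a Counter and use its most_common() method. A sketch, not part of the original session, using the same example scores:)

```python
from collections import Counter

# The scores computed above, hard-coded here for a self-contained example
token_times_posdict = {"abandon": 4, "clear": 1, "calm": 2}

# Counter.most_common() yields (word, value) pairs sorted by value, descending
for word, value in Counter(token_times_posdict).most_common():
    print("{} -> {}".format(word, value))
# abandon -> 4
# calm -> 2
# clear -> 1
```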
alvas
  • I guess I have resolved your "homework", but please do understand the code and not just copy and paste it. – alvas May 24 '16 at 13:00
  • BTW, the `dict.items()` function would not sort the keys by their values; try looking at `Counter`'s functions and you'll find one that does just what you're looking for (Hint: `from collections import Counter; dir(Counter)` =) – alvas May 24 '16 at 13:48
  • Thank you for your effort, it is really a good solution. I understand the code. Since I'm new to language processing and nltk I didn't know about representing strings as a dictionary, but now it's clear. Thank you. – D.Fig May 24 '16 at 19:24
  • I'm glad the answer helped. Have fun with Python and NLTK =) – alvas May 25 '16 at 00:12
  • Maybe this would help you a little when it comes to understanding Python's container: https://github.com/usaarhat/pywarmups/blob/master/session2.md – alvas May 25 '16 at 00:13