Word and Phrase Frequencies from txt Files in Python

Question

I am doing some textual analysis. I am trying to get the total word count (based on a list of words) and the total phrase count (based on a list of phrases) for each file in a folder. So far I have the code below, but I keep getting the error 'str' object has no attribute 'words'. The code is pieced together from several other scripts, so I don't know which part is causing the problem. Any help would be appreciated.

import csv
import glob
import re
import string
import sys
import time

target_files = r'C:/Users/Mansoor/Documents/Files/*.*'

output_file = r'C:/Users/Mansoor/Documents/Parser.csv'

output_fields = ['file name,', 'file size,', 'words,', 'phrases,']

words = {'uncertainty', 'downturn', 'shock'}
phrases = {'economic downturn', 'political uncertainty'}

def main():

    f_out = open(output_file, 'w')
    wr = csv.writer(f_out, lineterminator='\n')
    wr.writerow(output_fields)

    file_list = glob.glob(target_files)
    for file in file_list:
        print(file)
        with open(file, 'r', encoding='UTF-8', errors='ignore') as f_in:
            doc = f_in.read()
        doc_len = len(doc)
        doc = doc.lower()
        output_data = get_data(doc)
        output_data[0] = file
        output_data[1] = doc_len
        wr.writerow(output_data)

def get_data(doc):

    vdictionary = {}
    _odata = [0] * 4

    tokens = re.findall('\w(?:[-\w]*\w)?', doc)
    for token in tokens:
        if token not in vdictionary:
            vdictionary[token] = 1
        if token.words: _odata[2] += 1
    for w1, w2 in zip(phrases, phrases[1:]):
        phrase = w1 + " " + w2
        if phrase.phrases: _odata[3] += 1
    return _odata

if __name__ == '__main__':
    print('\n' + time.strftime('%c') + '\nUncertainty.py\n')
    main()
    print('\n' + time.strftime('%c') + '\nNormal termination.')

The error message usually tells you where the error was noted, and you can work backwards from there. The issue is probably `token.words` in line 10 from the bottom: What is that supposed to be? Why would it evaluate to a Boolean (being used in the condition)? — Arne, May 14 '20 at 14:17
Thank you Arne. I am trying to split all the words into tokens and then calculate the frequency based on the list of words provided above. I am trying to see if the word or phrase is in the list and then adding 1 to the relevant field in the .csv file every time there is a match. — Mansoor, May 14 '20 at 14:42
Maybe this can help, in particular note the answer given by Tim McNamara: https://stackoverflow.com/questions/6181763/converting-a-string-to-a-list-of-words — Arne, May 14 '20 at 18:47
I don't think that is what I am looking for. His comment relates to adding words to a dictionary but I simply want the number of times each word in the list of words occurs in the text files. — Mansoor, May 14 '20 at 21:46

score 0 · Answer 1 · answered May 14 '20 at 14:14

The error is in the line `if token.words: _odata[2] += 1`. Most probably it occurs because `token` is a plain string, not a dict or some other data structure that has a `words` attribute. Print the token inside the loop to check its value:

for token in tokens:
    print(token)  # print token here to see its value
    if token not in vdictionary:
        vdictionary[token] = 1
    if token.words: _odata[2] += 1
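
To make the diagnosis concrete, here is a minimal sketch of why the two failing lines in the question raise exceptions, and what a membership test does instead. The sample sentence and the variable names `attr_error`, `type_error`, and `word_count` are made up for illustration; `words`, `phrases`, and the regular expression come from the question.

```python
import re

words = {'uncertainty', 'downturn', 'shock'}
phrases = {'economic downturn', 'political uncertainty'}

# Hypothetical sample text, used only for this illustration.
doc = 'political uncertainty and an economic downturn followed the shock'
tokens = re.findall(r'\w(?:[-\w]*\w)?', doc)

# token.words looks up an attribute named "words" on a str, which does not exist.
try:
    tokens[0].words
    attr_error = ''
except AttributeError as exc:
    attr_error = str(exc)

# phrases[1:] fails as well, because a set cannot be indexed or sliced.
try:
    phrases[1:]
    type_error = ''
except TypeError as exc:
    type_error = str(exc)

# A membership test asks whether the token is one of the target words.
word_count = sum(1 for token in tokens if token in words)

print(attr_error)
print(type_error)
print(word_count)
```

The `in` operator is probably what the original `token.words` was meant to express: it checks whether the token is an element of the `words` set.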

answered May 14 '20 at 14:14

Abhishek-Saini


`print(token)` does not result in anything. I am trying to split all the words into tokens and then calculate the frequency based on the list of words provided above. – Mansoor May 14 '20 at 14:37

score 0 · Accepted Answer · answered May 14 '20 at 22:08

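I solved this myself. The fix was to replace the attribute lookups (`token.words`, `phrase.phrases`) with membership tests (`token in words`, `phrase in phrases`), and to build the two-word phrases from consecutive tokens instead of from the `phrases` set. Here is the working code.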

import csv
import glob
import re
import string
import sys
import time

target_files = r'C:/Users/Mansoor/Documents/Files/*.*'

output_file = r'C:/Users/Mansoor/Documents/Parser.csv'

# csv.writer adds the delimiters itself, so the field names carry no commas
output_fields = ['file name', 'file size', 'words', 'phrases']

words = {'uncertainty', 'downturn', 'shock'}
phrases = {'economic downturn', 'political uncertainty'}

def main():

    f_out = open(output_file, 'w')
    wr = csv.writer(f_out, lineterminator='\n')
    wr.writerow(output_fields)

    file_list = glob.glob(target_files)
    for file in file_list:
        print(file)
        with open(file, 'r', encoding='UTF-8', errors='ignore') as f_in:
            doc = f_in.read()
        doc_len = len(doc)
        doc = doc.lower()
        output_data = get_data(doc)
        output_data[0] = file
        output_data[1] = doc_len
        wr.writerow(output_data)

    f_out.close()  # flush buffered rows and release the output file

def get_data(doc):

    _odata = [0] * 4

    tokens = re.findall(r'\w(?:[-\w]*\w)?', doc)  # raw string avoids invalid-escape warnings
    for token in tokens:
        if token in words:
            _odata[2] += 1
    for w1, w2 in zip(tokens, tokens[1:]):
        phrase = w1 + " " + w2
        if phrase in phrases:
            _odata[3] += 1
    return _odata

if __name__ == '__main__':
    print('\n' + time.strftime('%c') + '\nUncertainty.py\n')
    main()
    print('\n' + time.strftime('%c') + '\nNormal termination.')
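
The code above only matches two-word phrases, because it pairs each token with the next one. As a possible extension (not part of the original answer), the sketch below counts phrases of any length and also keeps a separate count per term. The function name `count_terms`, the three-word phrase, and the sample text are hypothetical and exist only for this illustration.

```python
import re
from collections import Counter

words = {'uncertainty', 'downturn', 'shock'}
# 'global supply shock' is a hypothetical three-word phrase added for this sketch.
phrases = {'economic downturn', 'political uncertainty', 'global supply shock'}


def count_terms(doc, words, phrases):
    """Return (word_counts, phrase_counts) as Counters keyed by term."""
    tokens = re.findall(r'\w(?:[-\w]*\w)?', doc.lower())

    # Count every token that is one of the target words.
    word_counts = Counter(t for t in tokens if t in words)

    phrase_counts = Counter()
    # Slide a window over the tokens for each phrase length that occurs.
    for n in {len(p.split()) for p in phrases}:
        for i in range(len(tokens) - n + 1):
            candidate = ' '.join(tokens[i:i + n])
            if candidate in phrases:
                phrase_counts[candidate] += 1
    return word_counts, phrase_counts


# Hypothetical sample text.
sample = ('Economic downturn fears rose. '
          'A global supply shock deepened the economic downturn.')
word_counts, phrase_counts = count_terms(sample, words, phrases)
print(sum(word_counts.values()), sum(phrase_counts.values()))
```

Summing the values of each `Counter` gives the per-file totals that the CSV columns expect, while the individual entries show how often each word or phrase occurred. As in the code above, punctuation is discarded during tokenization, so a phrase can match across a sentence boundary.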
