0

I want to analyze a text file that holds a short story and make different types of graphs from it. I have found plenty of ways to read a text file that holds numeric data, but not one that holds actual words. I know I can do something like this:

word_count = 0
with open('short_story.txt') as f:
    for line in f:
        for word in line.split():
            word_count += 1  # tally every whitespace-separated word

to count the words in the file. But is that the appropriate way to do it when I am using numpy and matplotlib? If anyone could explain how to work with a text file of words, rather than numeric data, that would be great.

************

"Radio for warships, eh?" he muttered. A wireless transmitter was one of many modern innovations that the Virginia did not boast. She had been gathering copra and shell among the islands long before such things came into common use, though Dan had invested his modest savings in her only a year before.

"What would anyone want with warships on Davis Island?" The name roused a vague memory. "Davis Island?" he repeated, staring in concentration at the black sea. "Of course!" It came to him suddenly. A newspaper article that he had read five years before, at about the time he had abandoned college in the middle of his junior year, to follow the call of adventure.

The account had dealt with an eclipse of the sun, visible only from certain points on the Pacific. One Dr. Hunter, under the auspices of a Western university, had sailed with his instruments and assistants to Davis Island, to study the solar corona during the few precious moments when the shadow covered the sun, and to observe the displacement of certain stars as a test of Einstein's theory of relativity.

with open('story.txt', 'r') as f:
    words = [x for y in [l.split() for l in f.readlines()] for x in y]
print(sorted([(w, words.count(w)) for w in set(words)], key=lambda x: x[1], reverse=True)[:5])

That finds the top five words. Now I want to plot them in something like a bar graph. These are the top five words I got:

[('the', 4826), ('of', 2276), ('and', 1825), ('a', 1761), ('to', 1693)]

2 Answers

0

Can you paste part of your file?

You could do something like this:

#!/usr/bin/env python3

# Count how often each whitespace-separated word occurs in the file.
wordcount = {}
with open("txtfile", "r") as f:
    for word in f.read().split():
        if word not in wordcount:
            wordcount[word] = 1
        else:
            wordcount[word] += 1
print(wordcount)

And you will obtain something like this:

{'goat': 2, 'cow': 1, 'Dog': 1, 'lion': 1, 'snake': 1}

Another way:

#!/usr/bin/env python3

wordcount = {}
with open("txtfile", "r") as f:
    for word in f.read().split():
        if word not in wordcount:
            wordcount[word] = 1
        else:
            wordcount[word] += 1
# Print one "word count" pair per line.
for k, v in wordcount.items():
    print(k, v)

And you will get:

word  wordcount
goat    2
cow     1
Dog     1
...
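
The same tally can be sketched more compactly with `collections.Counter` from the standard library, whose `most_common` method also returns the top words directly. This is only a sketch: the sample string stands in for the contents of the file, and the name `top_words` is made up for illustration.

```python
from collections import Counter

# Stand-in for open("txtfile").read(); the sample text is made up for illustration.
text = "goat cow Dog goat lion snake"

# Counter builds the same word -> count mapping as the explicit loop above.
wordcount = Counter(text.split())

# most_common(n) returns the n most frequent (word, count) pairs, highest first.
top_words = wordcount.most_common(2)
print(top_words)
```

For the short story in the question, `wordcount.most_common(5)` would give the five (word, count) pairs to plot.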
Essex
  • I don't know if you meant what I have so far in my program or a piece of the text file. Here's a piece of the text file; it is part of a story, just words. ^^^ – LizardWizard Apr 27 '16 at 00:44
  • @LizardWizard What do you want to make a graph of? I don't understand. – Essex Apr 27 '16 at 00:48
  • I want to analyze the data in different ways. For example, I want to count the top 5 words used and then use that data to make a graph. – LizardWizard Apr 27 '16 at 00:53
  • I put 3 paragraphs from the file in the original question. Okay, so with that, should I convert the data to an array, or maybe two separate arrays, one holding the word objects and the other holding the numbers? Then use matplotlib to create a plot with the numbers on the y axis and the words on the x axis? Matplotlib is what really confuses me the most, honestly. – LizardWizard Apr 27 '16 at 01:07
  • I edited more into the original question. I am sorry if I am not very clear, but matplotlib is the main issue. I can't find much help with plotting data like this from a text file. I mostly find plotting data (numbers) from a text file or using pandas with a CSV file, but nothing on reading a text file like this and plotting the information found. Hopefully you might be able to help. – LizardWizard Apr 27 '16 at 01:14
0

You can get the frequency of each element in a NumPy array this way:

import numpy as np
x = np.array([1,1,1,2,2,2,5,25,1,1])
y = np.bincount(x)
ii = np.nonzero(y)[0]
np.vstack((ii,y[ii])).T
# array([[ 1,  5],
#        [ 2,  3],
#        [ 5,  1],
#        [25,  1]])
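
Note that `np.bincount` only accepts non-negative integers, so it cannot be applied to the words themselves. For an array of strings, `np.unique` with `return_counts=True` gives the analogous result. A minimal sketch, assuming the words have already been split out of the file (the sample list and variable names are illustrative):

```python
import numpy as np

# Illustrative sample; in practice this would come from file.read().split().
words = np.array(["goat", "cow", "Dog", "goat", "lion", "snake"])

# unique_words is sorted; counts[i] is the number of occurrences of unique_words[i].
unique_words, counts = np.unique(words, return_counts=True)

# Reorder both arrays from most to least frequent.
order = np.argsort(counts)[::-1]
top_words = unique_words[order]
top_counts = counts[order]
print(list(zip(top_words.tolist(), top_counts.tolist())))
```

Slicing `top_words[:5]` and `top_counts[:5]` then yields two parallel arrays, one of labels and one of numbers, which is the shape matplotlib expects for a bar chart.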

For better word splitting you can use an NLP tool, e.g. the word tokenizer from NLTK.
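
If installing NLTK is more than you need, a regular expression can serve as a rough stand-in. This is not NLTK's tokenizer, just a simple approximation: it lowercases the text and keeps runs of letters and apostrophes, so punctuation is dropped and `Radio` and `radio` count as the same word. The function name `simple_tokenize` is made up for this sketch.

```python
import re

def simple_tokenize(text):
    """Lowercase the text and return runs of letters/apostrophes as words."""
    return re.findall(r"[a-z']+", text.lower())

# First sentence of the excerpt in the question.
sample = '"Radio for warships, eh?" he muttered.'
tokens = simple_tokenize(sample)
print(tokens)
```

One known limitation: words wrapped in single quotes keep their quote characters, so this is only suitable for a first pass.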

Similar question about using matplotlib for visualization of such data.
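
For the top-five list in the question, a bar chart could be drawn roughly as follows. This is a sketch under a few assumptions: the counts are pasted in from the question rather than recomputed, the `Agg` backend is selected so the script runs without a display, and the output file name `top_words.png` is arbitrary.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, so no display is needed
import matplotlib.pyplot as plt

# Top five (word, count) pairs as posted in the question.
top_words = [('the', 4826), ('of', 2276), ('and', 1825), ('a', 1761), ('to', 1693)]

# Split the pairs into two parallel sequences: labels and bar heights.
labels, counts = zip(*top_words)
positions = list(range(len(labels)))

fig, ax = plt.subplots()
bars = ax.bar(positions, counts)   # one bar per word
ax.set_xticks(positions)           # put a tick under each bar
ax.set_xticklabels(labels)         # label each tick with its word
ax.set_xlabel('Word')
ax.set_ylabel('Occurrences')
ax.set_title('Top five words')
fig.savefig('top_words.png')       # arbitrary output file name
```

To open an interactive window instead, drop the `matplotlib.use("Agg")` line and replace `fig.savefig(...)` with `plt.show()`.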

Community