-2

I am reading a file and counting word frequencies to find the 100 most common words. That part works and produces the following list:

[('test', 510), ('Hey', 362), ("please", 753), ('take', 446), ('herbert', 325), ('live', 222), ('hate', 210), ('white', 191), ('simple', 175), ('harry', 172), ('woman', 170), ('basil', 153), ('things', 129), ('think', 126), ('bye', 124), ('thing', 120), ('love', 107), ('quite', 107), ('face', 107), ('eyes', 107), ('time', 106), ('himself', 105), ('want', 105), ('good', 105), ('really', 103), ('away',100), ('did', 100), ('people', 99), ('came', 97), ('say', 97), ('cried', 95), ('looked', 94), ('tell', 92), ('look', 91), ('world', 89), ('work', 89), ('project', 88), ('room', 88), ('going', 87), ('answered', 87), ('mr', 87), ('little', 87), ('yes', 84), ('silly', 82), ('thought', 82), ('shall', 81), ('circle', 80), ('hallward', 80), ('told', 77), ('feel', 76), ('great', 74), ('art', 74), ('dear',73), ('picture', 73), ('men', 72), ('long', 71), ('young', 70), ('lady', 69), ('let', 66), ('minute', 66), ('women', 66), ('soul', 65), ('door', 64), ('hand',63), ('went', 63), ('make', 63), ('night', 62), ('asked', 61), ('old', 61), ('passed', 60), ('afraid', 60), ('night', 59), ('looking', 58), ('wonderful', 58), ('gutenberg-tm', 56), ('beauty', 55), ('sir', 55), ('table', 55), ('turned', 54), ('lips', 54), ("one's", 54), ('better', 54), ('got', 54), ('vane', 54), ('right',53), ('left', 53), ('course', 52), ('hands', 52), ('portrait', 52), ('head', 51), ("can't", 49), ('true', 49), ('house', 49), ('believe', 49), ('black', 49), ('horrible', 48), ('oh', 48), ('knew', 47), ('curious', 47), ('myself', 47)]

After getting this list, I want to draw a histogram using matplotlib. I am trying something like the code below, but I am not able to draw a proper histogram.

My question: How do I pass the word frequencies to the graph? All of my bars are the same height right now, and even the bin centers are not correct. How should I pass data to the ax.hist method in the code below? I am trying to adapt the example from http://matplotlib.org/1.2.1/examples/api/histogram_demo.html.

totalWords = counts.most_common(100)
print(totalWords)
for z in range(len(totalWords)):
    words.append(totalWords[z][0])

x = np.arange(len(words))
#print x
i, s = 100, 15

fig = plt.figure()
ax = fig.add_subplot(111)

n, bins, patches = ax.hist(x, 50, normed=1, facecolor='green', alpha=0.75)


bincenters = 0.5*(bins[1:]+bins[:-1])

y = mlab.normpdf(bincenters*1.00, i, s)
l = ax.plot(bincenters, y, 'r--', linewidth=1)

ax.set_xlabel('Words')
ax.set_ylabel('Frequency')
ax.set_xlim(50, 160)
ax.set_ylim(0, 0.04)
ax.grid(True)

plt.show()
Alexander
user3314492

2 Answers

0

Perhaps this is what you want. This code produces a bar chart in which each bar represents one word and the vertical axis gives the frequency of that word in your text.

The counts array you provided is not sorted as I had expected though.

import numpy as np
import matplotlib.pyplot as plt

counts = [('test', 510), ('Hey', 362), ("please", 753), ('take', 446), ('herbert', 325), ('live', 222), ('hate', 210), ('white', 191), ('simple', 175), ('harry', 172), ('woman', 170), ('basil', 153), ('things', 129), ('think', 126), ('bye', 124), ('thing', 120), ('love', 107), ('quite', 107), ('face', 107), ('eyes', 107), ('time', 106), ('himself', 105), ('want', 105), ('good', 105), ('really', 103), ('away',100), ('did', 100), ('people', 99), ('came', 97), ('say', 97), ('cried', 95), ('looked', 94), ('tell', 92), ('look', 91), ('world', 89), ('work', 89), ('project', 88), ('room', 88), ('going', 87), ('answered', 87), ('mr', 87), ('little', 87), ('yes', 84), ('silly', 82), ('thought', 82), ('shall', 81), ('circle', 80), ('hallward', 80), ('told', 77), ('feel', 76), ('great', 74), ('art', 74), ('dear',73), ('picture', 73), ('men', 72), ('long', 71), ('young', 70), ('lady', 69), ('let', 66), ('minute', 66), ('women', 66), ('soul', 65), ('door', 64), ('hand',63), ('went', 63), ('make', 63), ('night', 62), ('asked', 61), ('old', 61), ('passed', 60), ('afraid', 60), ('night', 59), ('looking', 58), ('wonderful', 58), ('gutenberg-tm', 56), ('beauty', 55), ('sir', 55), ('table', 55), ('turned', 54), ('lips', 54), ("one's", 54), ('better', 54), ('got', 54), ('vane', 54), ('right',53), ('left', 53), ('course', 52), ('hands', 52), ('portrait', 52), ('head', 51), ("can't", 49), ('true', 49), ('house', 49), ('believe', 49), ('black', 49), ('horrible', 48), ('oh', 48), ('knew', 47), ('curious', 47), ('myself', 47)]
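# Split the (word, count) pairs into two parallel lists.
words = [x[0] for x in counts]
values = [int(x[1]) for x in counts]
print(words)
# One bar per word; the bar height is that word's count.
mybar = plt.bar(range(len(words)), values, color='green', alpha=0.4, label='Word count')

plt.xlabel('Word Index')
plt.ylabel('Frequency')
plt.title('Word Frequency Chart')
plt.legend()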

plt.show()

You can see the graph following a Zipfian (power-law) curve. Modify the code to suit your needs.

[Image: bar chart of word frequency against word index]
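One way to check the power-law claim is to sort the counts by rank and plot them on log-log axes, where a power law appears as a roughly straight line. The following is only a sketch: the ten pairs are a sample copied from the question's list, and the straight-line fit via np.polyfit is my own addition rather than part of the answer above.

```python
import numpy as np
import matplotlib.pyplot as plt

# Sample of the (word, count) pairs from the question (not the full list).
counts = [('test', 510), ('Hey', 362), ('please', 753), ('take', 446),
          ('herbert', 325), ('live', 222), ('hate', 210), ('white', 191),
          ('simple', 175), ('harry', 172)]

# Sort by count, largest first, so list position + 1 is the word's rank.
ranked = sorted(counts, key=lambda pair: pair[1], reverse=True)
ranks = np.arange(1, len(ranked) + 1, dtype=float)
freqs = np.array([count for _, count in ranked], dtype=float)

# Fit log(freq) = slope * log(rank) + intercept.
# The classic form of Zipf's law predicts a slope near -1.
slope, intercept = np.polyfit(np.log(ranks), np.log(freqs), 1)

fig, ax = plt.subplots()
ax.loglog(ranks, freqs, 'o', label='observed counts')
ax.loglog(ranks, np.exp(intercept) * ranks ** slope, 'r--',
          label='power-law fit (slope %.2f)' % slope)
ax.set_xlabel('Rank')
ax.set_ylabel('Frequency')
ax.legend()
# Call plt.show() to display the figure in an interactive session.
```

With the full 100-word list in place of the sample, the fitted slope gives a rough indication of how closely the text follows Zipf's law.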

Aditya
  • Thank you for answering @Aditya. But I am looking for a histogram here. That is why I was not sure what to pass to the .hist method. – user3314492 Jun 07 '15 at 16:11
  • Dude, a histogram is for sorted numerical quantities. Did you have a look at the graph my code generates? That's exactly what you want. In hist() you have to specify the 'bins', which are ranges of real numbers on the x-axis. Where are the real numbers in the words? – Aditya Jun 08 '15 at 04:27
  • Well @user3314492 what's different in this graph from your expectations? – Aditya Jun 08 '15 at 04:43
  • This is awesome. How do I adapt this to add the words on the X axis @Aditya? – Dhruv Ghulati Sep 07 '16 at 15:10
  • Also how did you get it so there are only labels every 20 increments? – Dhruv Ghulati Sep 07 '16 at 15:19
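On the two follow-up questions above: the answer's code never sets ticks, so the labels at every 20 were most likely chosen by matplotlib's default tick locator. To show words instead, the tick positions and labels can be set explicitly. A minimal sketch, assuming a short sample of the question's data and an illustrative `step` variable that is not part of the original answer:

```python
import matplotlib.pyplot as plt

# Sample of the (word, count) pairs from the question (not the full list).
counts = [('test', 510), ('Hey', 362), ('please', 753), ('take', 446),
          ('herbert', 325), ('live', 222), ('hate', 210), ('white', 191)]
words = [word for word, _ in counts]
values = [count for _, count in counts]
positions = list(range(len(words)))

fig, ax = plt.subplots()
ax.bar(positions, values, color='green', alpha=0.4)

# Label every `step`-th bar with its word; step = 1 labels all of them.
step = 2  # illustrative value; 5 or 10 may read better with 100 bars
ax.set_xticks(positions[::step])
ax.set_xticklabels(words[::step], rotation=90)

ax.set_xlabel('Word')
ax.set_ylabel('Frequency')
fig.tight_layout()  # leave room for the rotated labels
# Call plt.show() to display the figure in an interactive session.
```

Rotating the labels by 90 degrees keeps them from overlapping once there are many bars.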
0

It's a little unclear exactly what you want to graph, and how relevant the matplotlib demo you are adapting actually is.

I'll run through some options and try to answer your specific questions in each case:

  • Using the matplotlib demo, you only need to give ax.hist the list of word frequencies, x = [w[1] for w in words], but this just gives you the distribution of the frequencies themselves: most of the words occur fewer than 100 times, while a few words occur much more often. Your code above returns a histogram of equal bars because you are giving ax.hist the numbers from 0 to 99 once each. Note that this approach doesn't show the individual words.

  • Otherwise, I think you want a bar chart with each bar labelled as a different word.

This worked for me.

words = [('test', 510), ('Hey', 362), ("please", 753), ('take', 446),     ('herbert', 325), ('live', 222), ('hate', 210), ('white', 191), ('simple', 175),     ('harry', 172), ('woman', 170), ('basil', 153), ('things', 129), ('think', 126), ('bye', 124), ('thing', 120), ('love', 107), ('quite', 107), ('face', 107), ('eyes', 107), ('time', 106), ('himself', 105), ('want', 105), ('good', 105), ('really', 103), ('away',100), ('did', 100), ('people', 99), ('came', 97), ('say', 97), ('cried', 95), ('looked', 94), ('tell', 92), ('look', 91), ('world', 89), ('work', 89), ('project', 88), ('room', 88), ('going', 87), ('answered', 87), ('mr', 87), ('little', 87), ('yes', 84), ('silly', 82), ('thought', 82), ('shall', 81), ('circle', 80), ('hallward', 80), ('told', 77), ('feel', 76), ('great', 74), ('art', 74), ('dear',73), ('picture', 73), ('men', 72), ('long', 71), ('young', 70), ('lady', 69), ('let', 66), ('minute', 66), ('women', 66), ('soul', 65), ('door', 64), ('hand',63), ('went', 63), ('make', 63), ('night', 62), ('asked', 61), ('old', 61), ('passed', 60), ('afraid', 60), ('night', 59), ('looking', 58), ('wonderful', 58), ('gutenberg-tm', 56), ('beauty', 55), ('sir', 55), ('table', 55), ('turned', 54), ('lips', 54), ("one's", 54), ('better', 54), ('got', 54), ('vane', 54), ('right',53), ('left', 53), ('course', 52), ('hands', 52), ('portrait', 52), ('head', 51), ("can't", 49), ('true', 49), ('house', 49), ('believe', 49), ('black', 49), ('horrible', 48), ('oh', 48), ('knew', 47), ('curious', 47), ('myself', 47)]
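import matplotlib.pyplot as plt

# Keep the pairs as parallel lists; a dict would merge the two 'night' entries.
labels = [w[0] for w in words]
values = [w[1] for w in words]

plt.bar(range(len(values)), values, align='center')
# Label each bar with its word, rotated so the labels stay legible.
plt.xticks(range(len(labels)), labels, rotation=90)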

plt.show()
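For the first option (a true histogram with ax.hist), the input has to be the counts themselves rather than np.arange(len(words)). A minimal sketch, assuming a 20-word sample of the question's list and an arbitrary choice of 8 bins:

```python
import numpy as np
import matplotlib.pyplot as plt

# Sample of the (word, count) pairs from the question (not the full list).
words = [('test', 510), ('Hey', 362), ('please', 753), ('take', 446),
         ('herbert', 325), ('live', 222), ('hate', 210), ('white', 191),
         ('love', 107), ('quite', 107), ('face', 107), ('eyes', 107),
         ('world', 89), ('work', 89), ('soul', 65), ('door', 64),
         ('true', 49), ('house', 49), ('knew', 47), ('myself', 47)]

# The histogram input is the counts, not the indices 0..len(words)-1.
freqs = np.array([count for _, count in words])

fig, ax = plt.subplots()
# bins=8 is arbitrary; each bar shows how many words fall in that count range.
n, bins, patches = ax.hist(freqs, bins=8, facecolor='green', alpha=0.75)

ax.set_xlabel('Occurrences of a word')
ax.set_ylabel('Number of words')
ax.grid(True)
# Call plt.show() to display the figure in an interactive session.
```

Here the x-axis is the number of occurrences and the y-axis is how many words have a count in each range, so the individual words are not shown. As far as I know, newer matplotlib releases spell the demo's normed=1 option as density=True, which normalizes the bar areas to sum to 1.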
TMrtSmith
  • Thank you for the answer @TMrtSmith. I am looking for a histogram only, not a bar chart. I want a graph the same as or similar to the one in the example I linked. – user3314492 Jun 07 '15 at 16:09