Python Histogram using matplotlib on top words

Question

I am reading a file and calculating the frequency of the top 100 words. I am able to find that and create the following list:

[('test', 510), ('Hey', 362), ("please", 753), ('take', 446), ('herbert', 325), ('live', 222), ('hate', 210), ('white', 191), ('simple', 175), ('harry', 172), ('woman', 170), ('basil', 153), ('things', 129), ('think', 126), ('bye', 124), ('thing', 120), ('love', 107), ('quite', 107), ('face', 107), ('eyes', 107), ('time', 106), ('himself', 105), ('want', 105), ('good', 105), ('really', 103), ('away',100), ('did', 100), ('people', 99), ('came', 97), ('say', 97), ('cried', 95), ('looked', 94), ('tell', 92), ('look', 91), ('world', 89), ('work', 89), ('project', 88), ('room', 88), ('going', 87), ('answered', 87), ('mr', 87), ('little', 87), ('yes', 84), ('silly', 82), ('thought', 82), ('shall', 81), ('circle', 80), ('hallward', 80), ('told', 77), ('feel', 76), ('great', 74), ('art', 74), ('dear',73), ('picture', 73), ('men', 72), ('long', 71), ('young', 70), ('lady', 69), ('let', 66), ('minute', 66), ('women', 66), ('soul', 65), ('door', 64), ('hand',63), ('went', 63), ('make', 63), ('night', 62), ('asked', 61), ('old', 61), ('passed', 60), ('afraid', 60), ('night', 59), ('looking', 58), ('wonderful', 58), ('gutenberg-tm', 56), ('beauty', 55), ('sir', 55), ('table', 55), ('turned', 54), ('lips', 54), ("one's", 54), ('better', 54), ('got', 54), ('vane', 54), ('right',53), ('left', 53), ('course', 52), ('hands', 52), ('portrait', 52), ('head', 51), ("can't", 49), ('true', 49), ('house', 49), ('believe', 49), ('black', 49), ('horrible', 48), ('oh', 48), ('knew', 47), ('curious', 47), ('myself', 47)]

After getting this list, I want to draw a histogram using matplotlib. I am trying something like the code below, but I am not able to draw a proper histogram.

My question: How do I pass the total frequency to the graph? All of my bars are at the same height right now, and even the bin centers are not correct. How should I pass data to the ax.hist method in the code below? I am trying to adapt the example from http://matplotlib.org/1.2.1/examples/api/histogram_demo.html.

totalWords = counts.most_common(100)
print(totalWords)
for z in range(len(totalWords)):
    words.append(totalWords[z][0])

x = np.arange(len(words))
#print x
i, s = 100, 15

fig = plt.figure()
ax = fig.add_subplot(111)

n, bins, patches = ax.hist(x, 50, normed=1, facecolor='green', alpha=0.75)


bincenters = 0.5*(bins[1:]+bins[:-1])

y = mlab.normpdf(bincenters*1.00, i, s)
l = ax.plot(bincenters, y, 'r--', linewidth=1)

ax.set_xlabel('Words')
ax.set_ylabel('Frequency')
ax.set_xlim(50, 160)
ax.set_ylim(0, 0.04)
ax.grid(True)

plt.show()

Do you need a histogram or a bar chart? In a histogram, the x-axis is a continuous variable over the real numbers. A bar chart is categorical, like individual words. — Aditya, Jun 07 '15 at 07:38
Thank you for answering, Aditya. I am looking for a histogram only, not a bar chart. — user3314492, Jun 07 '15 at 16:08
@user3314492 tell me the difference between a histogram and a bar chart, according to your knowledge. — Aditya, Jun 08 '15 at 04:44
@Detroitteatime seems so. I'll see how he replies to my comments; only then can we be clear, or else we'll begin flagging. — Aditya, Jun 08 '15 at 04:44

Aditya · Answer 1 · 2015-06-08T04:43:02.830

Perhaps this is what you want. This code produces a bar chart in which each bar represents an individual word and the vertical axis gives the frequency of that word in your text.

The counts array you provided is not sorted as I had expected though.

import numpy as np
import matplotlib.pyplot as plt

counts = [('test', 510), ('Hey', 362), ("please", 753), ('take', 446), ('herbert', 325), ('live', 222), ('hate', 210), ('white', 191), ('simple', 175), ('harry', 172), ('woman', 170), ('basil', 153), ('things', 129), ('think', 126), ('bye', 124), ('thing', 120), ('love', 107), ('quite', 107), ('face', 107), ('eyes', 107), ('time', 106), ('himself', 105), ('want', 105), ('good', 105), ('really', 103), ('away',100), ('did', 100), ('people', 99), ('came', 97), ('say', 97), ('cried', 95), ('looked', 94), ('tell', 92), ('look', 91), ('world', 89), ('work', 89), ('project', 88), ('room', 88), ('going', 87), ('answered', 87), ('mr', 87), ('little', 87), ('yes', 84), ('silly', 82), ('thought', 82), ('shall', 81), ('circle', 80), ('hallward', 80), ('told', 77), ('feel', 76), ('great', 74), ('art', 74), ('dear',73), ('picture', 73), ('men', 72), ('long', 71), ('young', 70), ('lady', 69), ('let', 66), ('minute', 66), ('women', 66), ('soul', 65), ('door', 64), ('hand',63), ('went', 63), ('make', 63), ('night', 62), ('asked', 61), ('old', 61), ('passed', 60), ('afraid', 60), ('night', 59), ('looking', 58), ('wonderful', 58), ('gutenberg-tm', 56), ('beauty', 55), ('sir', 55), ('table', 55), ('turned', 54), ('lips', 54), ("one's", 54), ('better', 54), ('got', 54), ('vane', 54), ('right',53), ('left', 53), ('course', 52), ('hands', 52), ('portrait', 52), ('head', 51), ("can't", 49), ('true', 49), ('house', 49), ('believe', 49), ('black', 49), ('horrible', 48), ('oh', 48), ('knew', 47), ('curious', 47), ('myself', 47)]
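# Split the (word, count) pairs into two parallel lists.
words = [x[0] for x in counts]
values = [int(x[1]) for x in counts]
print(words)
# One bar per word, positioned by its index in the list; the label feeds plt.legend().
mybar = plt.bar(range(len(words)), values, color='green', alpha=0.4, label='Frequency')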

plt.xlabel('Word Index')
plt.ylabel('Frequency')
plt.title('Word Frequency Chart')
plt.legend()

plt.show()

You can see the graph following a Zipfian curve (power-law curve). Modify the code to suit your needs.

[Image: bar chart generated by the code above]
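If you want the bars ordered by frequency and labelled with the words themselves, a minimal sketch could look like the following. It assumes a shortened copy of the list to keep the labels readable; `ranked` and `positions` are names introduced here for illustration and are not part of the code above.

```python
import matplotlib.pyplot as plt

# Shortened stand-in for the full list of (word, count) pairs.
counts = [('test', 510), ('Hey', 362), ('please', 753), ('take', 446),
          ('herbert', 325), ('live', 222), ('hate', 210), ('white', 191)]

# Sort by count, largest first, so the bars fall off from left to right.
ranked = sorted(counts, key=lambda pair: pair[1], reverse=True)
words = [word for word, _ in ranked]
values = [value for _, value in ranked]

positions = list(range(len(ranked)))
fig, ax = plt.subplots()
ax.bar(positions, values, color='green', alpha=0.4)

# Put one tick under each bar and use the word as its label.
ax.set_xticks(positions)
ax.set_xticklabels(words, rotation=90)
ax.set_ylabel('Frequency')
fig.tight_layout()
```

Call plt.show() afterwards to display the figure. With all 100 words the labels get crowded, so you may prefer to label only every few bars.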

Thank you for answering @Aditya. But I am looking for a histogram here. That is why I was not sure what to pass to the .hist method. — user3314492, Jun 07 '15 at 16:11
Dude, a histogram is for numerical quantities. Did you have a look at the graph my code generates? That's exactly what you want. In hist() you have to specify the 'bins', which are intervals of real numbers on the x-axis. Where are the real numbers in the words? — Aditya, Jun 08 '15 at 04:27
Well @user3314492 what's different in this graph from your expectations? — Aditya, Jun 08 '15 at 04:43
This is awesome. How do I adapt this to add the words on the X axis @Aditya? — Dhruv Ghulati, Sep 07 '16 at 15:10
Also how did you get it so there are only labels every 20 increments? — Dhruv Ghulati, Sep 07 '16 at 15:19

score 0 · Answer 2 · answered Jun 07 '15 at 07:53

It's a little unclear exactly what you want to graph, and how relevant the matplotlib demo you are adapting actually is.

I'll run through some options, and try and answer your specific questions in each case:

Using the matplotlib demo, you only need to give ax.hist the list of word frequencies (the second element of each tuple, i.e. words[n][1] for every n), but this just gives you the distribution of the frequencies themselves: most of the words occur fewer than 100 times, while a couple of words occur much more frequently. Your code above returns a histogram of equal bars because you are giving ax.hist the numbers from 0 to 99 once each. Note that this approach doesn't show the individual words.
Otherwise, I think you want a bar chart with each bar labelled as a different word.

This worked for me.

words = [('test', 510), ('Hey', 362), ("please", 753), ('take', 446),     ('herbert', 325), ('live', 222), ('hate', 210), ('white', 191), ('simple', 175),     ('harry', 172), ('woman', 170), ('basil', 153), ('things', 129), ('think', 126), ('bye', 124), ('thing', 120), ('love', 107), ('quite', 107), ('face', 107), ('eyes', 107), ('time', 106), ('himself', 105), ('want', 105), ('good', 105), ('really', 103), ('away',100), ('did', 100), ('people', 99), ('came', 97), ('say', 97), ('cried', 95), ('looked', 94), ('tell', 92), ('look', 91), ('world', 89), ('work', 89), ('project', 88), ('room', 88), ('going', 87), ('answered', 87), ('mr', 87), ('little', 87), ('yes', 84), ('silly', 82), ('thought', 82), ('shall', 81), ('circle', 80), ('hallward', 80), ('told', 77), ('feel', 76), ('great', 74), ('art', 74), ('dear',73), ('picture', 73), ('men', 72), ('long', 71), ('young', 70), ('lady', 69), ('let', 66), ('minute', 66), ('women', 66), ('soul', 65), ('door', 64), ('hand',63), ('went', 63), ('make', 63), ('night', 62), ('asked', 61), ('old', 61), ('passed', 60), ('afraid', 60), ('night', 59), ('looking', 58), ('wonderful', 58), ('gutenberg-tm', 56), ('beauty', 55), ('sir', 55), ('table', 55), ('turned', 54), ('lips', 54), ("one's", 54), ('better', 54), ('got', 54), ('vane', 54), ('right',53), ('left', 53), ('course', 52), ('hands', 52), ('portrait', 52), ('head', 51), ("can't", 49), ('true', 49), ('house', 49), ('believe', 49), ('black', 49), ('horrible', 48), ('oh', 48), ('knew', 47), ('curious', 47), ('myself', 47)]
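import matplotlib.pyplot as plt

# Map each word to its count. Note that 'night' appears twice in the list,
# so the later count overwrites the earlier one and 99 bars are drawn.
wordsdict = {}
for w in words:
    wordsdict[w[0]] = w[1]

# Wrap the dict views in list() so they are passed as plain sequences on Python 3.
plt.bar(range(len(wordsdict)), list(wordsdict.values()), align='center')
# Label each bar with its word.
plt.xticks(range(len(wordsdict)), list(wordsdict.keys()))

plt.show()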
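For the first option (a true histogram of the counts), a minimal sketch might look like this. It assumes a shortened stand-in list called `top_words` (a name made up for this example) and a bin count of 5 chosen arbitrarily. `np.histogram` is used only to show why the question's code produces equal bars; `ax.hist` computes its bins the same way by default.

```python
import numpy as np
import matplotlib.pyplot as plt

# Shortened stand-in for the question's list of (word, count) pairs.
top_words = [('please', 753), ('test', 510), ('take', 446), ('Hey', 362),
             ('herbert', 325), ('live', 222), ('hate', 210), ('white', 191),
             ('love', 107), ('time', 106), ('people', 99), ('world', 89)]

# What the question's code passes: the indices 0..99, one of each.
# With 50 equal-width bins, every bin receives exactly two indices,
# which is why all the bars come out at the same height.
flat_counts, _ = np.histogram(np.arange(100), bins=50)

# What option 1 passes instead: the occurrence counts themselves.
freqs = [count for _, count in top_words]

fig, ax = plt.subplots()
n, bins, patches = ax.hist(freqs, bins=5, facecolor='green', alpha=0.75)
ax.set_xlabel('Occurrences of a word')
ax.set_ylabel('Number of words')
ax.grid(True)
```

Call plt.show() to display it. Each bar now counts how many words fall within a range of occurrence counts, so the individual words are no longer visible.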

Thank you for the answer @TMrtSmith. I am looking for a histogram only, not a bar chart. I want a graph the same as or similar to the one in the example I linked. — user3314492, Jun 07 '15 at 16:09
