I have a long list of words, and I want to generate a histogram of the frequency of each word in my list. I was able to do that in the code below:

import csv
from collections import Counter
import numpy as np
import matplotlib.pyplot as plt

word_list = ['A','A','B','B','A','C','C','C','C']

counts = Counter(word_list)

labels, values = zip(*counts.items())

indexes = np.arange(len(labels))

plt.bar(indexes, values)
plt.show()

However, it doesn't display the bars ranked by frequency (highest frequency in the first bar on the left, and so on), even though printing counts shows them in that order: Counter({'C': 4, 'A': 3, 'B': 2}). How can I achieve that?

Cleb
BKS

1 Answer

You can achieve the desired output by sorting your data first and then passing the ordered arrays to bar; below I use numpy.argsort for that. The resulting plot shows the bars in descending order of frequency (I also added the labels to the bars):

[Figure: bar chart of the sorted word counts]
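
As background on why the printed order and the plotted order differ, here is a small sketch, assuming CPython 3.7 or later. Iterating over a Counter follows the order in which keys were first seen, while the printed representation is ranked by count. The variable names `iteration_order` and `ranked_order` are only illustrative.

```python
from collections import Counter

word_list = ['A', 'A', 'B', 'B', 'A', 'C', 'C', 'C', 'C']
counts = Counter(word_list)

# Iterating a Counter (e.g. via items()) follows first-seen order,
# which is what zip(*counts.items()) hands to the plot.
iteration_order = list(counts.items())

# most_common() ranks the entries by count, highest first;
# the printed form of a Counter shows the entries in this ranked order.
ranked_order = counts.most_common()

print(iteration_order)
print(ranked_order)
```

So the ranking seen when printing is only a display convenience; the data passed to bar still has to be sorted explicitly.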

Here is the code that produces the plot with a few inline comments:

from collections import Counter
import numpy as np
import matplotlib.pyplot as plt

word_list = ['A', 'A', 'B', 'B', 'A', 'C', 'C', 'C', 'C']

counts = Counter(word_list)

labels, values = zip(*counts.items())

# sort your values in descending order
indSort = np.argsort(values)[::-1]

# rearrange your data
labels = np.array(labels)[indSort]
values = np.array(values)[indSort]

indexes = np.arange(len(labels))

plt.bar(indexes, values)

# add labels; bars are centered on indexes by default (matplotlib 2.0 and later)
plt.xticks(indexes, labels)
plt.show()
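
As an alternative to numpy.argsort, the sorting can probably also be done with Counter.most_common(), which already returns the entries ranked by count. The following is a minimal sketch of that idea, not the approach used above; the helper name `sorted_counts` is hypothetical.

```python
from collections import Counter


def sorted_counts(words):
    """Return (labels, values) ordered from most to least frequent."""
    # most_common() with no argument returns all (word, count) pairs,
    # sorted by count in descending order.
    ranked = Counter(words).most_common()
    if not ranked:
        # zip(*[]) yields nothing to unpack, so handle empty input explicitly
        return (), ()
    labels, values = zip(*ranked)
    return labels, values


labels, values = sorted_counts(['A', 'A', 'B', 'B', 'A', 'C', 'C', 'C', 'C'])
print(labels, values)
```

The resulting labels and values can then be passed to plt.bar and plt.xticks in the same way as the sorted arrays.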

If you want to plot only the n most frequent entries, you can replace the line

counts = Counter(word_list)

with

counts = dict(Counter(word_list).most_common(n))

In the case above, counts would then be

{'C': 4, 'A': 3}

for n = 2.

If you would like to remove the frame of the plot and label the bars directly, you can check this post.

Cleb
  • I have more than 4000 words to count. How can I generate a word frequency histogram of only the top 20 words? –  Dec 14 '17 at 08:28
  • @AAKM: You can use `counts.most_common(20)` i.e. `counts = Counter(word_list).most_common(20)`. – Cleb Dec 14 '17 at 08:34
  • AttributeError Traceback (most recent call last) in () 5 counts = Counter(df['Text']).most_common(10) 6 ----> 7 labels, values = zip(*counts.items()) 8 9 # sort your values in descending order AttributeError: 'list' object has no attribute 'items' –  Dec 14 '17 at 08:47
  • @AAKM: True, `most_common` returns a list, not a dictionary, I updated the post. So, `dict(Counter(word_list).most_common(20))` should work for you now. – Cleb Dec 14 '17 at 08:55
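
To illustrate the error discussed in the comments, here is a short sketch using the small example list and n = 2 instead of 20. It shows that most_common returns a list of (word, count) tuples, and two ways to continue from there. The names `top`, `has_items` and `top_dict` are only illustrative.

```python
from collections import Counter

word_list = ['A', 'A', 'B', 'B', 'A', 'C', 'C', 'C', 'C']

# most_common(n) returns a list of (word, count) tuples, not a dict
top = Counter(word_list).most_common(2)

# A list has no items() method, which is what raises the AttributeError
has_items = hasattr(top, 'items')

# Option 1: convert back to a dict so that counts.items() works again
top_dict = dict(top)

# Option 2: skip the dict and unzip the list of tuples directly
labels, values = zip(*top)

print(top, has_items, top_dict, labels, values)
```

With option 2 the entries are already ranked by count, so the argsort step would not be needed for the top-n case.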