words frequency using pandas and matplotlib

Question

How can I plot a word-frequency histogram (for the author column) using pandas and matplotlib from a CSV file? My CSV looks like: id, author, title, language. Sometimes the author column contains more than one author, separated by spaces.

import pandas as pd

file = 'c:/books.csv'
sheet = open(file)
df = pd.read_csv(sheet)
print(df['author'])

For opening a file you should use the `with open(path) as f: ...` idiom. It is not necessary here, though: `pandas.read_csv()` can take a path in the first place. Also, be precise in your question. The column name is 'author', not 'authors', right? — Dr. Jan-Philip Gehrcke, Mar 10 '14 at 15:08
"Sometimes I have more than one authors in author column separated by space" -- you really should have made that clear from the beginning. Can you show an example? — Dr. Jan-Philip Gehrcke, Mar 10 '14 at 15:21

score 6 · Accepted Answer · edited May 23 '17 at 11:54

Use collections.Counter for creating the histogram data, and follow the example given here, i.e.:

from collections import Counter
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Read CSV file, get author names and counts.
df = pd.read_csv("books.csv", index_col="id")
counter = Counter(df['author'])
author_names = list(counter.keys())
author_counts = list(counter.values())

# Plot histogram using matplotlib bar().
# Bars are centered on `indexes` by default (matplotlib >= 2.0),
# so the tick positions need no offset.
indexes = np.arange(len(author_names))
width = 0.7
plt.bar(indexes, author_counts, width)
plt.xticks(indexes, author_names)
plt.show()

With this test file:

$ cat books.csv 
id,author,title,language
1,peter,t1,de
2,peter,t2,de
3,bob,t3,en
4,bob,t4,de
5,peter,t5,en
6,marianne,t6,jp

the code above creates the following graph:

[image: bar chart of the number of books per author]
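
If the bars should appear in descending order of frequency, `Counter.most_common()` returns the (name, count) pairs already sorted. A minimal sketch, assuming the same six authors as in the test file above; the list is written inline here instead of being read from books.csv:

```python
from collections import Counter

# Inline stand-in for df['author'] from the test file above (assumption).
authors = ["peter", "peter", "bob", "bob", "peter", "marianne"]

counter = Counter(authors)

# most_common() yields (name, count) pairs sorted by descending count.
pairs = counter.most_common()
author_names = [name for name, _ in pairs]
author_counts = [count for _, count in pairs]

print(author_names)   # ['peter', 'bob', 'marianne']
print(author_counts)  # [3, 2, 1]
```

The two lists can be passed to `plt.bar()` and `plt.xticks()` in place of the `keys()` and `values()` results used above.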

Edit:

You added a secondary condition, where the author column might contain multiple space-separated names. The following code handles this:

from itertools import chain

# Read CSV file, split multi-author cells, and count each name.
df = pd.read_csv("books2.csv", index_col="id")
authors_notflat = [a.split() for a in df['author']]
counter = Counter(chain.from_iterable(authors_notflat))
print(counter)

For this example:

$ cat books2.csv 
id,author,title,language
1,peter harald,t1,de
2,peter harald,t2,de
3,bob,t3,en
4,bob,t4,de
5,peter,t5,en
6,marianne,t6,jp

it prints

$ python test.py 
Counter({'peter': 3, 'bob': 2, 'harald': 2, 'marianne': 1})

Note that this code only works because strings are iterable.

This code is essentially free of pandas, except for the CSV-parsing part that creates the DataFrame df. If you need the default plot styling of pandas, there is also a suggestion in the mentioned thread.
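
For comparison, the same split-and-count step can also be written with pandas string methods. This is only a sketch, not part of the original answer: it assumes pandas 0.25 or newer (for `Series.explode`), no missing values in the author column, and it inlines the contents of books2.csv so that it runs on its own:

```python
import io

import pandas as pd

# Inline copy of books2.csv so the sketch is self-contained (assumption).
csv_text = """id,author,title,language
1,peter harald,t1,de
2,peter harald,t2,de
3,bob,t3,en
4,bob,t4,de
5,peter,t5,en
6,marianne,t6,jp
"""
df = pd.read_csv(io.StringIO(csv_text), index_col="id")

# Split each cell on whitespace, give every name its own row, then count.
author_counts = df["author"].str.split().explode().value_counts()
print(author_counts)
```

The result is a Series indexed by author name, so it can be plotted directly with the pandas plotting methods.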

Thank you very much, quite helpful – DevEx Mar 10 '14 at 15:31

score 4 · Answer 2 · answered Mar 10 '14 at 16:54

You can count up the number of occurrences of each name using value_counts:

In [11]: df['author'].value_counts()
Out[11]: 
peter       3
bob         2
marianne    1
dtype: int64

Series (and DataFrames) have a hist method for drawing histograms:

In [12]: df['author'].value_counts().hist()

answered Mar 10 '14 at 16:54

Andy Hayden


I am loving Pandas, but still learning all about it! Does Pandas still give you control over the x-axis labels, as shown in the previous answer? – Nicole Goebel Jan 03 '15 at 20:06
I just realized that counting values and plotting as a bar plot does the trick! df['author'].value_counts().plot(kind='bar') Now I just need to rotate the x axis labels! – Nicole Goebel Jan 03 '15 at 20:15
A really nice way. The hist() part isn't working for me though. Can anybody help me out ? – akki May 20 '15 at 11:27
@akki I suspect it's how you've set up matplotlib, does .plot() work? – Andy Hayden May 20 '15 at 16:41
I tried df['author'].value_counts().hist().plot() and then df['author'].value_counts().hist().show() but they don't seem to work for me. – akki May 21 '15 at 11:44
@akki sorry, I meant `df['author'].value_counts().plot()` or just `df.plot()`. Could you clarify what you mean by "does not work"? May be worth asking new question! – Andy Hayden May 21 '15 at 14:56
This solution is only usable for calculating the histogram with value_counts(). It is not usable for plotting, since hist() tries to histogram the result of value_counts(), which is already a histogram – Yuval Atzmon Jan 06 '17 at 10:53
@user2476373 looks like I didn't update the answer with the comment, to just use plot: `df['author'].value_counts().plot(kind='bar')` – Andy Hayden Jan 06 '17 at 17:51
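
The bar-plot approach suggested in the comments above, including the rotated x-axis labels, might look like the following. It is a sketch under assumptions: the author column is written inline instead of being read from books.csv, and the `rot=45` angle and the y-axis label are illustrative choices, not something the thread specifies:

```python
import matplotlib.pyplot as plt
import pandas as pd

# Inline stand-in for the 'author' column of books.csv (assumption).
authors = pd.Series(
    ["peter", "peter", "bob", "bob", "peter", "marianne"], name="author"
)

# value_counts() already holds the frequencies, sorted in descending order.
counts = authors.value_counts()

# Draw one bar per author; rot=45 rotates the x-axis tick labels.
ax = counts.plot(kind="bar", rot=45)
ax.set_ylabel("number of books")
plt.tight_layout()
# plt.show() would display the figure in an interactive session.
```

Calling `hist()` on the counts would instead bin the count values themselves (3, 2, 1), which is why the bar plot is the one that shows a bar per author.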
