Stacked bar plot of large data in python

Question

I would like to plot a stacked bar plot from a csv file in python. I have three columns of data

year word frequency
2018 xyz 12
2017 gfh 14
2018 sdd 10
2015 fdh 1
2014 sss 3
2014 gfh 12
2013 gfh 2
2012 gfh 4
2011 wer 5
2010 krj 4
2009 krj 4
2019 bfg 4
... 300+ rows of data.

I need to go through all the data and plot a stacked bar plot which is categorized based on the year, so x axis is word and y axis is frequency, the legend color should show year wise. I want to see how the evolution of each word occured year wise. Some of the technology words are repeatedly used in every year and hence the stack bar graph should add the values on top and plot, for example the word gfh initially plots 14 for year 2017, and then in year 2014 I want the gfh word to plot (in a different color) for a value of 12 on top of the gfh of 2017. How do I do this? So far I called the csv file in my code. But I don't understand how could it go over all the rows and stack the words appropriately (as some words repeat through all the years). Any help is highly appreciated. Also the years are arranged in random order in csv but I sorted them year wise to make it easier. I am just learning python and trying to understand this plotting routine since i have 40 years of data and ~20 words. So I thought stacked bar plot is the best way to represent them. Any other visualisation method is also welcome.

Alex · Accepted Answer · 2020-04-12T21:31:27.760

1

This can be done using pandas:

import pandas as pd

df = pd.read_csv("file.csv")

# Aggregate data
df = df.groupby(["word", "year"], as_index=False).agg({"frequency": "sum"})
# Create list to sort by
sorter = (
    df.groupby(["word"], as_index=False)
    .agg({"frequency": "sum"})
    .sort_values("frequency")["word"]
    .values
)

# Pivot, reindex, and plot
df = df.pivot(index="word", columns="year", values="frequency")
df = df.reindex(sorter)
df.plot.bar(stacked=True)

Which outputs:

edited Apr 12 '20 at 21:31

answered Apr 12 '20 at 20:20

Alex

6,610
3
20
38

I get a error as this ValueError: Index contains duplicate entries, cannot reshape. Could you tell me what am I doing wrong – astronaut Apr 12 '20 at 20:27
This means that your csv file has rows that are duplicated on `word` and `year`. You'll need to combine them first, I've updated my answer. – Alex Apr 12 '20 at 20:34
Hi thanks I was able to plot it. How can I prevent the legend from overlapping on my values? – astronaut Apr 12 '20 at 20:53
https://stackoverflow.com/questions/4700614/how-to-put-the-legend-out-of-the-plot – Alex Apr 12 '20 at 20:54
Thank you and is there a way I can have the plot in ascending order based on stack height? – astronaut Apr 12 '20 at 21:06
And my final question is since I have 40 years(1981-2020), I feel the legend color repeats quite a bit. Is there a way that I could prevent overlapping of the colors so the user can read the year clearly? – astronaut Apr 12 '20 at 21:24
I've updated the answer to reflect a different sort. I'm not sure about how to best display 40 years of data, colour is probably not the best method. I would ask another question! – Alex Apr 12 '20 at 21:32

Stacked bar plot of large data in python

1 Answers1