
I have a survey file in which rows are observations and columns are questions.

Here is some fake data showing what it looks like:

People,Food,Music,People
P1,Very Bad,Bad,Good
P2,Good,Good,Very Bad
P3,Good,Bad,Good
P4,Good,Very Bad,Very Good
P5,Bad,Good,Very Good
P6,Bad,Good,Very Good

My aim is to create this kind of plot with ggplot2.

  • I absolutely don't care about the colors, design, etc.
  • The plot doesn't correspond to the fake data.

[Image: example of the desired plot]

Here are my fake data:

raw <- read.csv("http://pastebin.com/raw.php?i=L8cEKcxS",sep=",")
raw[,2]<-factor(raw[,2],levels=c("Very Bad","Bad","Good","Very Good"),ordered=FALSE)
raw[,3]<-factor(raw[,3],levels=c("Very Bad","Bad","Good","Very Good"),ordered=FALSE)
raw[,4]<-factor(raw[,4],levels=c("Very Bad","Bad","Good","Very Good"),ordered=FALSE)

But if I put the count on the Y axis, I don't know what to choose for the X and Group values. I don't know whether I can succeed without reshape2. I've also tried to use the melt function from reshape, but I don't understand how to use it.

Julius Vainora
S12000

2 Answers


EDIT: Many years later

For a pure ggplot2 + utils::stack() solution, see the answer by @markus!


A somewhat verbose tidyverse solution, with all non-base packages explicitly stated so that you know where each function comes from:

library(magrittr) # needed for %>% if dplyr is not attached

"http://pastebin.com/raw.php?i=L8cEKcxS" %>%
  utils::read.csv(sep = ",") %>%
  tidyr::pivot_longer(cols = c(Food, Music, People.1),
                      names_to = "variable",
                      values_to = "value") %>%
  dplyr::group_by(variable, value) %>%
  dplyr::summarise(n = dplyr::n()) %>%
  dplyr::mutate(value = factor(
    value,
    levels = c("Very Bad", "Bad", "Good", "Very Good"))
  ) %>%
  ggplot2::ggplot(ggplot2::aes(variable, n)) +
  ggplot2::geom_bar(ggplot2::aes(fill = value),
                    position = "dodge",
                    stat = "identity")

The original answer:

First you need to get the counts for each category, i.e. how many Bads, Goods and so on there are for each group (Food, Music, People). This can be done like so:

raw <- read.csv("http://pastebin.com/raw.php?i=L8cEKcxS",sep=",")
raw[,2]<-factor(raw[,2],levels=c("Very Bad","Bad","Good","Very Good"),ordered=FALSE)
raw[,3]<-factor(raw[,3],levels=c("Very Bad","Bad","Good","Very Good"),ordered=FALSE)
raw[,4]<-factor(raw[,4],levels=c("Very Bad","Bad","Good","Very Good"),ordered=FALSE)

raw=raw[,c(2,3,4)] # getting rid of the "people" variable as I see no use for it

freq=table(col(raw), as.matrix(raw)) # get the counts of each factor level

Then you need to create a data frame out of it, melt it and plot it:

library(reshape2)  # provides melt()
library(ggplot2)

Names=c("Food","Music","People")     # create vector of names
data=data.frame(cbind(freq),Names)   # combine them into a data frame
data=data[,c(5,3,1,2,4)]             # sort columns: Names, Very.Bad, Bad, Good, Very.Good

# melt the data frame for plotting
data.m <- melt(data, id.vars='Names')

# plot everything
ggplot(data.m, aes(Names, value)) +
  geom_bar(aes(fill = variable), position = "dodge", stat="identity")

Is this what you're after?

[Image: the resulting plot]

To clarify a little: in your earlier question, "ggplot multiple grouping bar", you had a data frame that looked like this:

> head(df)
  ID Type Annee X1PCE X2PCE X3PCE X4PCE X5PCE X6PCE
1  1    A  1980   450   338   154    36    13     9
2  2    A  2000   288   407   212    54    16    23
3  3    A  2020   196   434   246    68    19    36
4  4    B  1980   111   326   441    90    21    11
5  5    B  2000    63   298   443   133    42    21
6  6    B  2020    36   257   462   162    55    30

Since you have numerical values in columns 4-9, which are later plotted on the y axis, that data frame can easily be melted with reshape and plotted.

For our current data set, we needed something similar, so we used freq=table(col(raw), as.matrix(raw)) to get this:

> data
   Names Very.Bad Bad Good Very.Good
1   Food        7   6    5         2
2  Music        5   5    7         3
3 People        6   3    7         4

Just imagine you have Very.Bad, Bad, Good and so on instead of X1PCE, X2PCE, X3PCE. See the similarity? But we needed to create such a structure first, hence freq=table(col(raw), as.matrix(raw)).
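To see what the counting step does, here is a minimal base R sketch. The `toy` data frame is made up for illustration and is not the pastebin data. `col()` returns the column index of every cell, `as.matrix()` returns the answer in every cell as text, and `table()` cross-tabulates the two.

```r
# Hypothetical three-row survey, for illustration only (not the pastebin data)
lv <- c("Very Bad", "Bad", "Good", "Very Good")
toy <- data.frame(
  Food  = factor(c("Good", "Bad", "Good"), levels = lv),
  Music = factor(c("Bad", "Bad", "Very Good"), levels = lv)
)

col(toy)        # column index of every cell (1 = Food, 2 = Music)
as.matrix(toy)  # the answer in every cell, converted to character

# Cross-tabulate "which column" against "which answer":
# one row per question, one column per answer that actually occurs
freq <- table(col(toy), as.matrix(toy))
freq
```

Row 1 (Food) counts two Good and one Bad; row 2 (Music) counts two Bad and one Very Good. Because `as.matrix()` turns the factors into plain text, the columns come out in alphabetical order and a level that never occurs (here "Very Bad") gets no column at all, which is why the answer reorders the columns by hand afterwards.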

jakub
  • Hello, thank you, this is exactly what I want. I just have one question: is it also possible to avoid `raw=raw[,c(2,3,4)]` and `freq=table(col(raw), as.matrix(raw))` and do everything with reshape? I had the same kind of issue in http://stackoverflow.com/questions/17303573/ggplot-multiple-grouping-bar and in that post I only used reshape. I'm confused about it. – S12000 Aug 10 '13 at 13:41
  • Well, I'm not sure. The `raw=raw[,c(2,3,4)]` is there only because it makes no sense to include the observation indicator (you do not plot individual observations in the subsequent plot). Therefore, the counts are the only thing that matters. Whether you can do it all with `reshape`, I don't know. My guess is that you can't. – jakub Aug 10 '13 at 13:47
  • Well, actually, the data in this current post is different in that it does not contain the numerical counts. Have a look at columns 4-9 in the data frame from the post you are linking to: they contain numerical values, which Didzis subsequently melted to create the `value` variable in the melted data frame. We did not have any values, so we needed to create them first. Hence `freq=table(col(raw), as.matrix(raw))`. (I added a more extensive explanation at the end of my answer.) – jakub Aug 10 '13 at 14:13
  • Ah, true, I got it. Basically, with categorical data like in this post there is one more step. Thanks for your very good explanation. – S12000 Aug 10 '13 at 14:26
  • Sorry to disturb again, I have another question: do you know if it is possible to display the frequency (or percentage) on each bar? – S12000 Aug 13 '13 at 03:15
  • Thanks ;-) Maybe you can find your answer [here](http://stackoverflow.com/questions/2551921/show-frequencies-along-with-barplot-in-ggplot2) – jakub Aug 13 '13 at 12:05
  • Or [maybe here](http://stackoverflow.com/questions/10327267/annotation-above-bars) – jakub Aug 13 '13 at 12:19
  • Sorry, I succeeded for a "simple" barplot, but I can't figure out how to do it with a multiple barplot when "melt" is used. Thanks – S12000 Aug 20 '13 at 15:54
  • What would the percentage be equal to? Would it be for instance equal to (number of "Very good" in "Food") / (total number of answers in "Food") ? – jakub Aug 24 '13 at 08:56
  • Anyway, the most straightforward way I can think of is to count the percentage beforehand on your melted data frame (especially in this case, where you already have frequencies of answers). For instance: `ddply(data.m, .(Names), summarize, ratio=value/sum(value))` will calculate the percentage I mentioned in my previous comment. Then you can use something like `geom_text(aes(label = sprintf("%1.2f%%", 100*ratio),x = variable,y = value),position = position_dodge(width = 0.8), vjust=-.6)` to display those in the plot. – jakub Aug 24 '13 at 09:55
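The labelling idea from the last comment can be sketched as follows. This is only one possible variant under stated assumptions: the counts are copied by hand from the `data` table printed in the answer, base R's `ave()` is used in place of `ddply()` so that the `ratio` column stays in the same data frame as the counts, and the dodge width of 0.9 is my choice so that bars and labels line up.

```r
library(ggplot2)

# Counts copied from the `data` table printed in the answer, already in long form
data.m <- data.frame(
  Names    = rep(c("Food", "Music", "People"), times = 4),
  variable = factor(rep(c("Very.Bad", "Bad", "Good", "Very.Good"), each = 3),
                    levels = c("Very.Bad", "Bad", "Good", "Very.Good")),
  value    = c(7, 5, 6,   6, 5, 3,   5, 7, 7,   2, 3, 4)
)

# Share of each answer within its question; the shares of one question sum to 1
data.m$ratio <- ave(data.m$value, data.m$Names, FUN = function(v) v / sum(v))

# Bars and labels use the same dodge width so each label sits above its own bar
p <- ggplot(data.m, aes(Names, value, fill = variable)) +
  geom_bar(stat = "identity", position = position_dodge(width = 0.9)) +
  geom_text(aes(label = sprintf("%1.0f%%", 100 * ratio), group = variable),
            position = position_dodge(width = 0.9), vjust = -0.5)
p
```

Mapping `x` to `Names` in both layers keeps the labels in the same groups as the bars; to show raw frequencies instead of percentages, `label = value` would be enough.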

In @jakub's answer the calculations are done before the data is passed to ggplot(), which is why the stat in geom_bar is set to "identity" (i.e. take the data as is and do nothing with it).

Another approach is to let ggplot do the counting for you by using stat = "count", which is the default for geom_bar:

library(ggplot2)
ggplot(stack(df1[, -1]), aes(ind, fill = values)) +
  geom_bar(position = "dodge")

[Image: the resulting plot]
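For readers unfamiliar with `stack()`, here is a small sketch of what it returns. The `toy` data frame is hypothetical, and converting `values` to a factor is an optional extra step (not part of the answer above) that should put the legend in rating order instead of alphabetical order.

```r
library(ggplot2)

# Hypothetical four-row survey, for illustration only
toy <- data.frame(
  Food  = c("Good", "Bad", "Good", "Very Bad"),
  Music = c("Bad", "Bad", "Very Good", "Good"),
  stringsAsFactors = FALSE  # stack() only keeps plain vector columns, so no factors here
)

# Long format: `values` holds the answer, `ind` holds the name of the source column
long <- stack(toy)

# Optional: fix the order of the answer categories for the legend and the bars
long$values <- factor(long$values,
                      levels = c("Very Bad", "Bad", "Good", "Very Good"))

# geom_bar() counts the rows per (ind, values) combination itself
ggplot(long, aes(ind, fill = values)) +
  geom_bar(position = "dodge")
```

Each row of `long` is one answer to one question, so no separate counting step is needed before plotting.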

data

df1 <- read.csv(text = "People,Food,Music,People
P1,Very Bad,Bad,Good
P2,Good,Good,Very Bad
P3,Good,Bad,Good
P4,Good,Very Bad,Very Good
P5,Bad,Good,Very Good
P6,Bad,Good,Very Good
P7,Bad,Very Bad,Good
P8,Very Good,Very Bad,Good
P9,Very Bad,Good,Bad
P10,Bad,Good,Very Bad
P11,Good,Bad,Very Bad
P12,Very Bad,Bad,Very Good
P13,Bad,Very Good,Bad
P14,Bad,Very Good,Very Bad
P15,Good,Good,Good
P16,Very Bad,Very Good,Very Bad
P17,Very Bad,Good,Good
P18,Very Bad,Very Bad,Bad
P19,Very Good,Very Bad,Very Bad
P20,Very Bad,Bad,Good", header = TRUE)
markus