
I have a categorical scatter plot like this:

enter image description here

which I generated in R with the following code (using the ggplot2 library):

library(ggplot2)
data <- runif(50, 13, 17)
factors <- as.factor(sample(1:3, 50, replace = TRUE))
groups <- as.factor(sample(1:3, 50, replace = TRUE))
data_table <- data.frame(data, factors)
g <- ggplot(data_table, aes(y = data_table[, 1], x = data_table[, 2], colour = groups)) + geom_point(size = 1.5)

I am trying to add a mean line for each x-group, but I can't find the right way to do it. I have already tried the procedure described in this question, but it doesn't work. I suspect this is because each of my x-groups consists of a single x-value, for which I believe the procedure should be different.

In more detail, if I add:

+ geom_line(stat = "hline", yintercept = "mean", aes(colour = data_table[, 2]))

to the previous line of code, I get the following error: geom_path: Each group consist of only one observation. Do you need to adjust the group aesthetic?

If I try the procedure suggested in the answer to that question, by adding:

+ geom_errorbar(stat = "hline", yintercept = "mean", width=0.8, aes(ymax=..y..,ymin=..y..))

to my initial code (I removed the geom_jitter(position = position_jitter(width = 0.4)) call because it added randomly displaced points to my plot), I get three lines for each x-group (one for the mean of each of the three colour groups, shown in red, green and blue, within that specific x-group), as shown in this picture:

enter image description here

Does anyone have any suggestion on how to fix this?

Thank you.

selenocysteine
  • Those lines are not random; they are the means of the `group` variable within each level of `factor(data[,8])`. If you just want the mean of each level of `factor(data[,8])`, you have to delete the `aes(colour = group)` in the `geom_line` part. – Jaap Jun 30 '14 at 10:05
  • A side note: it's always better to include some example data in your question. See [this question on how to give a reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example). – Jaap Jun 30 '14 at 10:08
  • Thank you for the side note, I did not add examples because they come from proprietary data, but I will try to add some now. You are right about the nature of those lines, but there was a misunderstanding: they are not plotted when I add `geom_line`, but only when I add `+ geom_errorbar(stat = "hline", yintercept = "mean", width=0.8,aes(ymax=..y..,ymin=..y..))`. If I try to remove the `aes` from this part, R returns an error message (geom_errorbar requires the following missing aesthetics: ymin, ymax). – selenocysteine Jun 30 '14 at 11:08
  • See my answer. Is this what you're looking for? – Jaap Jun 30 '14 at 11:59
  • Why don't you use a box plot? – rrs Jun 30 '14 at 15:02
  • @rrs A good suggestion (see my updated answer). However, the middle line represents the median and not the mean. As OP wants to plot the mean, I think my first solution is better than a boxplot. – Jaap Jun 30 '14 at 15:35

1 Answer


The following code should give you the desired result:

# creating reproducible data
set.seed(1)
data <- runif(50, 13, 17)
factors <- as.factor(sample(1:3, 50, replace = TRUE))
groups <- as.factor(sample(1:3, 50, replace = TRUE))
data_table <- data.frame(data, factors, groups)

# creating the plot
ggplot(data=data_table, aes(x=factor(factors), y=data, color=groups)) + 
  geom_point() +
  geom_errorbar(stat = "hline", yintercept = "mean", width=0.6, aes(ymax=..y.., ymin=..y.., group=factor(factors)), color="black")

which gives: enter image description here

Checking whether the means are correct:

> by(data_table$data, data_table$factors, mean)
data_table$factors: 1
[1] 15.12186
------------------------------------------------------------------------------------------------- 
data_table$factors: 2
[1] 15.03746
------------------------------------------------------------------------------------------------- 
data_table$factors: 3
[1] 15.24869

which leads to the conclusion that the means are correctly displayed in the plot.
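
On recent ggplot2 releases the `hline` statistic used above is no longer available, so here is a minimal sketch of the same idea with `stat_summary()`. It assumes a ggplot2 version that accepts the `fun`, `fun.min` and `fun.max` arguments (older versions used `fun.y` and friends), and the object name `p` is only illustrative. Setting all three summary functions to `mean` collapses the error bar into a single horizontal line at the group mean.

```r
library(ggplot2)

# same reproducible data as above, built directly as a data frame
set.seed(1)
data_table <- data.frame(
  data    = runif(50, 13, 17),
  factors = as.factor(sample(1:3, 50, replace = TRUE)),
  groups  = as.factor(sample(1:3, 50, replace = TRUE))
)

# points coloured by 'groups', plus one black mean line per level of 'factors'
p <- ggplot(data_table, aes(x = factors, y = data, colour = groups)) +
  geom_point() +
  stat_summary(
    aes(group = factors),   # summarise per x-level, not per colour group
    fun = mean, fun.min = mean, fun.max = mean,
    geom = "errorbar", width = 0.6, colour = "black"
  )

p
```

The `group = factors` mapping is what prevents the three-lines-per-x-value effect from the question: without it, the summary would be computed separately for every combination of `factors` and `groups`. Note that the random draws, and therefore the means, may differ from the output shown above depending on the R version.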


Following the suggestion of @rrs, you could also combine it with a boxplot:

ggplot(data=data_table, aes(x=factor(factors), y=data, color=groups)) + 
  geom_boxplot(aes(middle=mean(data), color=NULL)) +
  geom_point(size=2.5)

which gives: enter image description here

However, the middle line of each box still represents the median and not the mean, despite the `middle=mean(data)` mapping.
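
If both the boxplot and the mean are wanted, one possible workaround is to leave the box as it is and overlay the mean as a separate marker. The sketch below assumes a ggplot2 version where `stat_summary()` accepts `fun`; the cross marker (`shape = 4`) and the name `p_box` are arbitrary illustrative choices, not part of the original answer.

```r
library(ggplot2)

# same reproducible data as above
set.seed(1)
data_table <- data.frame(
  data    = runif(50, 13, 17),
  factors = as.factor(sample(1:3, 50, replace = TRUE)),
  groups  = as.factor(sample(1:3, 50, replace = TRUE))
)

# boxplot per level of 'factors' (middle line = median),
# raw points coloured by 'groups', and a black cross at each mean
p_box <- ggplot(data_table, aes(x = factors, y = data)) +
  geom_boxplot() +
  geom_point(aes(colour = groups), size = 2.5) +
  stat_summary(fun = mean, geom = "point", shape = 4, size = 4, stroke = 1.5)

p_box
```

Because `colour = groups` is mapped only inside `geom_point()`, the boxplot and the mean marker are computed once per level of `factors` rather than once per colour group.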

Jaap
  • [It looks like](https://github.com/hadley/ggplot2/issues/1259) there may not be a `hline` statistic in future versions of ggplot2. As far as I can tell you get the same result using `stat = "summary", fun.y = "mean"` instead of `stat = "hline", yintercept = "mean"`. – aosmith Aug 12 '15 at 21:41