-1

I'd think this would be fairly fundamental, but I can't find how to do it in any of my introductory texts or by googling. I want to plot the mean of a continuous variable against a categorical variable, grouped by a factor. The continuous variable is 'cd' (blood CD4 protein), the categorical variable is year (1 to 10 years), and the factor is failure (0 or 1). My dataset is 'F3'.

I've used aggregate to get the mean cd by year, but can't find how to group that by failure (0 = no, 1 = yes). I would prefer to use ggplot.

The plot I get from this:

ggplot(F3, aes(factor(year), mean(cd), color = factor(failure))) + 
geom_line()    + 
geom_point(size=2)

is a horizontal line (or two overlaid lines), with a legend indicating the grouping by failure. So it's not plotting the mean cd by year, just the overall mean. Please help.

Data:

F3 <- structure(list(year = structure(c(6L, 7L, 8L, 9L, 10L, 1L, 2L, 
3L, 4L, 5L, 6L), .Label = c("1", "2", "3", "4", "5", "6", "7", 
"8", "9", "10"), class = "factor"), cd = c(555L, 511L, 540L, 
596L, 553L, 142L, 173L, 271L, 163L, 108L, 61L), failure = structure(c(1L, 
1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("0", "1"), class = "factor")), .Names = c("year", 
"cd", "failure"), row.names = c("1", "2", "3", "4", "5", "6", 
"7", "8", "9", "10", "11"), class = "data.frame")
mpalanco
JuanTamad

2 Answers

0

Still unsure, but perhaps this is what you want to do? Using the larger dataset:

library(ggplot2)
library(dplyr)

F4 <- F3 %>% group_by(year, failure) %>% summarize(cd = mean(cd))

ggplot(F4, aes(year, cd, color = failure, group = failure)) +
  geom_point() + geom_line()

Including standard error of the mean:

F4 <- F3 %>% group_by(year, failure) %>% 
  summarize(mean.cd = mean(cd), se = sd(cd) / sqrt(n()))
F4$failure <- factor(F4$failure)

pos <- position_dodge(width = 0.2)

ggplot(F4, aes(year, mean.cd, color = failure, ymin = mean.cd - se, 
               ymax = mean.cd + se, group = failure)) +
  geom_point(position = pos) + geom_line(position = pos) + 
  geom_errorbar(position = pos, width = 0.2)

Note that some year/failure groups contain only one value, so the sd and SEM cannot be calculated for them (sd() returns NA).
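As an alternative to summarizing first, ggplot2 can compute the group means itself with stat_summary(). The following is only a sketch under some assumptions: it rebuilds the 11-row sample from the question as F3, and it uses the `fun` argument, which needs ggplot2 3.3.0 or later (older versions call it `fun.y`). In this small sample every year/failure group has a single observation, so the error bars come out as NA and are dropped with a warning; with the full dataset they should appear.

```r
library(ggplot2)

# Sample data rebuilt from the question (11 rows); the real F3 is much larger
F3 <- data.frame(
  year    = factor(c(6:10, 1:6), levels = 1:10),
  cd      = c(555, 511, 540, 596, 553, 142, 173, 271, 163, 108, 61),
  failure = factor(c(rep(0, 5), rep(1, 6)))
)

# stat_summary() computes the mean of cd within each year/failure group,
# so no separate aggregation step is needed
p <- ggplot(F3, aes(year, cd, color = failure, group = failure)) +
  stat_summary(fun = mean, geom = "line") +
  stat_summary(fun = mean, geom = "point", size = 2) +
  # mean_se() returns the mean plus/minus one standard error
  stat_summary(fun.data = mean_se, geom = "errorbar", width = 0.2)
p
```

This also shows why mean(cd) inside aes() gives a flat line: there it is evaluated once over the whole column, whereas stat_summary() evaluates the function per group.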

Axeman
  • This ggplot2 code is what I did first. It works here for a few cd values, but I need the mean cd for each year. There are over 3000 cd values in the full dataset. When all are plotted, it's an unreadable line chart with the cd values sort of binned by year. – JuanTamad Oct 04 '15 at 02:10
  • Your question gave different code. It is unclear to me what the problem is then. – Axeman Oct 04 '15 at 09:25
  • Doesn't look like dplyr splice can select variables by another variable. – JuanTamad Oct 04 '15 at 10:44
  • Can you access the pastebin? This is a 60-row slice http://pastebin.com/raw.php?i=SjH4u1Dq – JuanTamad Oct 04 '15 at 10:46
  • I think I need to create a new variable that ties each cd value to the year it belongs with. – JuanTamad Oct 04 '15 at 10:47
  • Ok, I changed the answer. – Axeman Oct 04 '15 at 11:38
  • Beautiful, thanks. My first attempt to get SDs with expected 'sd = sd(cd)' is not working, but I'll work on it. Sorry about the data issue, I'll have to read up on it. Source: local data frame [6 x 4] Groups: year [3] year failure cd sd (int) (int) (dbl) (dbl) 1 1 0 466.1103 NA – JuanTamad Oct 04 '15 at 20:02
  • If you wanted sd's you should have put that in the question. You're making this a lot of work for others to figure out for you. – Axeman Oct 04 '15 at 20:04
  • Sorry, guess I was thinking they go hand in hand, so doing the sd should be simple extension. – JuanTamad Oct 04 '15 at 20:06
0
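# rCharts is not on CRAN; install it from GitHub (ramnathv/rCharts)
library(rCharts)
# F3 has no `value` column, so plot cd against year, grouped by failure
x1 <- xPlot(cd ~ year, group = "failure", data = F3, type = "line-dotted")
x1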
Navin Manaswi
  • Could you please add some explanation to your answer? Code-only answers are frowned upon on SO. – honk Oct 03 '15 at 18:39
  • I think what I need is to create a new variable in the dataset that is the correct mean for cd for each of the 10-year values (cdmean.y), then plot cdmean.y against year by the failure groups. I can get the mean of cd with: F3.slice %>% group_by(year) %>% summarise (mean= mean(cd)) cdmeanbyyear How do I get a dataset like this: year cd failure cdmean.y – JuanTamad Oct 04 '15 at 03:38
  • I suppose you could make 10 sub-datasets for each year, then merge back into one? Is there an easier way? – JuanTamad Oct 04 '15 at 04:10
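There is probably no need to split the data into sub-datasets and merge them back. A minimal sketch, assuming the 11-row sample from the question as F3 and a hypothetical result name F3.means: dplyr's mutate() after group_by() attaches the group mean to every row, which yields the year / cd / failure / cdmean.y layout asked about in the comment above.

```r
library(dplyr)

# Sample data rebuilt from the question (11 rows)
F3 <- data.frame(
  year    = factor(c(6:10, 1:6), levels = 1:10),
  cd      = c(555, 511, 540, 596, 553, 142, 173, 271, 163, 108, 61),
  failure = factor(c(rep(0, 5), rep(1, 6)))
)

# mutate() keeps every row and adds the per-year mean as a new column;
# summarise() would instead collapse each group to a single row
F3.means <- F3 %>%
  group_by(year) %>%
  mutate(cdmean.y = mean(cd)) %>%
  ungroup()

F3.means
```

If the mean should differ between the failure groups, as in the plot, grouping by both variables with group_by(year, failure) would be the analogous step.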