0

I am using R to create size frequency histograms for diseased and healthy individuals with fitted normal distribution lines. I have 2 issues that I'm seeking advice on.

  1. How do I create a histogram from aggregated data? The example table below has the summarized number of diseased and healthy individuals within each size.

dput(data)

'structure(list(Size = c(25L, 28L, 31L, 45L, 60L), diseased = c(0L, 
22L, 10L, 5L, 2L), healthy = c(55L, 40L, 15L, 7L, 2L)), .Names = c("Size", 
"diseased", "healthy"), class = "data.frame", row.names = c(NA, 
-5L))'

2.How do I overlay both histograms into 1 figure with fitted normal distribution lines.

I have tried the following code for aggregated data ggplot(data,aes(x=Size,y=diseased))+geom_bar(stat='identity'), which works well, but I can't figure out how to add the histogram for the healthy individuals.

I have also tried using the following text to revert the summarized data (called "data") to the original raw format: raw <- data[rep(1:data, times=data$diseased), "Size", drop=FALSE]

I get the following error message: Error in rep(1:data, times=data$diseased) : invalid 'times' argument. From previous comments, it appears that the rep function can't handle "0"

Victor Zakharov
  • 25,801
  • 18
  • 85
  • 151
  • 1
    Does this post help? http://stackoverflow.com/questions/3541713/how-to-plot-two-histograms-together-in-r or maybe better this: http://stackoverflow.com/questions/6957549/overlaying-histograms-with-ggplot2-in-r – jasonflaherty Feb 25 '13 at 17:15
  • Can you make your data reproducible by showing the result of `dput(data)` or `dput(head(data))`? Also, how can your columns have different numbers of rows? – David Robinson Feb 25 '13 at 17:16
  • @DavidRobinson, I believe it is an exact duplicate of the post linked by buildakicker. – Arun Feb 25 '13 at 17:21
  • @Arun: It's possible, though I'm a bit unsure without more clarification from the OP. – David Robinson Feb 25 '13 at 17:23
  • yes, possibly. number of values in `Size` don't match that of `disease` and `healthy`. And the OP has given a data.frame as input in `ggplot` but taken the trouble to provide separate data... – Arun Feb 25 '13 at 17:25
  • Thank you for your input. Here are the results of dput(data): structure(list(Size = c(25L, 28L, 31L, 45L, 60L), diseased = c(0L, 22L, 10L, 5L, 2L), healthy = c(55L, 40L, 15L, 7L, 2L)), .Names = c("Size", "diseased", "healthy"), class = "data.frame", row.names = c(NA, -5L)) – user2108048 Feb 25 '13 at 20:11

1 Answers1

0

So, I'm in a hurry and I kind of hacked together the normal curve, but you can use this to plot two "histogram-style" plots on top of each other.

It would be easier to get the curves if we had the full data set and not just summaries, of course. I kind of fudged them together, but I think it's enough to get the general idea here.

I'm not totally clear on why you would want to do this, but you can...

library(SDMTools) # Use this to get weighted means

testdata <- structure(list(Size=c(25L, 28L, 31L, 45L, 60L),
                           diseased=c(0L, 22L, 10L, 5L, 2L),
                           healthy=c(55L, 40L, 15L, 7L, 2L)),
                      .Names = c("Size", "diseased", "healthy"),
                      class = "data.frame",
                      row.names = c(NA, -5L))

barplot(testdata$healthy,
        names.arg=paste("                 ",testdata$Size),
        col="light blue",
        border="blue",
        xlim=c(0,6),
        ylim=c(0,70),
        width=0.5,
        space=1)
par(new=TRUE)
barplot(testdata$diseased,
        col="pink",
        border="red",
        xlim=c(0,6),
        ylim=c(0,70),
        width=0.5,
        space=c(2,1,1,1,1))

healthy_mean <- wt.mean(x=testdata$healthy,wt=testdata$Size)
healthy_sd <- wt.sd(x=testdata$healthy,wt=testdata$Size)
diseased_mean <- wt.mean(x=testdata$diseased,wt=testdata$Size)
diseased_sd <- wt.sd(x=testdata$diseased,wt=testdata$Size)

yfit_healthy <- as.data.frame(dnorm(0:max(testdata$healthy),
                              mean=healthy_mean,sd=healthy_sd))
names(yfit_healthy) <- "y"
yfit_diseased <- as.data.frame(dnorm(0:max(testdata$diseased),
                               mean=diseased_mean,sd=diseased_sd))
names(yfit_diseased) <- "y"

yfit_healthy$x <- seq(0,6,length.out=length(yfit_healthy$y))
yfit_diseased$x <- seq(0,6,length.out=length(yfit_diseased$y))

lines(yfit_healthy$x,
      (max(testdata$healthy)*yfit_healthy$y)/max(yfit_healthy$y),
      col="blue",lwd=2)

lines(yfit_diseased$x,
      (max(testdata$diseased)*yfit_diseased$y)/max(yfit_diseased$y),
      col="red",lwd=2)

This code gets me:

Not Exactly My Finest Graph

TARehman
  • 6,659
  • 3
  • 33
  • 60