I am trying to calculate a % yield of some data based on a subset:
library(plyr)     # for ddply
library(ggplot2)  # for ggplot
# example data set
set.seed(10)
Measurement <- rnorm(1000, 5, 2)
ID <- rep(c(1:100), each=10)
Batch <- rep(c(1:10), each=100)
df <- data.frame(Batch, ID, Measurement)
df$ID <- factor(df$ID)
df$Batch <- factor(df$Batch)
# Subset data based on measurement range
pass <- subset(df, Measurement > 6 & Measurement < 7)
# Calculate number of rows in data frame (by Batch then ID)
ac <- ddply(df, c("Batch", "ID"), nrow)
colnames(ac) <- c("Batch", "ID", "Total")
# Calculate number of rows in subset (by Batch then ID)
bc <- ddply(pass, c("Batch", "ID"), nrow)
colnames(bc) <- c("Batch", "ID", "Pass")
# Calculate yield
bc$Yield <- (bc$Pass / ac$Total) * 100
# plot yield
ggplot(bc, aes(ID, Yield, colour=Batch)) + geom_point()
My problem is that, because of my filter range (between 6 and 7), my subset (pass) has fewer rows than my data frame (df):
nrow(ac)
[1] 100
nrow(bc)
[1] 83
Therefore I cannot use
bc$Yield <- (bc$Pass / ac$Total) * 100
without getting the error:
replacement has 100 rows, data has 83
The reason I am trying to keep this generic is that my real data has varying numbers of batches and IDs (otherwise I could just divide by a constant in my yield calculation). Can anyone tell me how to put a 0 in my subset for any Batch/ID where no measurement falls within the limits (6 to 7 in this case), or point out a more elegant way of calculating yield? Thank you.
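To make the goal concrete, here is a minimal sketch of the kind of result I am after, written in base R rather than plyr. It is only one possible approach under the assumptions of the example data above, not necessarily the best one. The helper name in_range and the data frames yield, total, passed and both are made up for illustration. The first part computes the yield per group directly; the second keeps the two-count approach but uses merge to fill in a 0 for groups with no passing rows.

```r
# Rebuild the example data (same seed and column names as above)
set.seed(10)
df <- data.frame(
  Batch = factor(rep(1:10, each = 100)),
  ID = factor(rep(1:100, each = 10)),
  Measurement = rnorm(1000, 5, 2)
)

# Option 1: compute the yield per group in one step.
# The mean of a logical vector is the proportion of TRUE values,
# so a group with no passing rows gets 0 instead of being dropped.
in_range <- function(x) mean(x > 6 & x < 7) * 100  # hypothetical helper
yield <- aggregate(Measurement ~ Batch + ID, data = df, FUN = in_range)
names(yield)[names(yield) == "Measurement"] <- "Yield"

# Option 2: keep separate counts, then align them with merge.
total <- aggregate(Measurement ~ Batch + ID, data = df, FUN = length)
names(total)[3] <- "Total"
pass <- subset(df, Measurement > 6 & Measurement < 7)
passed <- aggregate(Measurement ~ Batch + ID, data = pass, FUN = length)
names(passed)[3] <- "Pass"

# all.x = TRUE keeps every Batch/ID from total; missing counts become NA
both <- merge(total, passed, by = c("Batch", "ID"), all.x = TRUE)
both$Pass[is.na(both$Pass)] <- 0
both$Yield <- both$Pass / both$Total * 100
```

Either result has one row per Batch/ID combination that occurs in df, so it could be passed to the ggplot call above in place of bc.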
Update:
str(df)
'data.frame': 1000 obs. of 3 variables:
$ Batch : Factor w/ 10 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
$ ID : Factor w/ 100 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
$ Measurement: num 5.04 4.63 2.26 3.8 5.59 ...