
I am trying to calculate a % yield of some data based on a subset:

library(plyr)     # provides ddply()
library(ggplot2)  # provides ggplot()

# example data set
set.seed(10)
Measurement <- rnorm(1000, 5, 2)
ID <- rep(c(1:100), each=10)
Batch <- rep(c(1:10), each=100)

df <- data.frame(Batch, ID, Measurement)

df$ID <- factor(df$ID)
df$Batch <- factor(df$Batch)

# Subset data based on measurement range

pass <- subset(df, Measurement > 6 & Measurement < 7)

# Calculate number of rows in data frame (by Batch then ID)

ac <- ddply(df, c("Batch", "ID"), nrow)
colnames(ac) <- c("Batch", "ID", "Total")

# Calculate number of rows in subset (by Batch then ID)

bc <- ddply(pass, c("Batch", "ID"), nrow)
colnames(bc) <- c("Batch", "ID", "Pass")

# Calculate yield 

bc$Yield <- (bc$Pass / ac$Total) * 100

# plot yield

ggplot(bc, aes(ID, Yield, colour=Batch)) + geom_point()

My problem is that, because of my filter range (between 6 and 7), my subset (pass) has fewer rows than my data frame (df), so some Batch/ID groups are missing from bc entirely:

nrow(ac)
[1] 100

nrow(bc)
[1] 83

Therefore I cannot use

    bc$Yield <- (bc$Pass / ac$Total) * 100

without getting the error

replacement has 100 rows, data has 83

The reason I am trying to keep it generic is that my real data has varying numbers of batches and IDs (otherwise I could just divide by a constant in my yield calculation). Can anyone tell me how to put a 0 in my subset for groups where no data falls inside the limits (6 to 7 in this case), or point out a more elegant way of calculating yield? Thank you.
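One possible way to get the zeros asked for here, sketched in base R: left-join the pass counts onto the totals with `merge(..., all.x = TRUE)` and replace the resulting `NA`s with 0. The sketch rebuilds the counts with `aggregate()` so that it runs without plyr; the name `yield_df` is made up for illustration and is not part of the original code.

```r
# Rebuild the example data from the question
set.seed(10)
df <- data.frame(
  Batch = factor(rep(1:10, each = 100)),
  ID = factor(rep(1:100, each = 10)),
  Measurement = rnorm(1000, 5, 2)
)

# Rows per Batch/ID in the full data and in the filtered subset
ac <- aggregate(Measurement ~ Batch + ID, data = df, FUN = length)
names(ac)[3] <- "Total"
pass <- subset(df, Measurement > 6 & Measurement < 7)
bc <- aggregate(Measurement ~ Batch + ID, data = pass, FUN = length)
names(bc)[3] <- "Pass"

# Left join keeps every Batch/ID from ac; groups with no passing rows get NA
yield_df <- merge(ac, bc, by = c("Batch", "ID"), all.x = TRUE)
yield_df$Pass[is.na(yield_df$Pass)] <- 0   # hypothetical zero-fill step
yield_df$Yield <- yield_df$Pass / yield_df$Total * 100
```

Because the join is keyed on Batch and ID rather than on row position, it should not matter how many batches or IDs the real data contains.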

Update:

str(df)

'data.frame':   1000 obs. of  3 variables:
 $ Batch      : Factor w/ 10 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ ID         : Factor w/ 100 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ Measurement: num  5.04 4.63 2.26 3.8 5.59 ...
Pete900
  • Do you want `bc$Yield` to be 83 for all rows in `bc`? – Alexey Ferapontov May 26 '15 at 13:20
  • `ac$Total` is always equal to `nrow(df)`. Does that match the structure of your data? If so, you can try `bc$Yield <- (bc$Pass / nrow(df)) * 100` – Pierre L May 26 '15 at 13:31
  • Alexey, I want bc$Yield to be the number of rows in bc split by Batch then ID, divided by the number of rows in ac split by Batch then ID. platfort, nrow(df) = 1000, whereas for the yield I need it split by Batch then ID. There is a total of 10 measurements for each ID, which means a 100% yield would be 10/10. – Pete900 May 26 '15 at 13:41
  • I've added some code to show the structure of the df and pass. – Pete900 May 26 '15 at 13:44

1 Answer


I think this is what you want. I've done it using dplyr's group_by and summarize here.

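For each Batch/ID it calculates the number of observations, the number of observations where Measurement is strictly between 6 and 7, and the ratio of the two. Because the count is taken inside each group, groups with no values in range get 0 instead of being dropped.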

library(dplyr)

# example data set
set.seed(10)
Measurement <- rnorm(1000, 5, 2)
ID <- rep(c(1:100), each=10)
Batch <- rep(c(1:10), each=100)

df <- data.frame(Batch, ID, Measurement)

df$ID <- factor(df$ID)
df$Batch <- factor(df$Batch)

# Count the values that fall strictly between 6 and 7

countFunc <- function(x) sum(x > 6 & x < 7)

# Calculate number of rows, rows that meet criteria, and yield.

totals <- df %>% group_by(Batch, ID) %>%
  summarize(total = length(Measurement), x = countFunc(Measurement)) %>%
  mutate(yield = x/total) %>%
  as.data.frame()
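The same idea can also be sketched in base R, in case dplyr is not available. This is not part of the answer's original code: it flags each row as in or out of range and takes the per-group mean of that logical flag, which is the fraction of passing rows. The names `in_range` and `yield_base` are made up for illustration.

```r
# Rebuild the example data
set.seed(10)
df <- data.frame(
  Batch = factor(rep(1:10, each = 100)),
  ID = factor(rep(1:100, each = 10)),
  Measurement = rnorm(1000, 5, 2)
)

# TRUE where the measurement lies strictly between 6 and 7
df$in_range <- df$Measurement > 6 & df$Measurement < 7

# The mean of a logical vector is the share of TRUE values; scale to percent
yield_base <- aggregate(in_range ~ Batch + ID, data = df,
                        FUN = function(v) mean(v) * 100)
names(yield_base)[3] <- "Yield"

nrow(yield_base)            # one row per Batch/ID that occurs in df
sum(yield_base$Yield == 0)  # groups with no measurement in range
```

No rows are filtered out before grouping, so every Batch/ID keeps its row and the two counts can never have mismatched lengths.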
jrdnmdhl
  • Lovely, thank you! I only had to add (x/total)*100 for the % bit. This is ultimately going to be used in a reactive function for a shiny app – Pete900 May 26 '15 at 13:57