Combine or Sum rows based on partial match and other rules

Question

I have a dataframe df1:

df1 <- data.frame(
  Lot = c("13VC011","13VC018","13VC011A","13VC011B","13VC018A","13VC018C","13VC018B"),
  Date = c("2013-07-12","2013-07-11","2013-07-13","2013-07-14","2013-07-16","2013-07-18","2013-07-19"),
  Step = c("A","A","B","B","C","C","C"),
  kg = c(31,32,14,16,10,11,10))

Sometimes at a particular 'Step' a 'Lot' gets split into A,B or C as indicated. I'd like to sum those and get a dataframe that tells me the total kg at each step, for each lot.

For example the output should look like this:

df2 <- data.frame(
  Lot = c("13VC011","13VC011","13VC018","13VC018"),
  Step = c("A","B","A","C"),
  kg = c(31,30,32,31))

So there are two requirements. If the 'Lot' matches, regardless of the trailing letter, and the step matches, then the sum occurs. If both conditions are not satisfied, then just carry over the line item as is into df2.

Part2: So I would like to introduce a 3rd requirement. In some cases, the Lot was split in two or 3 parts, however not all the data is present. In this case, using these solutions masks this and makes it appear that one lot has much lower kg than it actually has.

What I would like to do is find a way to indicate if the dataset contains 13VC011A for example, but no 13VC011B. Or if we see a 'B' but no 'A' or a 'C' but no 'B' or 'A'.

So now the original dataframe is:

df1 <- data.frame(
  Lot = c("13VC011","13VC018","13VC011A","13VC011B","13VC018A","13VC018C","13VC018B","13VC020B"),
  Date = c("2013-07-12","2013-07-11","2013-07-13","2013-07-14","2013-07-16","2013-07-18","2013-07-19","2013-07-22"),
  Step = c("A","A","B","B","C","C","C","B"),
  kg = c(31,32,14,16,10,11,10,18))

And the resultant df2 should look something like:

df2 <- data.frame(
  Lot = c("13VC011","13VC011","13VC018","13VC018","13VC020B"),
  Step = c("A","B","A","C","B"),
  kg = c(31,30,32,31,18),
  Partial = c(F,F,F,F,T))

Thanks for both answers, I will evaluate which works best after some sleep. Creative use of aggregate. — variable, Jul 23 '14 at 05:58

akrun · Answer 1 · 2014-07-23T09:14:41.093

   df1$Lot <- gsub("[[:alpha:]]$","",df1$Lot) #replace the character element at the end of string with `""`
   aggregate(kg~Lot+Step,df1, FUN=sum)
   #    Lot Step kg
 #1 13VC011    A 31
 #2 13VC011    B 30
 #3 13VC018    A 32
 #4 13VC018    C 31

Or using dplyr

  library(stringr)
  library(dplyr)
  df1%>%
  group_by(Lot=str_extract(Lot,perl('.*\\d(?=[A-Z]?$)')), Step) %>%
  summarize(kg=sum(kg))
   #Source: local data frame [4 x 3]
   #Groups: Lot

   #     Lot Step kg
   #1 13VC011    A 31
   #2 13VC011    B 30
   #3 13VC018    A 32
   #4 13VC018    C 31

Explanation

regex

.* : select more than one element

\\d :followed by a digit

(?=[A-Z]?$) : and lookahead for character elements or (?) not at the $ end of string.

`

score 1 · Answer 2 · answered Jul 23 '14 at 05:36

> aggregate(kg ~Lot + Step, data=df1, FUN=sum)
       Lot Step kg
1  13VC011    A 31
2  13VC018    A 32
3 13VC011A    B 14
4 13VC011B    B 16
5 13VC018A    C 10
6 13VC018B    C 10
7 13VC018C    C 11

At that point I finally understood what you meant by "regardless of the trailing letter" and wondered if the formula method of aggregate could handle an R-function in one of the terms:

> aggregate(kg ~substr(Lot,1,7) + Step, data=df1, FUN=sum)
  substr(Lot, 1, 7) Step kg
1           13VC011    A 31
2           13VC018    A 32
3           13VC011    B 30
4           13VC018    C 31

Combine or Sum rows based on partial match and other rules

2 Answers2

Explanation

Linked