I have asked a similar question in Python (How to create column of ascending values based on unique values in another column in pandas), and got the script working, but for various reasons I need to achieve the same thing in R now. I am also adding complexity here of wanting to be able to add new batches of data to the dataset periodically.
I have a list of samples which have unique sample ID numbers ("Sample_ID"). Each row of the dataset is a sample. Some samples are duplicated multiple times. I want to create a new set of sample names ("Sample_code") that ascends up from 1 as you go down the rows using a prefix (e.g. "SAMP00001", "SAMP00002" etc). I want the order of rows to be preserved (as they are roughly in date order of sample collection). And for duplicated samples, I want the number given for Sample_code to correspond to the first row that sample ID appears in, not rows further down the table (which came later in sample collection).
My starting data is illustrated with df1:
# df1
Sample_ID <- c('123123','123456','123123','123789')
Date <- c('15/06/2019', '23/06/2019', '30/06/2019', '07/07/2019')
Variable <- c("blue","red","red","blue")
Batch <- 1
df1 <- data.frame(Sample_ID, Date, Variable, Batch)
df1
I want to create the Sample_code column shown in df1b:
# df1b
Sample_ID <- c('123123','123456','123123','123789')
Date <- c('15/06/2019', '23/06/2019', '30/06/2019', '07/07/2019')
Variable <- c("blue","red","red","blue")
Batch <- 1
Sample_code <- c('SAMP0001', 'SAMP0002', 'SAMP0001', 'SAMP0003')
df1b <- data.frame(Sample_ID, Date, Variable, Batch, Sample_code)
df1b
I would save df1b at this point and those Sample_code names used for downstream processing. The added complexity comes because I will then collect a new batch of samples - let's call it df2 (Batch 2 samples):
# df2
Sample_ID <- c('456789', '123654', '123123', '123789', '121212')
Date <- c('15/07/2019', '31/07/2019', '12/08/2019', '27/08/2019', '31/08/2019')
Variable <- c("blue", "red","blue", "red", "red")
Batch <- 2
df2 <- data.frame(Sample_ID, Date, Variable, Batch)
df2
I want to rbind df2 to the bottom of df1, and generate more Sample_code names for the new rows. Importantly, the new Sample_code names need to take account of any Sample_ID duplicates that were present in df1, but also not change any of the Sample_code names that were already assigned back when I only had df1. The result at this point would be df2b, below:
# df2b
Sample_ID <- c('123123','123456','123123','123789','456789', '123654', '123123', '123789', '121212')
Date <- c('15/06/2019', '23/06/2019', '30/06/2019', '07/07/2019', '15/07/2019', '31/07/2019', '12/08/2019', '27/08/2019', '31/08/2019')
Variable <- c("blue","red","red","blue","blue", "red","blue", "red", "red")
Batch <- c(1,1,1,1,2,2,2,2,2)
Sample_code <- c('SAMP0001', 'SAMP0002', 'SAMP0001', 'SAMP0003', 'SAMP0004', 'SAMP0005', 'SAMP0001', 'SAMP0003', 'SAMP0006')
df2b <- data.frame(Sample_ID, Date, Variable, Batch, Sample_code)
df2b
And then I would add Batch 3 samples in the same way etc etc.
I appreciate there are at least 2 stages to this problem: 1) Producing an ascending list of Sample_code names using unique Sample_ID values; and 2) Building in an iterative way of adding batches of samples. But because the second point impacts on the functionality I want for the Sample_code names I have included both stages here.
Lastly - ideally I want to only use base R and tidyverse packages for this.
Any help much appreciated! Thanks.