how to generate new IDs based on aggregated data in r

Question

I have a data frame which is similar to the output result in this answered questionabout aggregate data by group

but I would like to create a column for new IDs based on the count of unique cluster in DF

DF = read.table(text="name cluster count
       F00851.3     20      2
       F00851.2     20      2
         F00851     20      2
       F00851.8     20      2
       F00851.4     20      2
       F00851.5     20      2
       F00851.1     20      2
       F00851.6     21      2
       F00851.7     21      2
       F00958.2     23      1
       F00958.1     23      1
       F00958.3     23      1
         F00958     23      1
       F01404.5     28      3
         F01404     28      3
       F01404.4     28      3
       F01404.3     28      3
       F01404.6     29      3
       F01404.1     29      3
       F01404.7     29      3
       F01404.2     30      3
       F01404.8     30      3", header=T, stringsAsFactors=F)

The expected result:

result = read.table(text="name  cluster count   ID
    F00851.3    20  2   F00851.1
    F00851.2    20  2   F00851.1
    F00851  20  2   F00851.1
    F00851.8    20  2   F00851.1
    F00851.4    20  2   F00851.1
    F00851.5    20  2   F00851.1
    F00851.1    20  2   F00851.1
    F00851.6    21  2   F00851.2
    F00851.7    21  2   F00851.2
    F00958.2    23  1   F00958.1
    F00958.1    23  1   F00958.1
    F00958.3    23  1   F00958.1
    F00958  23  1   F00958.1
    F01404.5    28  3   F01404.1
    F01404  28  3   F01404.1
    F01404.4    28  3   F01404.1
    F01404.3    28  3   F01404.1
    F01404.6    29  3   F01404.2
    F01404.1    29  3   F01404.2
    F01404.7    29  3   F01404.2
    F01404.2    30  3   F01404.3
    F01404.8    30  3   F01404.3", header=T, stringsAsFactors=F)

In my case, the group is substr(DF$name,1,6). So the new column ID should be substr(DF$name,1,6) plus extension separated by dot. the extension number is the series number of unique values in cluster column for each group.
Appreciate any helps.

www · Answer 1 · 2017-08-11T20:22:45.040

3

DF2 is the final output.

library(dplyr)
library(tidyr)

DF2 <- DF %>%
  separate(name, into = c("name1", "number1"), remove = FALSE) %>%
  group_by(name1) %>%
  mutate(number2 = dense_rank(cluster)) %>%
  unite(ID, name1, number2, sep = ".", remove = FALSE) %>%
  ungroup() %>%
  select(name, cluster, count, ID)

edited Aug 11 '17 at 20:22

answered Aug 11 '17 at 20:12

www

38,575
12
48
84

1

I think you can use `dplyr::dense_rank` instead of `data.table::rleid` to stay in the tidyverse – cderv Aug 11 '17 at 20:21
1

@cderv I did not know this. Thanks for the comment! I have updated the answer. – www Aug 11 '17 at 20:23

score 3 · Accepted Answer · answered Aug 11 '17 at 20:21

We can do this easily with data.table. Convert the 'data.frame' to 'data.table' (setDT(df)), grouped by 'count' get the run-length-id of 'cluster' (rleid) and paste it with the substring created either with sub or using substr(name, 1, 6) to create the 'ID' column

library(data.table)
setDT(DF)[, ID := paste(sub("\\..*$", "", name), rleid(cluster), sep="."), count]
DF
#        name cluster count       ID
# 1: F00851.3      20     2 F00851.1
# 2: F00851.2      20     2 F00851.1
# 3:   F00851      20     2 F00851.1
# 4: F00851.8      20     2 F00851.1
# 5: F00851.4      20     2 F00851.1
# 6: F00851.5      20     2 F00851.1
# 7: F00851.1      20     2 F00851.1
# 8: F00851.6      21     2 F00851.2
# 9: F00851.7      21     2 F00851.2
#10: F00958.2      23     1 F00958.1
#11: F00958.1      23     1 F00958.1
#12: F00958.3      23     1 F00958.1
#13:   F00958      23     1 F00958.1
#14: F01404.5      28     3 F01404.1
#15:   F01404      28     3 F01404.1
#16: F01404.4      28     3 F01404.1
#17: F01404.3      28     3 F01404.1
#18: F01404.6      29     3 F01404.2
#19: F01404.1      29     3 F01404.2
#20: F01404.7      29     3 F01404.2
#21: F01404.2      30     3 F01404.3
#22: F01404.8      30     3 F01404.3

BENY · Answer 3 · 2017-08-11T21:08:57.420

Just using dplyr

DF$name1=substr(DF$name,1,6)
DF1=DF%>%group_by(name1,cluster)%>%dplyr::mutate(Row=ifelse(row_number()==1,1,0))%>%ungroup()
DF1=DF1%>%group_by(name1)%>%arrange(cluster)%>%dplyr::mutate(ID=paste0(name1,'.',cumsum(Row)))%>%select(name, cluster, count, ID)
   # A tibble: 22 x 5
# Groups:   name1 [3]
    name1     name cluster count       ID
    <chr>    <chr>   <int> <int>    <chr>
 1 F00851 F00851.3      20     2 F00851.1
 2 F00851 F00851.2      20     2 F00851.1
 3 F00851   F00851      20     2 F00851.1
 4 F00851 F00851.8      20     2 F00851.1
 5 F00851 F00851.4      20     2 F00851.1
 6 F00851 F00851.5      20     2 F00851.1
 7 F00851 F00851.1      20     2 F00851.1
 8 F00851 F00851.6      21     2 F00851.2
 9 F00851 F00851.7      21     2 F00851.2
10 F00958 F00958.2      23     1 F00958.1

You need `ungroup()` before the last `select` call to drop `name1`. Other than that, this is a perfect solution. — www, Aug 11 '17 at 21:07

how to generate new IDs based on aggregated data in r

3 Answers3