How to group duplicate strings in a dataframe in R?

Question

I am new to R and struggling with grouping my dataset. This is an example of the data:

sample	profile
1	A
2	A,B
3	A,B
4	A,C
5	C
6	A,C

I am trying to group the profiles so that the same profiles are labelled as the same group:

sample	profile	profile group/cluster
genome 1	A	1
genome 2	A,B	2
genome 3	A,B	2
genome 4	A,C	3
genome 5	C	4
genome 6	A,C	3

Here, samples 2 and 3 (both A,B) share group 2, and samples 4 and 6 (both A,C) share group 3.

I have tried playing around with these packages:

library(tidyverse)
library(janitor)
library(stringr)

dupes <- get_dupes(database, profile)
dupes


ll_by_outcome <- as.data.frame(database %>% 
  group_by(profile) %>% 
    add_count())
ll_by_outcome

But these only find or count the duplicated profiles; they don't give each distinct profile its own group number. I am not sure how to go about this. Any help is appreciated!

score 1 · Answer 1 · answered Sep 13 '22 at 15:47

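We could use match(). It returns, for each profile, its position among the unique profiles, so groups are numbered in order of first appearance: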

library(dplyr)
library(stringr)
df1 %>% 
  mutate(group = match(profile, unique(profile)), 
         sample = str_c('genome ', sample))

Output:

     sample profile group
1 genome 1       A     1
2 genome 2     A,B     2
3 genome 3     A,B     2
4 genome 4     A,C     3
5 genome 5       C     4
6 genome 6     A,C     3

Data:

df1 <- structure(list(sample = 1:6, profile = c("A", "A,B", "A,B", "A,C", 
"C", "A,C")), class = "data.frame", row.names = c(NA, -6L))
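
For readers who would rather not load dplyr, the same idea can be sketched in base R. This is only an illustration of the match() approach above, not a different method; paste() stands in for str_c(), and the data is rebuilt with data.frame() so the snippet runs on its own.

```r
# Sample data copied from the answer above.
df1 <- data.frame(
  sample  = 1:6,
  profile = c("A", "A,B", "A,B", "A,C", "C", "A,C")
)

# unique() keeps profiles in order of first appearance;
# match() returns each row's position in that vector.
df1$group <- match(df1$profile, unique(df1$profile))

# Optional relabelling, mirroring the str_c() call above.
df1$sample <- paste("genome", df1$sample)

print(df1)
```

Because the numbering follows first appearance, reordering the rows can change which number a profile gets, although rows with the same profile always share a number.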

score 1 · Answer 2 · edited Sep 13 '22 at 16:03

You can do it using factors.

With the data from @akrun's answer:

library(dplyr)
df1 %>% mutate(cluster = as.numeric(factor(profile)))

answered Sep 13 '22 at 15:55 by yuk · edited Sep 13 '22 at 16:03 by akrun

@akrun, I don't think `levels` are necessary here. By default, the levels are sorted unique categories. If you set the levels by `unique` they can be unsorted. So, depends on OP's preference. Not sure if it adds anything to the speed. – yuk Sep 13 '22 at 16:01
You are right. I was thinking if it should be a custom order. rolled back to your version – akrun Sep 13 '22 at 16:04
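
The difference discussed in these comments can be seen on a small example. The vector below is hypothetical (it is not the OP's data) and was chosen so that order of first appearance differs from sorted order; with the OP's sample data both variants happen to give the same numbers.

```r
# Hypothetical vector whose order of appearance differs from sorted order.
profile <- c("C", "A,B", "A", "A,B")

# Default levels are the sorted unique values: "A", "A,B", "C".
by_sorted <- as.integer(factor(profile))

# Levels set by unique() follow first appearance: "C", "A,B", "A".
by_appearance <- as.integer(factor(profile, levels = unique(profile)))

print(by_sorted)      # 3 2 1 2
print(by_appearance)  # 1 2 3 2
```

The second variant gives the same numbering as match(profile, unique(profile)), so the choice comes down to whether group numbers should follow sorted order or order of appearance.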

score 0 · Answer 3 · answered Sep 13 '22 at 15:51

Does this work:

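library(dplyr)
library(stringr)  # str_c() comes from stringr, not dplyr

# df1 is the data frame defined in the data block of Answer 1
df1 %>%
  mutate(sample = str_c('genome', sample, sep = ' ')) %>%
  group_by(profile) %>%
  mutate(cluster = cur_group_id())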
# A tibble: 6 × 3
# Groups:   profile [4]
  sample   profile cluster
  <chr>    <chr>     <int>
1 genome 1 A             1
2 genome 2 A,B           2
3 genome 3 A,B           2
4 genome 4 A,C           3
5 genome 5 C             4
6 genome 6 A,C           3
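
As the "# Groups: profile [4]" line shows, the result above is still grouped by profile. If an ungrouped result is preferred for later steps, one possible follow-up is to add ungroup() at the end. This is a sketch assuming dplyr 1.0.0 or later (where cur_group_id() is available); the data frame is rebuilt here so the snippet runs on its own, and the name `res` is only illustrative.

```r
library(dplyr)

# Sample data matching the question.
df1 <- data.frame(
  sample  = 1:6,
  profile = c("A", "A,B", "A,B", "A,C", "C", "A,C")
)

res <- df1 %>%
  group_by(profile) %>%
  mutate(cluster = cur_group_id()) %>%  # one integer id per distinct profile
  ungroup()                             # drop the grouping afterwards

print(res)
```

Note that group_by() orders its groups by sorted profile value, so this numbering follows sorted order rather than order of first appearance.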
