
I am new to R and struggling with grouping my dataset. This is an example of the data:

sample profile
1 A
2 A,B
3 A,B
4 A,C
5 C
6 A,C

I am trying to group the profiles so that the same profiles are labelled as the same group:

sample    profile  profile group/cluster
genome 1  A        1
genome 2  A,B      2
genome 3  A,B      2
genome 4  A,C      3
genome 5  C        4
genome 6  A,C      3

Here, the two samples with profile A,B share group 2 and the two samples with profile A,C share group 3.

I have tried playing around with these packages

library(tidyverse)
library(janitor)
library(stringr)

dupes <- get_dupes(database, profile)
dupes


ll_by_outcome <- as.data.frame(database %>% 
  group_by(profile) %>% 
    add_count())
ll_by_outcome

But these just find duplicates within the sample. I am not sure how to go about this issue. Any help is appreciated!

Gab

3 Answers


We could use match

library(dplyr)
library(stringr)
df1 %>%
  mutate(group = match(profile, unique(profile)),
         sample = str_c('genome ', sample))

Output:

     sample profile group
1 genome 1       A     1
2 genome 2     A,B     2
3 genome 3     A,B     2
4 genome 4     A,C     3
5 genome 5       C     4
6 genome 6     A,C     3
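
For readers who prefer not to load dplyr, the same idea can be sketched in base R. This is only an illustration that assumes the column names shown above: `unique()` lists each profile once in order of first appearance, and `match()` returns the position of each row's profile in that list.

```r
# Same sample data as in this answer, rebuilt here so the snippet runs on its own.
df1 <- data.frame(
  sample  = 1:6,
  profile = c("A", "A,B", "A,B", "A,C", "C", "A,C")
)

# unique() keeps profiles in order of first appearance;
# match() gives, for each row, the index of its profile in that vector.
df1$group  <- match(df1$profile, unique(df1$profile))
df1$sample <- paste("genome", df1$sample)

df1
```

Because the group number is just an index into `unique(profile)`, identical profile strings always receive the same number.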

data

df1 <- structure(list(sample = 1:6, profile = c("A", "A,B", "A,B", "A,C", 
"C", "A,C")), class = "data.frame", row.names = c(NA, -6L))
akrun

You can do it using factors.

With the data from @akrun's answer:

library(dplyr)
df1 %>% mutate(cluster = as.numeric(factor(profile)))
yuk
  • @akrun, I don't think `levels` are necessary here. By default, the levels are sorted unique categories. If you set the levels by `unique` they can be unsorted. So, depends on OP's preference. Not sure if it adds anything to the speed. – yuk Sep 13 '22 at 16:01
  • You are right. I was thinking if it should be a custom order. rolled back to your version – akrun Sep 13 '22 at 16:04
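
The ordering point raised in these comments is easiest to see on data that is not already sorted. The sketch below uses a made-up `profile` vector (not the OP's data) to compare first-appearance numbering with the sorted numbering that `factor()` produces by default.

```r
# Hypothetical profiles, deliberately not in alphabetical order.
profile <- c("C", "A,B", "A", "A,B")

# Numbered by first appearance (the match/unique approach).
by_appearance <- match(profile, unique(profile))

# Numbered by sorted level order (the default for factor()).
by_sorted <- as.integer(factor(profile))

# Setting levels = unique(profile) makes factor() follow first appearance too.
by_levels <- as.integer(factor(profile, levels = unique(profile)))

print(by_appearance)
print(by_sorted)
print(by_levels)
```

In the OP's example the profiles happen to appear in sorted order already, so both approaches return the same group numbers there.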

Does this work?

library(dplyr)
library(stringr)  # provides str_c()

df %>% mutate(sample = str_c('genome', sample, sep = ' ')) %>%
  group_by(profile) %>% mutate(cluster = cur_group_id())
# A tibble: 6 × 3
# Groups:   profile [4]
  sample   profile cluster
  <chr>    <chr>     <int>
1 genome 1 A             1
2 genome 2 A,B           2
3 genome 3 A,B           2
4 genome 4 A,C           3
5 genome 5 C             4
6 genome 6 A,C           3
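
Note that the result above is still grouped by `profile` (see the `# Groups:` line). If later steps should treat the rows as one ungrouped table, a possible variant is to finish with `ungroup()`. The data frame below is rebuilt from the question's example so the snippet runs on its own, and the name `out` is only illustrative.

```r
library(dplyr)

# Example data taken from the question.
df <- data.frame(
  sample  = 1:6,
  profile = c("A", "A,B", "A,B", "A,C", "C", "A,C")
)

out <- df %>%
  group_by(profile) %>%
  mutate(cluster = cur_group_id()) %>%  # one integer id per profile group
  ungroup()                             # drop the grouping for later steps

out
```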
Karthik S