
I have a data frame that looks like this:


    Chrom       Pos Ref Alt sample_id cluster_id cellular_prevalence
1   chr11  70176412   C   G  SRC125_1          0              0.5389
8   chr12  10370686   G   A  SRC125_1          0              0.5389
15  chr12  40892074   T   A  SRC125_1          0              0.5389
22  chr12  53663629   G   T  SRC125_1          0              0.5389
29  chr13 103387098   C   T  SRC125_1          0              0.5389
36  chr13  24334244   G   T  SRC125_1          0              0.5389
....
....
   Chrom       Pos Ref Alt sample_id cluster_id cellular_prevalence
1086  chr3  12531337   G   C  SRC125_1          6              0.2675
1093  chr3  12531455   G   C  SRC125_1          6              0.2675
1100  chr3  12531462   G   A  SRC125_1          6              0.2675
1107  chr5 178460018   T   A  SRC125_1          6              0.2675
1114  chr5 180048230   C   T  SRC125_1          6              0.2675

Total number of clusters:

unique(my_data$cluster_id)
0 1 2 3 4 5 6 7

I want to remove clusters that have only one mutation per sample_id and then renumber the remaining clusters to close the gap. For example, in my dataset cluster 2 has only one mutation per sample_id. I have removed it and now want to rename the clusters that follow it, so that cluster 3 becomes cluster 2, cluster 4 becomes cluster 3, cluster 5 becomes cluster 4, and so on.

How can I do it in R?

anna1335
  • Hi anna1335! Welcome to SO! You'll want to include enough data to reproduce your problem; this includes data for more than two clusters, including "cluster 2". You'll probably also want to explain what "one mutation per sample_id" means in your case and how to identify it. See more on how to make a great example (and thus get the best answers): https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example – harre Feb 22 '23 at 19:46
  • Assuming your clusters are in order, `my_data$cluster_id = match(my_data$cluster_id, unique(my_data$cluster_id)) - 1`. – Gregor Thomas Feb 22 '23 at 19:48
  • Every cluster id after the removed cluster (here cluster 2) should be reduced by one, so clusters 0 and 1 stay the same. – anna1335 Feb 22 '23 at 19:51

1 Answer

Assuming that you have already removed the clusters you don't want as you indicate in the comments, the remaining problem is solely to rename the clusters.

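For this, we could use `cur_group_id()` from dplyr on the arranged data set. The `arrange()` step matters because `.by` numbers the groups in order of first appearance, so the rows have to be sorted by cluster_id first.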

library(dplyr) # dplyr >= 1.1.0

df |>
  arrange(cluster_id) |>
  mutate(new_cluster_id = cur_group_id()-1,
         .by = "cluster_id",
         .after = "cluster_id")

Output:

# A tibble: 15 × 9
   rownumber Chrom Pos       Ref   Alt   sample_id cluster_id new_cluster_id cellular_prevalence
   <chr>     <chr> <chr>     <chr> <chr> <chr>          <dbl>          <dbl>               <dbl>
 1 1         chr11 70176412  C     G     SRC125_1           0              0               0.539
 2 8         chr12 10370686  G     A     SRC125_1           0              0               0.539
 3 15        chr12 40892074  T     A     SRC125_1           0              0               0.539
 4 22        chr12 53663629  G     T     SRC125_1           0              0               0.539
 5 29        chr13 103387098 C     T     SRC125_1           0              0               0.539
 6 36        chr13 24334244  G     T     SRC125_1           0              0               0.539
 7 00        xxxxx xxxxxxxx  X     X     XXXXXX_0           1              1               0    
 8 00        xxxxx xxxxxxxx  X     X     XXXXXX_0           3              2               0    
 9 00        xxxxx xxxxxxxx  X     X     XXXXXX_0           4              3               0    
10 00        xxxxx xxxxxxxx  X     X     XXXXXX_0           5              4               0    
11 1086      chr3  12531337  G     C     SRC125_1           6              5               0.268
12 1093      chr3  12531455  G     C     SRC125_1           6              5               0.268
13 1100      chr3  12531462  G     A     SRC125_1           6              5               0.268
14 1107      chr5  178460018 T     A     SRC125_1           6              5               0.268
15 1114      chr5  180048230 C     T     SRC125_1           6              5               0.268
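
If you would rather not depend on dplyr, the same renumbering can be sketched in base R by matching each id against the sorted unique ids. This is only a minimal sketch: the toy `my_data` below is hypothetical (cluster 2 is assumed to be already removed) and stands in for your real data.

```r
# Hypothetical toy data: cluster 2 has already been removed
my_data <- data.frame(
  sample_id  = "SRC125_1",
  cluster_id = c(0, 0, 1, 3, 4, 4, 6)
)

# Remaining ids in ascending order
old_ids <- sort(unique(my_data$cluster_id))

# Position of each id among the remaining ids, shifted to start at 0
my_data$new_cluster_id <- match(my_data$cluster_id, old_ids) - 1L

my_data
```

Because the ids are sorted before matching, this does not rely on the row order of the data frame, and clusters below the removed one keep their original numbers.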

Data (please include a minimal reproducible example, see comment):

library(readr)

df <- read_table("rownumber Chrom       Pos Ref Alt sample_id cluster_id cellular_prevalence
1   chr11  70176412   C   G  SRC125_1          0              0.5389
8   chr12  10370686   G   A  SRC125_1          0              0.5389
15  chr12  40892074   T   A  SRC125_1          0              0.5389
22  chr12  53663629   G   T  SRC125_1          0              0.5389
29  chr13 103387098   C   T  SRC125_1          0              0.5389
36  chr13  24334244   G   T  SRC125_1          0              0.5389
00  xxxxx  xxxxxxxx   X   X  XXXXXX_0          1              0.0
00  xxxxx  xxxxxxxx   X   X  XXXXXX_0          3              0.0
00  xxxxx  xxxxxxxx   X   X  XXXXXX_0          4              0.0
00  xxxxx  xxxxxxxx   X   X  XXXXXX_0          5              0.0
1086  chr3  12531337   G   C  SRC125_1         6             0.2675
1093  chr3  12531455   G   C  SRC125_1         6             0.2675
1100  chr3  12531462   G   A  SRC125_1         6             0.2675
1107  chr5 178460018   T   A  SRC125_1         6             0.2675
1114  chr5 180048230   C   T  SRC125_1         6             0.2675") 
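
The answer above assumes the removal has already happened. For completeness, here is a rough base R sketch of that step too. It rests on one reading of "only one mutation per sample_id" that the question does not confirm: a cluster is dropped when no sample has more than one row (mutation) in it. The toy data and the names `toy`, `n_mut`, `max_n` and `kept` are made up for illustration.

```r
# Hypothetical toy data: two samples, four clusters
toy <- data.frame(
  sample_id  = c("S1", "S1", "S1", "S2", "S2", "S2",
                 "S1", "S2", "S1", "S1", "S2", "S2"),
  cluster_id = c(0, 0, 1, 0, 1, 1, 2, 2, 3, 3, 3, 3)
)

# Number of mutations (rows) for each sample_id/cluster_id combination
n_mut <- ave(seq_len(nrow(toy)), toy$sample_id, toy$cluster_id, FUN = length)

# Largest per-sample count within each cluster
max_n <- ave(n_mut, toy$cluster_id, FUN = max)

# Keep clusters in which at least one sample has more than one mutation
# (assumed criterion; adjust if your definition differs)
kept <- toy[max_n > 1, ]

# Renumber the remaining clusters so the ids are consecutive from 0
kept$new_cluster_id <- match(kept$cluster_id, sort(unique(kept$cluster_id))) - 1L

kept
```

In this toy example cluster 2 has exactly one row per sample, so it is dropped and cluster 3 becomes cluster 2.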
harre