
I have a data.frame that looks like this:

   Sample_long_name        Cluster     Sample_shortname
  S1_AAACCCAAGAGCCTGA        4         S1
  S1_AAACCCAAGCTTAAGA        4         S1
  S1_AAACCCACACGGCGTT        3         S1
  S2_AAACCCACACTACCGG        3         S2
  S3_AAACCCACAGCTGAGA        3         S3
  S3_AAACCCACATAGATGA        1         S3

I would like the following output:

   Sample_long_name        Cluster     Sample_shortname
  S1_AAACCCAAGAGCCTGA        4         Cl4_cell1
  S1_AAACCCAAGCTTAAGA        4         Cl4_cell2
  S1_AAACCCACACGGCGTT        3         Cl3_cell1
  S2_AAACCCACACTACCGG        3         Cl3_cell2
  S3_AAACCCACAGCTGAGA        3         Cl3_cell3
  S3_AAACCCACATAGATGA        1         Cl1_cell1
  .......................

In other words, I would like to number the cells within each cluster. The order of the rows in the first column does not matter. In total I have about 30,000 cells across around 12 clusters.

Bfu38

1 Answer


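We can build the label with paste or str_c, using a sequence number that restarts within each cluster. Here rowid from data.table supplies that per-cluster sequence.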

library(dplyr)
library(stringr)
library(data.table)
df1 %>%
   mutate(Sample_shortname = str_c("Cl", Cluster, "_cell", rowid(Cluster)))
#     Sample_long_name Cluster Sample_shortname
#1 S1_AAACCCAAGAGCCTGA       4        Cl4_cell1
#2 S1_AAACCCAAGCTTAAGA       4        Cl4_cell2
#3 S1_AAACCCACACGGCGTT       3        Cl3_cell1
#4 S2_AAACCCACACTACCGG       3        Cl3_cell2
#5 S3_AAACCCACAGCTGAGA       3        Cl3_cell3
#6 S3_AAACCCACATAGATGA       1        Cl1_cell1
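
If loading data.table only for rowid is undesirable, the same per-cluster counter can probably be obtained with dplyr alone. The following is a minimal sketch, assuming a small stand-in data frame (here called `cells`, a hypothetical name) with the same columns as the question; `group_by` plus `row_number()` is swapped in for `rowid`.

```r
library(dplyr)

# Hypothetical stand-in for the question's data (same columns, six rows)
cells <- data.frame(
  Sample_long_name = c("S1_AAACCCAAGAGCCTGA", "S1_AAACCCAAGCTTAAGA",
                       "S1_AAACCCACACGGCGTT", "S2_AAACCCACACTACCGG",
                       "S3_AAACCCACAGCTGAGA", "S3_AAACCCACATAGATGA"),
  Cluster = c(4L, 4L, 3L, 3L, 3L, 1L),
  Sample_shortname = c("S1", "S1", "S1", "S2", "S3", "S3")
)

out <- cells %>%
  group_by(Cluster) %>%                       # counter restarts per cluster
  mutate(Sample_shortname = paste0("Cl", Cluster, "_cell", row_number())) %>%
  ungroup()                                   # drop grouping; row order is unchanged

out$Sample_shortname
```

The grouped mutate keeps the original row order, so the labels line up with the rows as in the output shown above.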

Or using base R

# ave() applies seq_along within each Cluster, giving a per-cluster counter
df1$Sample_shortname <- with(df1, sprintf("Cl%d_cell%d", 
      Cluster, ave(seq_along(Cluster), Cluster, FUN = seq_along)))
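
To see what the per-cluster counter in the base R approach looks like on its own, it can be computed on a bare vector. This is a small sketch using the cluster values from the example data; `cl` and `idx` are illustrative names, not part of the original answer.

```r
# Cluster column from the example data
cl <- c(4L, 4L, 3L, 3L, 3L, 1L)

# ave() splits the first argument by cl, applies FUN to each piece,
# and puts the results back in the original positions
idx <- ave(seq_along(cl), cl, FUN = seq_along)
idx
# [1] 1 2 1 2 3 1

sprintf("Cl%d_cell%d", cl, idx)
```

Note that FUN must be the function itself (`seq_along`), not the result of calling it.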

data

df1 <- structure(list(Sample_long_name = c("S1_AAACCCAAGAGCCTGA", 
     "S1_AAACCCAAGCTTAAGA", 
"S1_AAACCCACACGGCGTT", "S2_AAACCCACACTACCGG", "S3_AAACCCACAGCTGAGA", 
"S3_AAACCCACATAGATGA"), Cluster = c(4L, 4L, 3L, 3L, 3L, 1L), 
    Sample_shortname = c("S1", "S1", "S1", "S2", "S3", "S3")),
    class = "data.frame", row.names = c(NA, 
-6L))
akrun