Index rows based on the content of a column of a data.frame

Question

I have a data.frame that looks like this:

   Sample_long_name        Cluster     Sample_shortname
  S1_AAACCCAAGAGCCTGA        4         S1
  S1_AAACCCAAGCTTAAGA        4         S1
  S1_AAACCCACACGGCGTT        3         S1
  S2_AAACCCACACTACCGG        3         S2
  S3_AAACCCACAGCTGAGA        3         S3
  S3_AAACCCACATAGATGA        1         S3

I would like the following output:

   Sample_long_name        Cluster     Sample_shortname
  S1_AAACCCAAGAGCCTGA        4         Cl4_cell1
  S1_AAACCCAAGCTTAAGA        4         Cl4_cell2
  S1_AAACCCACACGGCGTT        3         Cl3_cell1
  S2_AAACCCACACTACCGG        3         Cl3_cell2
  S3_AAACCCACAGCTGAGA        3         Cl3_cell3
  S3_AAACCCACATAGATGA        1         Cl1_cell1
  .......................

In other words, I would like to number the cells within each cluster. The row order in the first column does not matter. In total I have about 30,000 cells across roughly 12 clusters.


score 1 · Accepted Answer · answered Jul 05 '20 at 18:44

We can build the label with paste or str_c, using a per-cluster sequence number (rowid from data.table) as the cell index

library(dplyr)
library(stringr)
library(data.table)  # provides rowid(), a running count within each group
df1 %>%
   mutate(Sample_shortname = str_c("Cl", Cluster, "_cell", rowid(Cluster)))
#     Sample_long_name Cluster Sample_shortname
#1 S1_AAACCCAAGAGCCTGA       4        Cl4_cell1
#2 S1_AAACCCAAGCTTAAGA       4        Cl4_cell2
#3 S1_AAACCCACACGGCGTT       3        Cl3_cell1
#4 S2_AAACCCACACTACCGG       3        Cl3_cell2
#5 S3_AAACCCACAGCTGAGA       3        Cl3_cell3
#6 S3_AAACCCACATAGATGA       1        Cl1_cell1
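
If loading data.table only for rowid() is undesirable, the same per-cluster counter can be sketched with dplyr alone. This is a variant under assumptions, not part of the original answer: it uses group_by() with row_number(), and the name res is purely illustrative. The data frame is rebuilt inline so the block runs on its own.

```r
library(dplyr)

# Same example data as in the question (rebuilt here so the block is self-contained)
df1 <- data.frame(
  Sample_long_name = c("S1_AAACCCAAGAGCCTGA", "S1_AAACCCAAGCTTAAGA",
                       "S1_AAACCCACACGGCGTT", "S2_AAACCCACACTACCGG",
                       "S3_AAACCCACAGCTGAGA", "S3_AAACCCACATAGATGA"),
  Cluster = c(4L, 4L, 3L, 3L, 3L, 1L),
  Sample_shortname = c("S1", "S1", "S1", "S2", "S3", "S3")
)

# row_number() restarts at 1 inside each Cluster group
res <- df1 %>%
  group_by(Cluster) %>%
  mutate(Sample_shortname = paste0("Cl", Cluster, "_cell", row_number())) %>%
  ungroup()
```

The original row order is kept; only Sample_shortname is overwritten. Note that the result is a tibble rather than a plain data.frame.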

Or using base R

df1$Sample_shortname <- with(df1, sprintf("Cl%d_cell%d", 
      Cluster, ave(seq_along(Cluster), Cluster, FUN = seq_along)))
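
To see where the within-cluster counter comes from in base R, here is a small sketch on a bare vector. The variable names cluster, idx and labels are illustrative only. ave() splits its first argument by the grouping vector, applies FUN to each piece, and puts the results back in the original positions; FUN has to be a function such as seq_along, not the result of calling it.

```r
# Cluster assignments for six cells, as in the example data
cluster <- c(4L, 4L, 3L, 3L, 3L, 1L)

# For each cluster, replace the row positions by 1, 2, 3, ... within that cluster
idx <- ave(seq_along(cluster), cluster, FUN = seq_along)
idx
# 1 2 1 2 3 1

# Combine cluster number and within-cluster index into the final label
labels <- sprintf("Cl%d_cell%d", cluster, idx)
labels
# "Cl4_cell1" "Cl4_cell2" "Cl3_cell1" "Cl3_cell2" "Cl3_cell3" "Cl1_cell1"
```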

data

df1 <- structure(list(
    Sample_long_name = c("S1_AAACCCAAGAGCCTGA", "S1_AAACCCAAGCTTAAGA",
        "S1_AAACCCACACGGCGTT", "S2_AAACCCACACTACCGG",
        "S3_AAACCCACAGCTGAGA", "S3_AAACCCACATAGATGA"),
    Cluster = c(4L, 4L, 3L, 3L, 3L, 1L),
    Sample_shortname = c("S1", "S1", "S1", "S2", "S3", "S3")),
    class = "data.frame", row.names = c(NA, -6L))

Thank you very much! I preferred the first solution. It works perfectly! — Bfu38, Jul 05 '20 at 19:01
