I have a large data frame, which I have subset here to simplify my question. It looks like this:

genome_ID     cluster  
p1.A2           1        
p1.A2           3         
p1.A2           3          
p1.A2           4          
p1.A3           2          
p1.A4           2          
p1.A5           1          
p1.A5           3

I would like to add a column 'phages' to the data frame that numbers each occurrence of a genome_ID, i.e. a running count within each genome_ID:

  genome_ID     cluster     phages
    p1.A2           1         1
    p1.A2           3         2
    p1.A2           3         3
    p1.A2           4         4
    p1.A3           2         1 
    p1.A4           2         1
    p1.A5           1         1
    p1.A5           3         2

As you can see, the genome_ID p1.A2 is present four times, so its rows are numbered 1-4 in the phages column. p1.A5 is present twice, so its rows are numbered 1-2. If a genome_ID were present fifty times, I would like the phages column to number its rows 1-50 (the order of numbering doesn't matter).

I need to do this so I can subset my dataset more easily to map it to a phylogeny (a biological tree showing evolutionary relationships)

If someone could give me insight into useful R packages and methods, that would be very helpful.

Rui Barradas
Taliamycota

1 Answer

Does this work?

library(dplyr)

df %>% group_by(genome_ID) %>% mutate(phages = row_number())
# A tibble: 8 x 3
# Groups:   genome_ID [4]
  genome_ID cluster phages
  <chr>       <dbl>  <int>
1 p1.A2           1      1
2 p1.A2           3      2
3 p1.A2           3      3
4 p1.A2           4      4
5 p1.A3           2      1
6 p1.A4           2      1
7 p1.A5           1      1
8 p1.A5           3      2
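
If you would rather avoid a package dependency, the same per-group numbering can be sketched in base R. This is only one possible alternative, not something from the answer above: it assumes the data frame is called df (rebuilt here from the values in the question) and uses ave() with seq_along() to count rows within each genome_ID.

```r
# Rebuild the example data from the question (values copied from the post).
df <- data.frame(
  genome_ID = c("p1.A2", "p1.A2", "p1.A2", "p1.A2",
                "p1.A3", "p1.A4", "p1.A5", "p1.A5"),
  cluster   = c(1, 3, 3, 4, 2, 2, 1, 3)
)

# ave() splits the row positions by genome_ID, applies seq_along() to each
# group, and returns a vector aligned with the original row order.
df$phages <- ave(seq_along(df$genome_ID), df$genome_ID, FUN = seq_along)

# Example of the kind of subsetting the question mentions:
# keep only the first occurrence of each genome_ID.
first_only <- df[df$phages == 1, ]
```

Because the numbering follows the current row order, sorting df beforehand changes which row gets which number; the question says that order does not matter.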
Karthik S
  • For some reason when I re-started my environment it is no longer working and is giving me this error: Error: `n()` must only be used inside dplyr verbs. Do you have any idea of why this is happening? – Taliamycota Jul 02 '21 at 20:35
  • I fixed it by adding dplyr:: in front of mutate – Taliamycota Jul 02 '21 at 20:37
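
A likely explanation for that error (an assumption, since the thread does not confirm it) is that another attached package, commonly plyr, exports its own mutate() and masks the dplyr one after a restart. row_number() then runs outside a dplyr verb and fails. A minimal sketch of the namespace-qualified version, with df rebuilt from the question's values:

```r
library(dplyr)

# Example data copied from the question.
df <- data.frame(
  genome_ID = c("p1.A2", "p1.A2", "p1.A2", "p1.A2",
                "p1.A3", "p1.A4", "p1.A5", "p1.A5"),
  cluster   = c(1, 3, 3, 4, 2, 2, 1, 3)
)

# Shows which package the unqualified mutate() currently resolves to.
environmentName(environment(mutate))

# Qualifying each verb with dplyr:: means a masking package cannot
# intercept the call, whatever the load order.
out <- df %>%
  dplyr::group_by(genome_ID) %>%
  dplyr::mutate(phages = dplyr::row_number()) %>%
  dplyr::ungroup()
```

Loading dplyr after any conflicting package should also avoid the masking, but the qualified calls do not depend on load order.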