I have a large dataset that includes information for multiple sequences, detailing their sequence ID, country of origin, clade, host and many other things. Each country has multiple different sequences and some countries contain sequences from multiple different clades. Is there a way to know the number of different clades for each different country, without having to test each country one by one (there are too many to realistically enter by hand)?

Vadim Kotov
    It's easier to help you if you include a simple [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) with sample input and desired output that can be used to test and verify possible solutions. – MrFlick Feb 17 '20 at 16:53

2 Answers

Without example data, I will rephrase your question as:

I have a large dataset that includes information for multiple starwars characters, detailing their eye color, homeworld, name, and many other things. Each homeworld has multiple different eye colors. Is there a way to know the number of different eye colors for each different homeworld, without having to test each homeworld one by one (there are too many to realistically enter by hand)?

Here, we can count how many times each combination of homeworld and eye color occurs in the data. For instance, there are three brown-eyed characters from Alderaan.

library(dplyr) 
starwars %>% count(homeworld, eye_color)
# A tibble: 66 x 3
   homeworld      eye_color     n
   <chr>          <chr>     <int>
 1 Alderaan       brown         3
 2 Aleen Minor    unknown       1
 3 Bespin         blue          1
 4 Bestine IV     blue          1
 5 Cato Neimoidia red           1
 6 Cerea          yellow        1
 7 Champala       blue          1
 8 Chandrila      blue          1
 9 Concord Dawn   brown         1
10 Corellia       brown         1
# … with 56 more rows

The previous result has one row per combination of homeworld and eye color, so counting its rows for each homeworld gives the number of different eye colors found there. This tells us that only one eye color (brown) is found on Alderaan, while Corellia has two.

starwars %>% count(homeworld, eye_color) %>% count(homeworld)
# A tibble: 49 x 2
   homeworld          n
   <chr>          <int>
 1 Alderaan           1
 2 Aleen Minor        1
 3 Bespin             1
 4 Bestine IV         1
 5 Cato Neimoidia     1
 6 Cerea              1
 7 Champala           1
 8 Chandrila          1
 9 Concord Dawn       1
10 Corellia           2
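
Applied to the original question, the same idea might look like the sketch below. The data frame `sequences` and its column names `country` and `clade` are assumptions made up for illustration, since no example data was given. The second pipeline is an alternative I am adding here, which uses `n_distinct()` to get the same numbers in one step.

```r
library(dplyr)

# Hypothetical toy data standing in for the real sequence table;
# the column names `country` and `clade` are assumptions.
sequences <- data.frame(
  seq_id  = c("s1", "s2", "s3", "s4", "s5", "s6"),
  country = c("Kenya", "Kenya", "Kenya", "Peru", "Peru", "Fiji"),
  clade   = c("A", "B", "A", "C", "C", "A"),
  stringsAsFactors = FALSE
)

# Two-step count, as above: one row per country/clade pair,
# then the number of such rows per country.
clades_two_step <- sequences %>%
  count(country, clade) %>%
  count(country, name = "n_clades")

# Alternative: count the distinct clades within each country directly.
clades_direct <- sequences %>%
  group_by(country) %>%
  summarise(n_clades = n_distinct(clade))

# Kenya has clades A and B, so its count is 2; Fiji and Peru each have 1.
clades_direct
```

Both versions should agree on this toy data. Note that a missing clade (`NA`) is counted as its own value in either version unless it is filtered out first.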
Jon Spring

Assuming your data frame df has columns named 'country' and 'clade', you can run:

aggregate(clade ~ country, data = df, FUN = function(x) length(unique(x)))
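
To see what this returns, here is a small self-contained sketch in base R. The contents of `df` are made up for illustration, and the `tapply()` line is an alternative that I am adding, which returns a named vector rather than a data frame.

```r
# Hypothetical toy data; only the column names 'country' and 'clade'
# are taken from the assumption above.
df <- data.frame(
  country = c("Kenya", "Kenya", "Kenya", "Peru", "Peru", "Fiji"),
  clade   = c("A", "B", "A", "C", "C", "A"),
  stringsAsFactors = FALSE
)

# One row per country; the 'clade' column holds the number of distinct clades.
clades_per_country <- aggregate(
  clade ~ country,
  data = df,
  FUN = function(x) length(unique(x))
)

# Alternative: the same counts as a vector named by country.
clade_counts <- tapply(df$clade, df$country, function(x) length(unique(x)))

clades_per_country
```

With the formula interface, `aggregate()` drops rows that have a missing country or clade by default (`na.action = na.omit`), so such rows do not contribute to the counts.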
Manoj Kumar Dhakad
George Savva