I have a large dataset that includes information for multiple sequences, detailing their sequence ID, country of origin, clade, host and many other things. Each country has multiple different sequences and some countries contain sequences from multiple different clades. Is there a way to know the number of different clades for each different country, without having to test each country one by one (there are too many to realistically enter by hand)?
It's easier to help you if you include a simple [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) with sample input and desired output that can be used to test and verify possible solutions. – MrFlick Feb 17 '20 at 16:53
2 Answers
Without example data, I will rephrase your question as:
I have a large dataset that includes information for multiple starwars characters, detailing their eye color, homeworld, name, and many other things. Each homeworld has multiple different eye colors. Is there a way to know the number of different eye colors for each different homeworld, without having to test each homeworld one by one (there are too many to realistically enter by hand)?
Here, we can count how many times each combination of homeworld and eye color occurs in the data. For instance, there are three brown-eyed characters from Alderaan.
    library(dplyr)
    starwars %>% count(homeworld, eye_color)

    # A tibble: 66 x 3
       homeworld      eye_color     n
       <chr>          <chr>     <int>
     1 Alderaan       brown         3
     2 Aleen Minor    unknown       1
     3 Bespin         blue          1
     4 Bestine IV     blue          1
     5 Cato Neimoidia red           1
     6 Cerea          yellow        1
     7 Champala       blue          1
     8 Chandrila      blue          1
     9 Concord Dawn   brown         1
    10 Corellia       brown         1
    # … with 56 more rows
We could add another step to count how many eye colors appear on each homeworld, by counting the number of rows for each homeworld from the step before. This tells us there is only one eye color found on Alderaan (brown).
    starwars %>% count(homeworld, eye_color) %>% count(homeworld)

    # A tibble: 49 x 2
       homeworld          n
       <chr>          <int>
     1 Alderaan           1
     2 Aleen Minor        1
     3 Bespin             1
     4 Bestine IV         1
     5 Cato Neimoidia     1
     6 Cerea              1
     7 Champala           1
     8 Chandrila          1
     9 Concord Dawn       1
    10 Corellia           2
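Mapped back onto the original question, the same idea might look like the sketch below. The `sequences` data frame and its column names (`seq_id`, `country`, `clade`) are made up for illustration, since the real data was not posted. It also uses `n_distinct()` inside `summarise()` rather than the two `count()` calls, which should give the same per-group number in one step.

```r
library(dplyr)

# Hypothetical stand-in for the real dataset; the column names are assumptions.
sequences <- data.frame(
  seq_id  = c("s1", "s2", "s3", "s4", "s5", "s6"),
  country = c("Brazil", "Brazil", "Brazil", "Kenya", "Kenya", "Japan"),
  clade   = c("A", "B", "A", "C", "C", "B")
)

# One row per country, with the number of distinct clades seen there.
clades_per_country <- sequences %>%
  group_by(country) %>%
  summarise(n_clades = n_distinct(clade))

print(clades_per_country)
```

Here Brazil has two clades (A and B), while Kenya and Japan have one each. Note that `n_distinct()` counts `NA` as its own value unless you pass `na.rm = TRUE`.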

Assuming your data frame `df` has columns named `country` and `clade`, you can run:

    aggregate(clade ~ country, data = df, FUN = function(x) length(unique(x)))
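To see what this returns, here is a minimal sketch on a made-up data frame. The name `sequences` and its contents are hypothetical, standing in for the real data.

```r
# Hypothetical toy data; only the two relevant columns are included.
sequences <- data.frame(
  country = c("Brazil", "Brazil", "Brazil", "Kenya", "Kenya", "Japan"),
  clade   = c("A", "B", "A", "C", "C", "B")
)

# For each country, count the distinct clade values.
clade_counts <- aggregate(clade ~ country, data = sequences,
                          FUN = function(x) length(unique(x)))

print(clade_counts)
```

The result is a data frame with one row per country, where the `clade` column now holds the count of distinct clades rather than the clade names. With the formula interface, rows with a missing `country` or `clade` are dropped by default (`na.action = na.omit`).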
