I have some biological (microbiome) data with a set of OTUs, each identified by a single name whose taxonomic resolution varies between the genus and phylum level. For each name, I am trying to build a table of its full lineage (phylum, class, order, and so on) down to the name I have been given.

 testnames <- c("Prevotella", "Bacteroides", "Enterobacteriaceae")

I've found taxize is a useful package for extracting the information I am looking for.

library("taxize")
reclass <- classification(testnames, db = 'ncbi')

This gets me a list of data frames, one per name, which can be entered into R like so:

structure(list(Prevotella = structure(list(name = c("cellular organisms", 
"Bacteria", "FCB group", "Bacteroidetes/Chlorobi group", "Bacteroidetes", 
"Bacteroidia", "Bacteroidales", "Prevotellaceae", "Prevotella"
), rank = c("no rank", "superkingdom", "no rank", "no rank", 
"phylum", "class", "order", "family", "genus"), id = c("131567", 
"2", "1783270", "68336", "976", "200643", "171549", "171552", 
"838")), .Names = c("name", "rank", "id"), row.names = c(NA, 
-9L), class = "data.frame"), Bacteroides = structure(list(name = c("cellular organisms", 
"Bacteria", "FCB group", "Bacteroidetes/Chlorobi group", "Bacteroidetes", 
"Bacteroidia", "Bacteroidales", "Bacteroidaceae", "Bacteroides"
), rank = c("no rank", "superkingdom", "no rank", "no rank", 
"phylum", "class", "order", "family", "genus"), id = c("131567", 
"2", "1783270", "68336", "976", "200643", "171549", "815", "816"
)), .Names = c("name", "rank", "id"), row.names = c(NA, -9L), class = "data.frame"), 
    Enterobacteriaceae = structure(list(name = c("cellular organisms", 
    "Bacteria", "Proteobacteria", "Gammaproteobacteria", "Enterobacterales", 
    "Enterobacteriaceae"), rank = c("no rank", "superkingdom", 
    "phylum", "class", "order", "family"), id = c("131567", "2", 
    "1224", "1236", "91347", "543")), .Names = c("name", "rank", 
    "id"), row.names = c(NA, -6L), class = "data.frame")), .Names = c("Prevotella", 
"Bacteroides", "Enterobacteriaceae"))

I'd really like to turn this into a data frame that I can import into, say, phyloseq as a taxonomy table, e.g. something that looks like:

name                Phylum          Class                Order             Family              Genus
Prevotella          Bacteroidetes   Bacteroidia          Bacteroidales     Prevotellaceae      Prevotella
Bacteroides         Bacteroidetes   Bacteroidia          Bacteroidales     Bacteroidaceae      Bacteroides
Enterobacteriaceae  Proteobacteria  Gammaproteobacteria  Enterobacterales  Enterobacteriaceae

One way to do this, of course, would be to write a loop that goes to each element of the list, finds the row whose rank is phylum (and likewise for the other ranks), and puts the name into a new data frame. That said, I feel like there should be a faster way to apply such a transformation, using something like plyr or dplyr.
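For reference, a minimal base R sketch of that loop-based approach. The `reclass` object here is a trimmed, hand-typed stand-in for the list above (two taxa, fewer ranks), and the names `ranks` and `tax` are made up for illustration:

```r
# Trimmed stand-in for the list of data frames shown above.
reclass <- list(
  Prevotella = data.frame(
    name = c("Bacteria", "Bacteroidetes", "Bacteroidia", "Bacteroidales",
             "Prevotellaceae", "Prevotella"),
    rank = c("superkingdom", "phylum", "class", "order", "family", "genus"),
    id   = c("2", "976", "200643", "171549", "171552", "838"),
    stringsAsFactors = FALSE
  ),
  Enterobacteriaceae = data.frame(
    name = c("Bacteria", "Proteobacteria", "Gammaproteobacteria",
             "Enterobacterales", "Enterobacteriaceae"),
    rank = c("superkingdom", "phylum", "class", "order", "family"),
    id   = c("2", "1224", "1236", "91347", "543"),
    stringsAsFactors = FALSE
  )
)

# Ranks to keep, in the column order wanted for the taxonomy table.
ranks <- c("phylum", "class", "order", "family", "genus")

# Start with one row per taxon, then add one column per rank.
tax <- data.frame(name = names(reclass), stringsAsFactors = FALSE)
for (r in ranks) {
  # match() returns NA when a taxon has no entry at this rank,
  # so the missing cell becomes NA instead of raising an error.
  tax[[r]] <- vapply(
    reclass,
    function(df) df$name[match(r, df$rank)],
    character(1),
    USE.NAMES = FALSE
  )
}

tax
```

Taxa that stop above genus level, such as Enterobacteriaceae, simply end up with `NA` in the columns they lack.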

I've seen some things that seem close:

Converting nested list to dataframe

Turn a list of lists with unnamed entries into a data frame or a tibble

but they seem to assume that there is less data one wants to discard, and that the data frames for each element are evenly sized. Any suggestions?

ohnoplus

1 Answer

Using dplyr and tidyr:

library(dplyr)
library(tidyr)

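# `list` is the list of data frames from the question; for a taxize result,
# strip its class first, e.g. list <- unclass(reclass)
tibble(names = names(list), data = list) %>%
  unnest(data) %>%
  filter(rank %in% c("phylum","class","order","family","genus")) %>%
  select(-id) %>%
  spread(rank, name) %>%
  select(name = names, phylum, class, order, family, genus)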

# A tibble: 3 × 6
                name         phylum               class            order             family       genus
*              <chr>          <chr>               <chr>            <chr>              <chr>       <chr>
1        Bacteroides  Bacteroidetes         Bacteroidia    Bacteroidales     Bacteroidaceae Bacteroides
2 Enterobacteriaceae Proteobacteria Gammaproteobacteria Enterobacterales Enterobacteriaceae        <NA>
3         Prevotella  Bacteroidetes         Bacteroidia    Bacteroidales     Prevotellaceae  Prevotella

What this does:

  1. Make a tibble with names of the lists and each nested list
  2. Unnest the lists
  3. Filter the values you want in the rank column
  4. Get rid of the id column
  5. Spread the rank rows into columns, and fill with the values from name
  6. Select the order you want, renaming names into name.
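
With current versions of dplyr and tidyr, the same steps can probably be written with `bind_rows()` and `pivot_wider()` instead. This is only a sketch, not the code above: `reclass` is a trimmed, hand-typed stand-in for the question's list, and `taxon`, `ranks` and `tax_table` are names made up for illustration.

```r
library(dplyr)
library(tidyr)

# Trimmed stand-in for the question's list of data frames.
reclass <- list(
  Prevotella = data.frame(
    name = c("Bacteria", "Bacteroidetes", "Bacteroidia", "Bacteroidales",
             "Prevotellaceae", "Prevotella"),
    rank = c("superkingdom", "phylum", "class", "order", "family", "genus"),
    id   = c("2", "976", "200643", "171549", "171552", "838"),
    stringsAsFactors = FALSE
  ),
  Enterobacteriaceae = data.frame(
    name = c("Bacteria", "Proteobacteria", "Gammaproteobacteria",
             "Enterobacterales", "Enterobacteriaceae"),
    rank = c("superkingdom", "phylum", "class", "order", "family"),
    id   = c("2", "1224", "1236", "91347", "543"),
    stringsAsFactors = FALSE
  )
)

ranks <- c("phylum", "class", "order", "family", "genus")

# unclass() is a no-op on a plain list; on a taxize result it drops the
# "classification" class so the object is treated as an ordinary list.
tax_table <- bind_rows(unclass(reclass), .id = "taxon") %>%  # stack, keep list names
  filter(rank %in% ranks) %>%                                # keep wanted ranks
  select(-id) %>%                                            # drop the NCBI ids
  pivot_wider(names_from = rank, values_from = name) %>%     # ranks become columns
  select(taxon, all_of(ranks))                               # fix column order

tax_table
```

Unlike `spread()`, `pivot_wider()` keeps rows in order of first appearance rather than sorting them by name.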
Jake Kaupp
  • Thanks for the suggestion @Jake Kaupp. For some reason, when I apply this to my data I am getting stuck at the unnest() portion of this exercise. If I run `list = reclass` from my data and then your suggested code, I get an error that begins "Error in bind_rows_(x, .id): Argument 1 must be a data frame or a named atomic vector, not a classification". I also get this error if I stop after the unnest() function. Am I missing something? – ohnoplus Oct 11 '17 at 19:57
  • The unnest() part mysteriously fixed itself; however, now filter doesn't seem to work. Specifically, `ldata <- tibble(names = names(reclass), reclass) %>% unlist()` returns an object of class "character", and when I then try to filter it with `dplyr::filter(ldata, rank %in% c("phylum","class","order","family","genus"))` I get Error in UseMethod("filter_"): no applicable method for 'filter_' applied to an object of class "character". This also happens if I just run your code as written. – ohnoplus Oct 11 '17 at 20:06
  • There isn't an `unlist()` in my solution. `unnest` is VERY different than `unlist()` – Jake Kaupp Oct 11 '17 at 20:09
  • Hmm. I'm running `tidyr_0.7.0` and `dplyr_0.7.2` – ohnoplus Oct 11 '17 at 20:12
  • Ah, that's why it magically started working: I went from typing unnest to typing unlist. So that ran but didn't give filter what it was expecting, which brings me back to why unnest() isn't working. If I run `ldata <- tibble(names = names(reclass), reclass)` and then `unnest(ldata)`, R complains: Error in bind_rows_(x, .id): Argument 1 must be a data frame or a named atomic vector, not a classification. But if I check the class of ldata with `class(ldata)` I get 'tbl_df' 'tbl' 'data.frame' – ohnoplus Oct 11 '17 at 20:17
  • Ok. This is strange. I tried running the block of code that I produced with dput from my nested list and everything actually worked. Thanks for solving my problem as written. The problem appears only when I run the code on the object generated by taxize. I wonder what the difference is between the former and the latter case. – ohnoplus Oct 11 '17 at 20:21
  • You can probably replicate my problem if you run the first three lines of my query, rather than starting from the "can be entered into R" part. It looks like the key difference is that the taxize object has two extra attributes: its class is "classification" and its "db" attribute is "ncbi". – ohnoplus Oct 11 '17 at 20:22
  • Ok. I figured it out. I need to run `class(reclass) <- NULL` before running your code on it. This way, dplyr treats reclass as a list (rather than a classification object), and doesn't get confused and give up. – ohnoplus Oct 11 '17 at 20:29
  • I had trouble getting this solution to work on some data, and `class(reclass) <- NULL` didn't solve the issue. However, adding an index to the reclass object when creating the tibble DID result in working code: `tibble(names = names(reclass), reclass[1:3]) %>% ...` – filups21 Nov 02 '17 at 22:57
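
The class issue discussed in these comments can be reproduced without calling taxize. A small sketch, assuming only what the comments report (a list with class "classification" and a "db" attribute); `fake` and `plain` are made-up names, and the object is built by hand rather than returned by `classification()`:

```r
# Hand-built stand-in for a taxize result: a list of data frames carrying
# the "classification" class and a "db" attribute.
fake <- structure(
  list(
    Prevotella = data.frame(name = "Prevotella", rank = "genus", id = "838",
                            stringsAsFactors = FALSE)
  ),
  class = "classification",
  db = "ncbi"
)

class(fake)   # "classification"

# unclass() returns a copy without the class attribute; it has the same
# effect as `class(fake) <- NULL` but leaves the original object untouched.
plain <- unclass(fake)

class(plain)         # "list"
attr(plain, "db")    # "ncbi" (other attributes are kept)
names(plain)         # "Prevotella"
```

Passing `plain` rather than `fake` into `tibble()` gives the later steps an ordinary list column to work with.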