Complex reorganisation of a data frame in R

Question

I've got some results of a DNA sequencing procedure that are not presented in a particularly helpful manner. Currently, the first column my data frame has the species name and the remaining columns contain all the plate wells where DNA from this species was detected. Here is what a sample of the data table looks like:

Species                 V1      V2      V3      V4      V5
Eupeodes corollae       1-G3    1-F1    1-E11   1-C10   1-A3
Diptera                 1-A10   1-B2    1-C1    1-G7    1-E11
Episyrphus balteatus    2-C3    2-A10   1-C11   1-A10   2-B4
Aphidie                 1-B9    1-D7    2-A3    1-C8    2-C11
Ericaphis               1-B9    1-D7    2-A3    1-C8    2-C11
Hemiptera               1-B9    1-D7    2-A3    1-C8    2-C11

Ultimately, I would like to end up with a data frame where the first column contains all plate wells and the remaining columns contain all the species that were identified in each well, like this:

Well  Species1              Species2                Species3
1-A1  Eupeodes corollae     Ericaphis
1-A2  Episyrphus balteatus  
1-A3  Aphidie        
1-A4  Hemiptera             Episyrphus balteatus    Aphidie
1-A5  Diptera

I'm guessing that this will be a two-step procedure where the table is first re-shaped to long format with a new record for each well-species match and then a second stage where the records are consolidated so that each well appears just once in the first column and all species found in that well are listed next to the well name. However, I'm afraid this sort of complex reshaping is beyond my capabilites in R. Can anyone advise on how I might go about doing this?

Adam Bethke · Answer 1 · 2017-10-13T16:30:01.673

Your thoughts are pretty spot on, and there's a great set of packages to do that pretty quickly.

In the tidyverse packages, the actions you're describing are encapsulated in functions called gather and spread, respectively. There's a really cool cheatsheet produced by R Studio which covers these kinds of data wrangling activities.

The trick with your data are that, normally, spread expects there to be a unique set of columns. The good news is you can get around this one of two ways:

1. Create a placeholder variable for the new unique columns, and spread using the placeholder as the key

    library(tidyr)
    library(dplyr)

    output <- 
        input %>%
        # bring all of the data into a long table
        gather(Plate, Well, V1:V5) %>%
        # remove the column with the old column names,
        # this column will cause problems in spread if not removed
        select(-Plate) %>% 
        # create the placeholder variable
        group_by(Well) %>%
        mutate(NewColumn = seq(1, n())) %>%
        # spread the data out based on the new column header
        spread(NewColumn, Species)

Depending on the use and whether you need it to, you could rename the header column before, or after, the spread function.

OR:

2. Alter the desired output slightly, to give you one column per species

    library(tidyr)
    library(dplyr)

    output <- 
        input %>%
        # bring all of the data into a long table
        gather(Plate, Well, V1:V5) %>%
        # remove the column with the old column names,
        # this column will cause problems in spread if not removed
        select(-Plate) %>% 
        # count the number of unique combinations of well and species
        count(Well, Species) %>%
        # spread out the counts
        # fill = 0 sets the values where no combinations exist to 0
        spread(Species, n, fill = 0)

This gives you a different output, but I mentioned it because it makes it easier to see if there are multiple instances of the same dataset (e.g. two of the same species), and sets the data up nicely for future analyses.

Reproducible data for reference:

input <- tibble(
    Species = c(
        "Eupeodes corollae",
        "Diptera",
        "Episyrphus balteatus",
        "Aphidie",
        "Ericaphis",
        "Hemiptera"
    ),
    V1 = c("1-G3 ", "1-A10", "2-C3", "1-B9", "1-B9", "1-B9"),
    V2 = c("1-F1", "1-B2", "2-A10", "1-D7", "1-D7", "1-D7"),
    V3 = c("1-E11", "1-C1" , "1-C11", "2-A3", "2-A3", "2-A3"),
    V4 = c("1-C10", "1-G7", "1-A10", "1-C8", "1-C8", "1-C8"),
    V5 = c("1-A3", "1-E11", "2-B4", "2-C11", "2-C11", "2-C11")
)

That cheatsheet is really useful for more straightforward problems and your comprehensive answer has been very helpful. I particularly appreciate the explanation in the first method. I've now been able to produce the table I was after thanks to that method. — D.Hodgkiss, Oct 13 '17 at 18:38

Complex reorganisation of a data frame in R

1 Answers1