2

I'm new to r and have a complicated set of data so hope my explanation is correct. I have multiple data frames I need to use to perform a series of things. Here's one example. I have three data frames. One is a list of species names and corresponding codes:

>df.sp
    Species Code
    Picea   PI
    Pinus   CA

Another is a list of sites with species abundance data for different locations (dir). Unfortunately, the order of the species are different.

>df.site
Site  dir total  t01 t02 t03 t04
2         Total   PI  CA  AB  T
2     N    9      1   5   na na
2                 AB  ZI PI CA
2     S    5     2   2  1  4
3                 DD  EE AB YT
3     N    6     1   1  5   3
3                 AB YT  EE  DD
3     S     5     4   3  1   1

Then I also have a data frame of traits corresponding to the species:

>df.trait
Species  leaft  rootl
Picea     0.01    1.2
Pinus     0.02    3.5

An example of one things I want to do is get the average value for each trait (df.trait$leaft and df.trait$rootl) for all the species per site (df.site$Site) and per site location (df.site$Site N, S). So the result would be for the first row:

Site dir leaft rootl
2    N   0.015  2.35

I hope that makes sense. It is very complicated for me to think through how to go about. I've attempted working from this post and this (and many others) but got lost. Thanks for the help. Really appreciated.

UPDATE: Here is a sample of the actual df.site (reduced) using dput:

> dput(head(df.site))
structure(list(Site = c(2L, 2L, 2L, 2L, 2L, 2L), dir = c("rep17316", 
"N", "", "S", "", "SE"), total = c("Total", "9", "", 
"10", "", "9"), t01 = c("PI", "4", "CA", "1", "SILLAC", 
"3"), t02 = c("CXBLAN", "3", "ZIZAUR", "4", "OENPIL", "2"), 
    t03 = c("ZIZAPT", "1", "ECHPUR", "2", "ASCSYR", "2")), .Names = c("site", "dir", "total", "t01", "t02", "t03"), row.names = 2:7, class = "data.frame")
KNN
  • 459
  • 4
  • 19
  • 2
    `df.site` looks to be in really bad shape where the columns don't have consistent type (e.g., the `total` column has words like `Total` and numbers like `9`, the `t04` column has words like `"T"` (unless that's a boolean TRUE?), strings like `"na"` that should probably be missing values `NA`, numbers like `4` and regular strings like `"CA"`. The first step will definitely need to be getting that data into shape. Could you share a subset of that data frame with `dput` so it is copy/pasteable and we can see all the column types? (Sharing the others with `dput()` would also be nice...) – Gregor Thomas Feb 12 '19 at 03:59
  • 3
    See [this excellent FAQ](https://stackoverflow.com/q/5963269/903061) if you need help making a reproducible example. `dput(droplevels(df.site[1:10, ]))` is probably good. – Gregor Thomas Feb 12 '19 at 04:00
  • 1
    A better question, however, might be to go back to however you imported `df.site` and fix the import process instead of trying to fix the messed up data that resulted.... without seeing a sample of the source it's hard to know if that would be less work or not. – Gregor Thomas Feb 12 '19 at 04:03
  • Thank you. Yes, the df.site is a nightmare, which is part of the problem. It was imported correctly, it is just a difficult csv file to work with. I added the reduced version using dput as suggested. – KNN Feb 12 '19 at 16:15
  • Not related to the R, but wouldn't your example `Species` column be more appropriately named `genus`? :) – DanTan Feb 12 '19 at 16:24

1 Answers1

3

You're going to have to first wrangle your data into a much cleaner form. I'm assuming the structure that you dput above is consistent throughout your df.site dataframe; namely that rows are paired, the first of which specifies the species code, the second of which has a count (or some other collected data?).

Starting with df as the dataframe that you dput() above, I'll first simulate some data for the other two dataframes:

df.sp <- data.frame(Species = paste0("species",1:8),
                    Code = c("ECHPUR", "CXBLAN", "ZIZAPT",
                             "CAMROT", "SILLAC", "OENPIL",
                             "ASCSYR", "ZIZAUR"))
df.sp
#>    Species   Code
#> 1 species1 ECHPUR
#> 2 species2 CXBLAN
#> 3 species3 ZIZAPT
#> 4 species4 CAMROT
#> 5 species5 SILLAC
#> 6 species6 OENPIL
#> 7 species7 ASCSYR
#> 8 species8 ZIZAUR

df.trait <- data.frame(Species = paste0("species",1:8),
                       leaft = round(runif(8, max=.2), 2),
                       rootl = round(runif(8, min=1, max=4),1))

df.trait
#>    Species leaft rootl
#> 1 species1  0.12   2.5
#> 2 species2  0.04   2.6
#> 3 species3  0.12   2.1
#> 4 species4  0.05   1.1
#> 5 species5  0.15   2.5
#> 6 species6  0.15   3.3
#> 7 species7  0.05   3.9
#> 8 species8  0.13   2.1

First, let's clean up df by moving these second rows containing collected data, and moving those values into a new set of columns:

library(dplyr)

df.clean <- df %>% 
  #for each row, copy the direction and total from the following row
  mutate_at(vars(matches("dir|total")), lead) %>% 
  #create new columns for observed data and fill in values from following row
  mutate_at(vars(matches("t\\d+$")), 
            .funs = funs(n = lead(.))) %>% 
  #filter to rows with species code in t01
  filter(t01 %in% df.sp$Code) %>% 
  #drop "total" column (doesn't make sense after reshape)
  select(-total)

df.clean
#>   site dir    t01    t02    t03 t01_n t02_n t03_n
#> 1    2   N ECHPUR CXBLAN ZIZAPT     4     3     1
#> 2    2   S CAMROT ZIZAUR ECHPUR     1     4     2
#> 3    2  SE SILLAC OENPIL ASCSYR     3     2     2

We now have two sets of corresponding columns which have species codes and values respectively. To reshape the dataframe into long form we'll use the melt() from the data.table package. See the responses to this question for other examples of how to do this.

library(data.table)

df.clean <- df.clean %>% 
  setDT() %>% #convert to data.table to use data.tabel::melt
  melt(measure.vars = patterns("t\\d+$", "_n$"),
       value.name = c("Code", "Count") ) %>% 
  #drop "variable" column, which isn't needed
  select(-variable)

Finally, join your three dataframes:

#merge tables together
df.summaries <- df.clean %>% 
  left_join(df.sp) %>% 
  left_join(df.trait)

At this point you should be able to summarize your data by whatever groupings you are interested in using group_by and summarise.

DanTan
  • 660
  • 7
  • 17
  • 1
    Wow. Thank you for spending so much time trying to help me! This is really, really helpful. I've been struggling for longer than I'd like to admit trying to get this into useable format. One huge step closer to what I need to do. Thanks, – KNN Feb 12 '19 at 23:25