
I would like to 'summarise' a factor variable in R, so that for each record I know what factor levels are present.

Here is a simplified example dataframe:

df <- data.frame(record= c("a","a","b","c","c","c"),
species = c("COD", "SCE", "COD", "COD","SCE","QSC"))

record species
     a     COD
     a     SCE
     b     COD
     c     COD
     c     SCE
     c     QSC

And this is what I am trying to achieve:

data.frame(record = c("a", "b", "c"), species = c("COD, SCE", "COD", "COD, SCE, QSC"))

    record       species
        a       COD, SCE
        b            COD
        c  COD, SCE, QSC

This is the closest I have been able to get, but it puts ALL levels of the factor with each record, rather than just the ones that should be present for each record.

summarise(group_by(df, record),
          species = (paste(levels(species), collapse="")))
record   species
   <fctr>   <chr>
      a CODQSCSCE      <- this should be CODSCE
      b CODQSCSCE      <- this should just be COD
      c CODQSCSCE      <- this is correct as CODQSCSCE as it has all levels

tapply has the same problem:

tapply(df$species, df$record, function(x) paste(levels(x), collapse=""))
   a           b           c 
"CODQSCSCE" "CODQSCSCE" "CODQSCSCE" 

I need to find a way to tell which combinations of species factors are present for each record.

M--
Shep
  • What should the desired result be if there is another row for record a that once again has 'COD'? Should COD be listed only once or twice? – Andrew Taylor Jun 28 '17 at 13:29

1 Answer


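Use unique() instead of levels(). levels() returns every level defined on the factor, whereas unique() returns only the values actually present in each group: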

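library(dplyr)

# Group by the record column from the question's data frame, then
# collapse the species observed within each group into one string.
df %>% 
    group_by(record) %>% 
    summarise(species = paste(unique(species), collapse = ", "))


# A tibble: 3 x 2
  record       species
  <fctr>         <chr>
1      a      COD, SCE
2      b           COD
3      c COD, SCE, QSC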
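
If you would rather stay in base R, the same idea should work with the tapply() call from the question. This is only a minimal sketch: it assumes species is a factor (hence the explicit stringsAsFactors = TRUE, which newer R versions no longer apply by default), and the names sub_b, all_levels, present and res are made up for illustration.

```r
# Same example data as in the question, with species forced to be a factor.
df <- data.frame(record  = c("a", "a", "b", "c", "c", "c"),
                 species = c("COD", "SCE", "COD", "COD", "SCE", "QSC"),
                 stringsAsFactors = TRUE)

# Subsetting a factor keeps its full set of levels, which is why
# levels() reported every species for every record.
sub_b      <- df$species[df$record == "b"]
all_levels <- levels(sub_b)                # all levels defined on the column
present    <- as.character(unique(sub_b))  # only the values observed for "b"

# Base R version of the dplyr answer: swap levels() for unique().
res <- tapply(df$species, df$record,
              function(x) paste(unique(x), collapse = ", "))
print(res)
```

Because unique() drops repeats, a species that occurs several times within a record is listed once. Removing unique() from the paste() call would keep the repeats instead.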
Andrew Taylor