Apologies if this is a simple issue. I have data in tidy (long) format. For each sample in Sample Name, I want to see which values of Factor Name are missing compared with the full set of Factor Name values. I believe it's possible with the group_by function.

# Groups:   Sample Name
  `Sample Name` `Factor Name`    mean
   <fct>         <fct>           <dbl>
 1 S1            ABCD            -5.15
 2 S1            EFGH             7.74
 3 S1            IJKL            -7.43
 4 S2            ABCD             4.35
 5 S2            EFGH            -2.15
 6 S2            IJKL             2.33
 7 S3            ABCD             5.53
 8 S3            EFGH             2.84
 9 S3            IJKL             1.61
10 S3            MNOP             NaN   

I've also tried aggregate, and while it gives an output, I would prefer a group_by or pipe-friendly method.

aggregate(`Factor Name` ~ `Sample Name`, df, FUN = function(x) setdiff(unique(df$`Factor Name`), x))

Also, if possible, I would like to add the missing Factor Name for each Sample Name, like so:

# Groups:   Sample Name
  `Sample Name` `Factor Name`    mean
   <fct>         <fct>           <dbl>
 1 S1            ABCD            -5.15
 2 S1            EFGH             7.74
 3 S1            IJKL            -7.43
 4 S1            MNOP             NaN
 5 S2            ABCD             4.35
 6 S2            EFGH            -2.15
 7 S2            IJKL             2.33
 8 S2            MNOP             NaN
 9 S3            ABCD             5.53
10 S3            EFGH             2.84
11 S3            IJKL             1.61
12 S3            MNOP             NaN   
  • Possible duplicate of [Fill missing values in data.frame using dplyr complete within groups](https://stackoverflow.com/questions/42866119/fill-missing-values-in-data-frame-using-dplyr-complete-within-groups) – BENY May 09 '18 at 15:48

1 Answer

The tidyr::expand and tidyr::complete functions come in handy for what you are trying to achieve.

Load packages:

library(dplyr)
library(tidyr)

Create a dummy dataset:

# tibble() replaces the deprecated data_frame(); dplyr re-exports it
df <- tibble(sample_name = factor(c(rep(c('S1', 'S2', 'S3'), each = 3), 'S3')),
             factor_name = factor(c(rep(c('ABCD', 'EFGH', 'IJKL'), 3), 'MNOP')),
             mean = rnorm(n = 10, sd = 10))

Question 1

Get differences in the set of values in factor_name for each sample in sample_name:

# Return ONLY those levels of sample_name that are missing a level of factor_name
df %>% 
    # Expand to all unique combinations
    expand(sample_name, factor_name) %>% 
    # Extract the difference
    setdiff(., select(df, -mean)) 

#> # A tibble: 2 x 2
#>   sample_name factor_name
#>   <fct>       <fct>      
#> 1 S1          MNOP       
#> 2 S2          MNOP

# Return ALL levels of sample_name, along with any missing levels of factor_name
df %>% 
    # Expand to all unique combinations
    expand(sample_name, factor_name) %>% 
    # Extract the difference
    setdiff(., select(df, -mean)) %>% 
    # Expand to show all levels of sample_name
    complete(sample_name)

#> # A tibble: 3 x 2
#>   sample_name factor_name
#>   <fct>       <fct>      
#> 1 S1          MNOP       
#> 2 S2          MNOP       
#> 3 S3          <NA>
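
Since the question asks specifically for a group_by approach, here is a minimal sketch of one possible alternative. It assumes the factor column still carries every level across the whole dataset, so levels() can be compared against the values present in each group. The names df_demo and missing_by_sample are made up for illustration, and the mean values are copied from the question rather than generated randomly.

```r
library(dplyr)
library(tidyr)

# Deterministic stand-in for the data (hypothetical name; values from the question)
df_demo <- tibble(
  sample_name = factor(c(rep(c("S1", "S2", "S3"), each = 3), "S3")),
  factor_name = factor(c(rep(c("ABCD", "EFGH", "IJKL"), 3), "MNOP")),
  mean        = c(-5.15, 7.74, -7.43, 4.35, -2.15, 2.33, 5.53, 2.84, 1.61, NaN)
)

missing_by_sample <- df_demo %>%
  group_by(sample_name) %>%
  # levels() lists every factor level, even those absent from the current group
  summarise(
    missing = list(base::setdiff(levels(factor_name), as.character(factor_name)))
  )

# One row per missing (sample, factor) pair; samples with nothing missing drop out
missing_pairs <- unnest(missing_by_sample, missing)
missing_pairs
```

The list column keeps one entry per sample, so missing_by_sample still contains S3 with an empty vector; unnest() then discards it unless keep_empty = TRUE is passed.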

Question 2

Add the missing factor_name for each sample_name:

# Expand to include ALL levels of factor_name within sample_name
df %>% 
    complete(sample_name, factor_name) 

#> # A tibble: 12 x 3
#>    sample_name factor_name     mean
#>    <fct>       <fct>          <dbl>
#>  1 S1          ABCD         16.6   
#>  2 S1          EFGH         -0.0803
#>  3 S1          IJKL          4.80  
#>  4 S1          MNOP         NA     
#>  5 S2          ABCD          3.80  
#>  6 S2          EFGH         -1.24  
#>  7 S2          IJKL          1.50  
#>  8 S2          MNOP         NA     
#>  9 S3          ABCD         -5.94  
#> 10 S3          EFGH         10.4   
#> 11 S3          IJKL        -14.3   
#> 12 S3          MNOP         -6.87
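
Two details from the question may matter here: the tibble is printed as grouped by Sample Name, and the desired output shows NaN rather than NA. The sketch below is one way to handle both, assuming a reasonably recent tidyr. It calls ungroup() before complete(), because recent tidyr versions do not allow completing on a grouping column, and it uses the fill argument to supply NaN for the new rows. The names df_demo and completed are hypothetical.

```r
library(dplyr)
library(tidyr)

# Deterministic stand-in for the data, grouped like the tibble in the question
df_demo <- tibble(
  sample_name = factor(c(rep(c("S1", "S2", "S3"), each = 3), "S3")),
  factor_name = factor(c(rep(c("ABCD", "EFGH", "IJKL"), 3), "MNOP")),
  mean        = c(-5.15, 7.74, -7.43, 4.35, -2.15, 2.33, 5.53, 2.84, 1.61, NaN)
) %>%
  group_by(sample_name)

completed <- df_demo %>%
  # Drop the grouping so sample_name can be used inside complete()
  ungroup() %>%
  # Add every sample/factor combination; new rows get NaN instead of NA
  complete(sample_name, factor_name, fill = list(mean = NaN))

completed
```

If NA is acceptable for the added rows, the fill argument can simply be left out.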

Created on 2018-05-10 by the reprex package (v0.2.0).

Peter K