3

I have a data frame that's maybe best approximated as:

library(data.table)
z <- rep("z",5)
y <- c(rep("st",2),rep("co",2),"fu")
var1 <- c(rep("a",2),rep("b",2),"c")
var2 <- c("y","y","y","z","x")
transp <- c("bus","plane","train","bus","bus")
sample1 <- sample(1:10, 5)
sample2 <- sample(1:10, 5)
df <- cbind(z,y,var1,var2,transp,sample1,sample2)
df<-as.data.table(df)
> df
   z  y var1 var2 transp sample1 sample2
1: z st    a    y    bus       4       3
2: z st    a    y  plane      10       7
3: z co    b    y  train       8       9
4: z co    b    z    bus       1       5
5: z fu    c    x    bus       6       4

All unique combinations of var1 and var2 already exist in the table. I want to expand the table so that all combinations of var1/var2 include all transp options found in a list:

transtype <- c("bus","train")

Notice "plane" is an option in df but not in transtype. I would like to keep the row that includes transp="plane" but not expand by adding rows with "plane". The columns z and y need to be filled in with the appropriate value and sample1 and sample2 should be NA. Result should be:

    > result
   z  y var1 var2 transp sample1 sample2
1: z st    a    y    bus       4       3
2: z st    a    y  plane      10       7
3: z st    a    y  train      NA      NA
4: z co    b    y  train       8       9
5: z co    b    y    bus      NA      NA
6: z co    b    z    bus       1       5
7: z co    b    z  train      NA      NA
8: z fu    c    x    bus       6       4
9: z fu    c    x  train      NA      NA

The data.table options I've come up with based on Fastest way to add rows for missing values in a data.frame? and Data.table: Add rows for missing combinations of 2 factors without losing associated descriptive factors end up expanding all unique combinations of var1 and var2, not just the combinations that already exist in the table. And I don't know how to keep the values of z and y. Like this:

setkey(df, var1, var2, transp)
x<-df[CJ(var1, var2, transp, unique=T)]

Maybe I should be using dplyr? Or maybe I'm missing something simple? I went through the data.table documentation and can't come up with a solution.

Frank
  • 66,179
  • 8
  • 96
  • 180
tjr
  • 607
  • 2
  • 7
  • 20
  • so, for every unique combination of `(z, y, var1, var2)`...if there is not an instance of bus or train, then add it with `NA`? – pyll Jun 09 '17 at 13:34
  • 1
    The difficulty of carrying over z and y comes from the decision to use a single table instead of multiple. You want to carry them over because they are attributes associated with (apparently) var1. There should be a separate table containing that relation. – Frank Jun 09 '17 at 14:07

2 Answers2

4

To get only the unique combination that already exist in df, it might be better to use by to create a new reference data.table and then merge that back with the original one.

Using:

df2 <- df[, .(transp = transtype), by = .(var1,var2)]
merge(df, df2, by = c('var1','var2','transp'), all = TRUE)

gives:

   var1 var2 transp  z  y sample1 sample2
1:    a    y    bus  z st       4       3
2:    a    y  plane  z st      10       7
3:    a    y  train NA NA      NA      NA
4:    b    y    bus NA NA      NA      NA
5:    b    y  train  z co       8       9
6:    b    z    bus  z co       1       5
7:    b    z  train NA NA      NA      NA
8:    c    x    bus  z fu       6       4
9:    c    x  train NA NA      NA      NA

If you don't the z and y columns to have NA-values, you could do:

df2 <- df[, .(transp = transtype), by = .(var1,var2,z,y)]
merge(df, df2, by = c('var1','var2','transp','z','y'), all = TRUE)

which gives:

   var1 var2 transp z  y sample1 sample2
1:    a    y    bus z st       4       3
2:    a    y  plane z st      10       7
3:    a    y  train z st      NA      NA
4:    b    y    bus z co      NA      NA
5:    b    y  train z co       8       9
6:    b    z    bus z co       1       5
7:    b    z  train z co      NA      NA
8:    c    x    bus z fu       6       4
9:    c    x  train z fu      NA      NA

NOTE: If the z and y columns have more than one unique value for each var1/var2 combo, it is better to use na.locf from the zoo package to fill the NA-values in the z and y columns.


Used data:

df <- fread("z  y var1 var2 transp sample1 sample2
 z st    a    y    bus       4       3
 z st    a    y  plane      10       7
 z co    b    y  train       8       9
 z co    b    z    bus       1       5
 z fu    c    x    bus       6       4")
Jaap
  • 81,064
  • 34
  • 182
  • 193
  • Should z and y be included in `df2` and the subsequent `by` statement? This produces the desirable output for the example, just wondering if approach should be more general...question was for tjr. – pyll Jun 09 '17 at 13:37
  • 1
    @pyll no, see the description after the first code-block in the question – Jaap Jun 09 '17 at 13:38
  • Ok, I see that now. Can you explain or point to some documentation about the `.` in your code? What is this piece doing? – pyll Jun 09 '17 at 13:40
  • 1
    @pyll the `.` is data.table idiom for list – Jaap Jun 09 '17 at 13:41
  • @pyll on 2nd thought, including `z` and `y` better resembles the desired output of OP – Jaap Jun 09 '17 at 13:42
  • Right, even though they are not needed for join condition, it is required to get values. Nice correction. – pyll Jun 09 '17 at 13:51
  • 1
    Perfect, thank you for the quick response and good clarification pyll. – tjr Jun 09 '17 at 14:03
4

Here is a solution using dplyr and tidyr, in particular, tidyr::complete and tidyr::nesting. The latter is useful to complete using the combination in the dataset, whereas complete will give you all the combinations.

library(dplyr)
library(tidyr)
df %>% 
  filter(transp %in% transtype)  %>%
  complete(nesting(z, y, var1, var2), transp) %>%
  union(df)
# A tibble: 9 <U+00D7> 7
      z     y  var1  var2 transp sample1 sample2
  <chr> <chr> <chr> <chr>  <chr>   <chr>   <chr>
1     z    st     a     y  plane      10      10
2     z    st     a     y  train    <NA>    <NA>
3     z    st     a     y    bus       1       9
4     z    fu     c     x  train    <NA>    <NA>
5     z    fu     c     x    bus       5       3
6     z    co     b     z  train    <NA>    <NA>
7     z    co     b     z    bus       6       6
8     z    co     b     y  train       3       2
9     z    co     b     y    bus    <NA>    <NA>
FlorianGD
  • 2,336
  • 1
  • 15
  • 32
  • This works as well. Always good to have two solutions for future problems of this sort. – tjr Jun 09 '17 at 14:04