is there an elegant way to use combination of "Primary Key value's" in a dataset to calculate values for combined Key values?

Question

NOTE: this is not a time-serious problem that can be easily solved with zoo (or least, I do not understand how to zoo this problem :( )

I have a data set with a number of "key columns" and a value associated for combinations where only one of the key columns are set. vales for rows were more than one key column is set can be calculated based on the "one key column set" rows.

Using normal programming techniques, this is fairly simple (yet messy) to do as is shown below. I am hoping there is a nicer and more elegant way to do this in R.

In this example, I have three key's, for a combination key value e.g. [1,1,0] = I would calculate the value based on the two primary keys Val[1,0,0] and Val[0,1,0]. In this example where I use a simple mean this is mean(2,5) = 3.5.

myMatrix <- tribble(
  ~`1`, ~`2`, ~`3`, ~Val,
  0,0,0,1,
  1,0,0,2,
  2,0,0,2,
  0,1,0,5,
  1,1,0,NA,
  2,1,0,NA,
  0,2,0,6,
  1,2,0,NA,
  2,2,0,NA,
  0,0,1,1,
  1,0,1,NA,
  2,0,1,NA,
  0,1,1,NA,
  1,1,1,NA,
  2,1,1,NA,
  0,2,1,NA,
  1,2,1,NA,
  2,2,1,NA
  )

#Filter for NA in the Val col
tmpNARows <- myMatrix %>% filter(is.na(Val)) %>% select(-Val)
#Take the 
tmpFirstRow <- TRUE
for (myR in 1:nrow(tmpNARows)) {
  #For each row in the NA table
  tmpMyNARow<-tmpNARows[myR,]
  tmpFirstElement <- TRUE
  for (myC in 1:ncol(tmpMyNARow)) {
    #find the records that make up this one's parts 
    #ignore columns with value 0
    if (0 != tmpMyNARow[myC]) { 
      #Make Base Record for lookup
      tmpMyBaseRow <- tmpMyNARow
      for (myC2 in 1:ncol(tmpMyNARow)) {
        if (myC2!=myC) { tmpMyBaseRow[myC2] <- 0 }
      }
      if(tmpFirstElement == TRUE) {
        #Make a new Base table
        tmpMyBaseTable <- tmpMyBaseRow       
        tmpFirstElement <- FALSE
      } else {
        #Append the Base row to the Base table
        tmpMyBaseTable <- union(tmpMyBaseTable, tmpMyBaseRow)
      }
    }
  }
  #Calculate the mean and store in as Val
  tmpVal <- (left_join(tmpMyBaseTable, myMatrix) %>% summarise(mean(Val)))[[1]]
  tmpMyNARowWithVal <- tmpMyNARow %>% mutate(Val = tmpVal)
  if (tmpFirstRow == TRUE) {
    tmpMyResultMatrix <- tmpMyNARowWithVal
    tmpFirstRow <- FALSE
  } else {
    tmpMyResultMatrix <- union(tmpMyResultMatrix,tmpMyNARowWithVal)
  }
}
#filter for non NA
tmpNonNARows <- myMatrix %>% filter(!is.na(Val))
#Add the calculated rows
myCalculatedMatrix <- union(tmpNonNARows, tmpMyResultMatrix)

#lets have a look
myCalculatedMatrix
#the (1,1,0) element is indeed 3.5 so it appears to be working.

The expected result should look like

myCalculatedMatrix %>% arrange_all()

# A tibble: 18 x 4
     `1`   `2`   `3`      Val
   <dbl> <dbl> <dbl>    <dbl>
 1     0     0     0 1.000000
 2     0     0     1 1.000000
 3     0     1     0 5.000000
 4     0     1     1 3.000000
 5     0     2     0 6.000000
 6     0     2     1 3.500000
 7     1     0     0 2.000000
 8     1     0     1 1.500000
 9     1     1     0 3.500000
10     1     1     1 2.666667
11     1     2     0 4.000000
12     1     2     1 3.000000
13     2     0     0 2.000000
14     2     0     1 1.500000
15     2     1     0 3.500000
16     2     1     1 2.666667
17     2     2     0 4.000000
18     2     2     1 3.000000

@Sotos. I believe my question is not a duplicate and have included additional language to explain why. Could you kindly remove your duplicate mark. — Sylvain, Aug 10 '17 at 14:22

Uwe · Answer 1 · 2017-08-12T00:08:31.420

Although the question is explicitely tagged with dplyr I've started with a data.table solution, which I hope is more "elegant". A least it avoids the nested for loops.

Edit: I've added a dplyr/tidyr version of the data.table approach.

The OP has a data set with a number of "key columns" and a value associated for combinations where only one of the key columns is set. Then there is a second data set where more than one key column is set and the value is missing. The task is calculate the missing values based on the "one key column set" rows of the first data set.

Unfortunately, the given data myMatrix contains a mix of both data sets which adds to the complexity of the question.

`data.table` solution

library(data.table)

# convert to data.table, add column with row numbers for subsequent join
DT <- data.table(myMatrix)[, rn := .I]
# reshape from wide to long format, 
# rename column using a self-explanatory name
DT_long <- melt(DT, id.vars = c("rn", "Val"), na.rm  = TRUE, value.name = "key")
# extract primary keys
primary_keys <- DT_long[!is.na(Val) & key > 0]
primary_keys

   rn Val variable key
1:  2   2        1   1
2:  3   2        1   2
3:  4   5        2   1
4:  7   6        2   2
5: 10   1        3   1

# right join to keep all rows in DT_long
result <- primary_keys[DT_long, on = c("variable", "keys")][
  # calculate new Val by aggregating row-wise
  , .(calcVal = mean(c(Val, i.Val), na.rm = TRUE)), by = .( rn = i.rn)]        
result

    rn  calcVal
 1:  1 1.000000
 2:  2 2.000000
 3:  3 2.000000
 4:  4 5.000000
 5:  5 3.500000
 6:  6 3.500000
 7:  7 6.000000
 8:  8 4.000000
 9:  9 4.000000
10: 10 1.000000
11: 11 1.500000
12: 12 1.500000
13: 13 3.000000
14: 14 2.666667
15: 15 2.666667
16: 16 3.500000
17: 17 3.000000
18: 18 3.000000

# join calculated values with original table, remove row numbers as no longer needed
result <- result[DT, on = "rn"][, rn := NULL][]

# beautify result for easier comparison
result[, setcolorder(.SD, c(names(myMatrix), "calcVal"))][, setorderv(.SD, names(.SD))]

    1 2 3 Val  calcVal
 1: 0 0 0   1      NaN
 2: 0 0 1   1 1.000000
 3: 0 1 0   5 5.000000
 4: 0 1 1  NA 3.000000
 5: 0 2 0   6 6.000000
 6: 0 2 1  NA 3.500000
 7: 1 0 0   2 2.000000
 8: 1 0 1  NA 1.500000
 9: 1 1 0  NA 3.500000
10: 1 1 1  NA 2.666667
11: 1 2 0  NA 4.000000
12: 1 2 1  NA 3.000000
13: 2 0 0   2 2.000000
14: 2 0 1  NA 1.500000
15: 2 1 0  NA 3.500000
16: 2 1 1  NA 2.666667
17: 2 2 0  NA 4.000000
18: 2 2 1  NA 3.000000

Note that the data.table code above has been written to explain the processing steps. Re-writing the code using more chaining would make it more concise as some intermediate results could be skipped.

`dplyr` / `tidyr` solution

The code below is a "translation" of the data.table solution:

library(dplyr)
library(tidyr)

tmpMatrix <- myMatrix %>% 
  mutate(rn = row_number()) 
tmpLong <- tmpMatrix  %>% 
  gather(Col, Keys, -Val, -rn) %>% 
  print()
tmpPrimKeys <- tmpLong %>% 
  filter(!is.na(Val) & Keys > 0) %>% 
  select(-rn) %>% 
  print()   
tmpLong %>% 
  left_join(tmpPrimKeys, by = c("Col", "Keys")) %>% 
  group_by(rn) %>% 
  summarise(calcVal = mean(c(Val.x, Val.y), na.rm = TRUE)) %>% 
  inner_join(tmpMatrix, by = "rn") %>% 
  select(num_range("", 1:3), Val, calcVal) %>% 
  arrange_all()

# A tibble: 18 x 5
     `1`   `2`   `3`   Val  calcVal
   <dbl> <dbl> <dbl> <dbl>    <dbl>
 1     0     0     0     1 1.000000
 2     0     0     1     1 1.000000
 3     0     1     0     5 5.000000
 4     0     1     1    NA 3.000000
 5     0     2     0     6 6.000000
 6     0     2     1    NA 3.500000
 7     1     0     0     2 2.000000
 8     1     0     1    NA 1.500000
 9     1     1     0    NA 3.500000
10     1     1     1    NA 2.666667
11     1     2     0    NA 4.000000
12     1     2     1    NA 3.000000
13     2     0     0     2 2.000000
14     2     0     1    NA 1.500000
15     2     1     0    NA 3.500000
16     2     1     1    NA 2.666667
17     2     2     0    NA 4.000000
18     2     2     1    NA 3.000000

is there an elegant way to use combination of "Primary Key value's" in a dataset to calculate values for combined Key values?

NOTE: this is not a time-serious problem that can be easily solved with zoo (or least, I do not understand how to zoo this problem :( )

1 Answers1

data.table solution

dplyr / tidyr solution

`data.table` solution

`dplyr` / `tidyr` solution