4

Say I have a data.table such as: (or with numbers and NAs)

temp <- data.table(M=c(NA,T,T,F,F,F,NA,NA,F), 
                   P=c(T,T,T,F,F,F,NA,NA,NA), S=c(T,F,NA,T,F,NA,NA,NA,NA))

    M     P     S
   NA  TRUE  TRUE
 TRUE  TRUE FALSE
 TRUE  TRUE    NA
FALSE FALSE  TRUE
FALSE FALSE FALSE
FALSE FALSE    NA
   NA    NA    NA
   NA    NA    NA
FALSE    NA    NA

And I want to check if whenever a variable is NA implies that the values of a second variable are all NA as well. To check if some variables are linked to other.

For example, whenever P=NA we have also S=NA.

This code works properly for two single columns:

temp[is.na(P),all(is.na(S))]

gives TRUE

and

temp[is.na(S),all(is.na(P))]

gives FALSE because the sixth row is S=NA but P!=NA.

Now my question. I would like to generalize it, checking all pairs in my data.table and print what pairs are "linked".
I'd prefer to print only the results that are TRUE, ignoring the FALSE ones because most pairs in my real data.table won't be linked, and I have 550 variables.

I've tried this code:

temp[, lapply(.SD, function(x) temp[is.na(x), 
                 lapply(.SD, function(y)  all(is.na(y)) )]]

I get this error

Error: unexpected ']' in: "temp[, lapply(.SD, function(x) temp[is.na(x), lapply(.SD, function(y) all(is.na(y)) )]]"

I could try with a for loop but I'd prefer the typical data.table syntax. Any suggestion is welcome.

I would also like to know how to refer to two different .SD when you are nesting data.table calls.

skan
  • 7,423
  • 14
  • 59
  • 96

2 Answers2

5

For combinations in pairs, crossprod seems yet useful.

We only care for whether a value is NA or not:

NAtemp = is.na(temp)

Compare the co-existence of NAs:

crossprod(NAtemp)
#  M P S
#M 3 2 2
#P 2 3 3
#S 2 3 5

with the number of NA per column:

colSums(NAtemp)
#M P S 
#3 3 5

like:

ans = crossprod(NAtemp) == colSums(NAtemp)
ans
#      M     P     S
#M  TRUE FALSE FALSE
#P FALSE  TRUE  TRUE
#S FALSE FALSE  TRUE

And use the convenient as.data.frame.table to format:

subset(as.data.frame(as.table(ans)), Var1 != Var2)
#  Var1 Var2  Freq
#2    P    M FALSE
#3    S    M FALSE
#4    M    P FALSE
#6    S    P FALSE
#7    M    S FALSE
#8    P    S  TRUE
alexis_laz
  • 12,884
  • 4
  • 27
  • 37
  • Wouldn't it be possible to do it all with data.table? (for speed reasons) – skan Dec 28 '16 at 00:54
  • @skan : I believe the only thing that can be impoved (regarding efficiency) is the last line -- e.g. something like `ij = which(col(ans) != row(ans), TRUE); data.frame(v1 = rownames(ans)[ij[, "row"]], v2 = colnames(ans)[ij[, "col"]], val = ans[ij])` might avoid some overhead. `crossprod` offers a very convenient algorithm to compute all pairwise combinations efficiently on matrices and, also, the 550 * 550 resulting matrix results in no memory concerns. Have you found any bottleneck on a benchmark? – alexis_laz Dec 28 '16 at 01:16
  • OK, tomorrow I will apply your answer and akrun one to my big dataset and do a benchmark and decide who is the winner. – skan Dec 28 '16 at 01:53
  • You solution is very fast. I'm only afraid the crossprod doesn't scale well on memory. – skan Dec 28 '16 at 15:07
  • 1
    @skan : Indeed, all these matrix operations might run out of memory though, in your case of 550 columns, I'd say there should be no problem. Alternatively, you could try a sparse alternative of the above -- `library(Matrix); sNAtemp = as(NAtemp, "sparseMatrix"); crossprod(sNAtemp) == colSums(sNAtemp)` – alexis_laz Dec 28 '16 at 15:46
  • I have few columns but a lot of rows (more than 1 million) I guess the final matrix is not a problem, the problem could be intermediate calculation with the crossproduct. – skan Dec 28 '16 at 21:07
  • @skan : `crossprod` allocates only a `ncol(NAtemp) * ncol(NAtemp)` structure (here, 550 * 550) with no significant intermediate allocations. So, I guess, the only time `crossprod` would run out of memory is if the number of columns is large; in which case, storing sparse matrices should help as the result stores only half of the symmetric matrix. – alexis_laz Dec 28 '16 at 23:29
  • What if I want to further restrict the variables to be used as first variable and the variables to be used as second variable? How can I provide two vectors with their names to your code? Or maybe it's easier to filter the results later. – skan Dec 28 '16 at 23:59
  • 1
    @skan : If I understand correctly, assuming `temp$X = c(TRUE, TRUE, TRUE, NA, NA, FALSE, TRUE, FALSE, NA); temp$Y = c(FALSE, NA, NA, TRUE, TRUE, TRUE, TRUE, TRUE, FALSE)`, you could use, say, `NAtemp = is.na(temp); crossprod(NAtemp[, c("M", "S", "X")], NAtemp[, c("P", "Y")])` and compare with `colSums(NAtemp[, c("M", "S", "X")])` – alexis_laz Dec 29 '16 at 10:23
  • I've found a little "error" in your basic algorithm. If a column doesn't contain any NA it shouldn't appear as principal variable but it appears. In order to avoid it you need to do: **ans = (crossprod(NAtemp) == colSums(NAtemp)) & (colSums(NAtemp)!=0)** – skan Jan 04 '17 at 00:33
  • I've thought of a variation. I could assign groups to the variables and let the variables to combine only with variables in other groups (useful when you have many). For example if I want combinations of M with S, and P with S I can define mycomb=c(1,1,2) and use **ans = (crossprod(NAtemp) == colSums(NAtemp)) & (colSums(NAtemp)!=0) & outer(ccc,ccc,"!=")**. Though there should be a faster method filtering before the crossprod. – skan Jan 04 '17 at 01:24
  • 1
    @skan : You're right about the case where no NA is in a column; I missed it. Regarding the combinations you need, will there always be two groups, or more and you need a pairwise comparison of the groups of columns? (I'm assuming `ccc` is `mycomb`?) Probably the fastest way would be to subset `NAtemp` before passing it to `crossprod`/`colSums`. For example here, I guess you could use `spl = split(colnames(NAtemp), mycomb); crossprod(NAtemp[, spl[[1]]], NAtemp[, spl[[2]], drop = FALSE]) == colSums(NAtemp[, spl[[1]]])` etc.. – alexis_laz Jan 04 '17 at 11:44
  • More than 2 groups but my example here is very small. Any other idea to restrict combinations is wellcome – skan Jan 05 '17 at 00:07
  • excuse I bother you again. What if you want to check if var1=(NA or 0) implies var2=NA ? (add the zero comparison for the first). – skan Jan 08 '17 at 20:18
  • @skan : I guess, the most straightforward way is to replace any 0, in the appropriate columns, with `NA` at start? – alexis_laz Jan 09 '17 at 09:19
  • At first I also thought that but it would affect not only to zeroes on the first variables but also on the second ones. I only want to consider zeroes on the first ones. – skan Jan 09 '17 at 11:44
  • 1
    @skan : I think a simple `is.na(temp[, vars_to_change]) = !temp[, vars_to_change]` -where `vars_to_change` is a vector containing the columns names to replace 0 with `NA`- should suffice. (I don't think the previous syntax is valid for "data.table"s, though; it might need adjustments or, you could coerce your data to a "matrix" if it's only a TRUE/FALSE/NA structure) – alexis_laz Jan 09 '17 at 12:15
  • Is your crossprod method easily generalizable to other comparisons? For example: First var different to NA implies second var different to NA in at least one row. I guess this example would be crossprod(!NAtemp) >= 1, but other more complex could become difficult. – skan Jan 15 '17 at 00:13
  • I'm not sure, but having `crossprod` that captures all cooccurences in pairs, I think that some combination of relational operations with zero and/or `colSums` should provide a way. Could you provide a specific example that is problematic in the current setup? – alexis_laz Jan 15 '17 at 19:47
  • @skan : Sorry, forgot to notify you for the previous comment – alexis_laz Jan 16 '17 at 09:56
3

We can try with combn

unlist(combn(names(temp), 2, FUN = function(nm)
  list(setNames(temp[is.na(get(nm[1])), all(is.na(get(nm[2])))], paste(nm, collapse="-"))))) 
#   M-P   M-S   P-S 
# FALSE FALSE  TRUE 

Or if we also need all the combinations

d1 <- CJ(names(temp), names(temp))[V1!=V2]
d1[,  .(index=temp[is.na(get(V1)), all(is.na(get(V2)))]) , .(V1, V2)]
#    V1 V2 index
#1:  M  P FALSE
#2:  M  S FALSE
#3:  P  M FALSE
#4:  P  S  TRUE
#5:  S  M FALSE
#6:  S  P FALSE
akrun
  • 874,273
  • 37
  • 540
  • 662
  • Your solution is very compact and uses data.table but I don't know why when the data have many columns (mine have 500 but you can see it even with only 50) @alexis_laz solution is orders of magnitude faster. – skan Dec 28 '16 at 15:00
  • And I think your first code combn(names(temp)) needs to be repeated with the variable order reversed too. – skan Jan 15 '17 at 13:11