
I want to select columns from a data frame so that the resulting continuous column sequences are as long as possible, while the number of rows with NAs is as small as possible, because those rows have to be dropped afterwards.

(The reason I want to do this is that I want to run TraMineR::seqsubm() to automatically get a matrix of transition costs (by transition probability) and later run cluster::agnes() on it. TraMineR::seqsubm() doesn't like NA states, and cluster::agnes() with NA states in the matrix doesn't necessarily make much sense.)

For that purpose I already wrote a working function that in principle computes all possible power-set subsets and checks them for NAs. It works well with this toy data d, which (excluding the id column) represents an 11 x 5 matrix:

> d
   id X1 X2 X3 X4 X5
1   A  1 11 21 31 41
2   B  2 12 22 32 42
3   C  3 13 23 33 NA
4   D  4 14 24 34 NA
5   E  5 15 25 NA NA
6   F  6 16 26 NA NA
7   G  7 17 NA NA NA
8   H  8 18 NA NA NA
9   I  9 NA NA NA NA
10  J 10 NA NA NA NA
11  K NA NA NA NA NA

The problem now is that I actually want to apply the algorithm to survey data that would represent a 34235 x 17 matrix!

My code has been reviewed on Code Review, but it still cannot be applied to the real data.

I am aware that this approach would involve a huge amount of calculation. (Presumably too huge for non-supercomputers?!)

Does anyone know a more suitable approach?

Here is the function, already enhanced by @minem on Code Review:

seqRank2 <- function(d, id = "id") {
  require(matrixStats)

  # change structure, convert to matrix
  ii <- as.character(d[, id])
  dm <- d
  dm[[id]] <- NULL
  dm <- as.matrix(dm)
  rownames(dm) <- ii

  your.powerset = function(s){
    l = vector(mode = "list", length = 2^length(s))
    l[[1]] = numeric()
    counter = 1L
    for (x in 1L:length(s)) {
      for (subset in 1L:counter) {
        counter = counter + 1L
        l[[counter]] = c(l[[subset]], s[x])
      }
    }
    return(l[-1])
  }

  psr <- your.powerset(ii)
  psc <- your.powerset(colnames(dm))

  sss <- lapply(psr, function(x) {
    i <- ii %in% x
    lapply(psc, function(y) dm[i, y, drop = FALSE])
  })

  cn <- sapply(sss, function(x)
    lapply(x, function(y) {
      if (ncol(y) == 1) {
        if (any(is.na(y))) return(NULL)
        return(y)
      }
      isna2 <- matrixStats::colAnyNAs(y)
      if (all(isna2)) return(NULL)
      if (sum(isna2) == 0) return(NA)
      r <- y[, !isna2, drop = FALSE]
      return(r)
    }))

  scr <- sapply(cn, nrow)
  scc <- sapply(cn, ncol)

  namesCN <- sapply(cn, function(x) paste0(colnames(x), collapse = ", "))
  names(scr) <- namesCN
  scr <- unlist(scr)

  names(scc) <- namesCN
  scc <- unlist(scc)

  m <- t(rbind(n.obs = scr, sq.len = scc))
  ag <- aggregate(m, by = list(sequence = rownames(m)), max)
  ag <- ag[order(-ag$sq.len, -ag$n.obs), ]
  rownames(ag) <- NULL
  return(ag)
}

Yielding:

> seqRank2(d)
         sequence n.obs sq.len
1  X1, X2, X3, X4     4      4
2      X1, X2, X3     6      3
3      X1, X2, X4     4      3
4      X1, X3, X4     4      3
5      X2, X3, X4     4      3
6          X1, X2     8      2
7          X1, X3     6      2
8          X2, X3     6      2
9          X1, X4     4      2
10         X2, X4     4      2
11         X3, X4     4      2
12             X1    10      1
13             X2     8      1
14             X3     6      1
15             X4     4      1
16             X5     2      1

> system.time(x <- seqRank2(d))
   user  system elapsed 
   1.93    0.14    2.93 

In this case I would choose X1, X2, X3, X4; X1, X2, X3; or X2, X3, X4, because they're continuous and yield an appropriate number of observations.

Expected output:

So for toy data d the expected output would be something like:

> seqRank2(d)
sequence n.obs sq.len
1  X1, X2, X3, X4     4      4
2      X1, X2, X3     6      3
3      X2, X3, X4     4      3
4          X1, X2     8      2
5          X2, X3     6      2
6          X3, X4     4      2
7              X1    10      1
8              X2     8      1
9              X3     6      1
10             X4     4      1
11             X5     2      1
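
For reference, one way to restrict that output to continuous sequences is to check that the numeric suffixes of the listed column names step by exactly one. This is only a rough sketch assuming the X<number> naming used here; is_continuous() is a made-up helper:

is_continuous <- function(seq_string) {
  # "X1, X2, X3" -> 1 2 3; continuous iff all gaps between sorted indexes are 1
  idx <- as.integer(gsub("X", "", strsplit(as.character(seq_string), ",\\s*")[[1]]))
  length(idx) == 1L || all(diff(sort(idx)) == 1L)
}
ag <- seqRank2(d)
ag[sapply(ag$sequence, is_continuous), ]  # keeps only the continuous runs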

And in the end the function should run properly on the huge matrix d.huge, which currently leads to an error:

> seqRank2(d.huge)
Error in vector(mode = "list", length = 2^length(s)) : 
  vector size cannot be infinite
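
(The error itself is expected: the power set is first built over the 30,000 row ids of d.huge, and 2^30000 overflows to Inf in double precision, so vector() cannot allocate the list. A quick check:)

2^30000
# [1] Inf    (exceeds .Machine$double.xmax, ~1.8e308)
vector(mode = "list", length = 2^30000)
# Error in vector(mode = "list", length = 2^30000) :
#   vector size cannot be infinite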

Toy data d:

d <- structure(list(id = structure(1:11, .Label = c("A", "B", "C", 
"D", "E", "F", "G", "H", "I", "J", "K"), class = "factor"), X1 = c(1L, 
2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, NA), X2 = c(11L, 12L, 13L, 
14L, 15L, 16L, 17L, 18L, NA, NA, NA), X3 = c(21L, 22L, 23L, 24L, 
25L, 26L, NA, NA, NA, NA, NA), X4 = c(31L, 32L, 33L, 34L, NA, 
NA, NA, NA, NA, NA, NA), X5 = c(41L, 42L, NA, NA, NA, NA, NA, 
NA, NA, NA, NA)), row.names = c(NA, -11L), class = "data.frame")

Toy data d.huge:

d.huge <- setNames(data.frame(matrix(1:15.3e5, 3e4, 51)), 
                   c("id", paste0("X", 1:50)))
d.huge[, 41:51] <- lapply(d.huge[, 41:51], function(x){
  x[which(x %in% sample(x, .05*length(x)))] <- NA
  x
})

Appendix (see the comments on the latest answer):

d.huge <- read.csv("d.huge.csv")
d.huge.1 <- d.huge[sample(nrow(d.huge), 3/4*nrow(d.huge)), ]
d1 <- seqRank3(d.huge.1, 1.27e-1, 1.780e1)
d2 <- d1[complete.cases(d1), ]
dim(d2)
names(d2)
jay.sf
  • I might be daft, but why would you not pick X1,X2,X3 in your example data? I don't understand the rationale for picking, which tells me maybe I'm not understanding the ask right... – iod Dec 04 '18 at 13:41
  • No, you're absolutely right, I've edited the picking. However, `X1, X2, X3, X4` is better since the resulting sequence is longer. I need to balance the tradeoff: maximum sequence length for a minimum of rows lost to NAs. – jay.sf Dec 04 '18 at 13:50
  • I'm still confused. 2-3-4 has 4*3, just like 1-3-4 and 1-2-4. What's your rationale for picking any of them over 1-2-3, which is 6*3, and what's the rationale for picking 1-2-3-4 over that, since it's only 4*4. Is there a weighting involved? – iod Dec 04 '18 at 13:53
  • Ah, because 1-2-3-4, 1-2-3, 2-3-4 are continuous, I can't use gaps like 1-2-*-4. I've edited the question accordingly. – jay.sf Dec 04 '18 at 13:57
  • You have 2 conditions, but do not specify their order of precedence (if there is one) or how they are weighted (if they are). If we have 3 columns with 20 rows, or 4 columns with 19 rows, which should we prefer? How do we decide? – dww Dec 04 '18 at 14:02
  • @dww Since iod already mentioned weighting, I would weight the output in favour of sequences' lengths rather than number of rows. However 4x19 would be worse than 3x200. – jay.sf Dec 04 '18 at 14:09
  • You'll have to be more specific than that. – iod Dec 04 '18 at 14:17
  • @iod Ok, the aggregation in the second part of the function needs to be enhanced so that it only considers continuous sequences. Among those, sequences with more rows at equal sequence length are better and therefore rank higher. But isn't the calculation of subsets the bigger problem? – jay.sf Dec 04 '18 at 14:26

3 Answers


Convert to a matrix and calculate NA counts for each column:

dm <- is.na(d[, -1])
na_counts <- colSums(dm)
x <- data.frame(na_counts = na_counts, non_na_count = nrow(dm) - na_counts)
x <- as.matrix(x)

# create all combinations for column indexes:
nx <- 1:nrow(x)
psr <- do.call(c, lapply(seq_along(nx), combn, x = nx, simplify = FALSE))
# test if continuous:
good <- sapply(psr, function(y) !any(diff(sort.int(y)) != 1L))
psr <- psr[good] # remove non-continuous combinations
# for each combination count NAs and non-NAs:
s <- sapply(psr, function(y) colSums(x[y, , drop = FALSE]))

# put all together in table:
res <- data.frame(var_count = lengths(psr), t(s))
res$var_indexes <- sapply(psr, paste, collapse = ',')
res
#    var_count na_counts non_na_count var_indexes
# 1          1         1           10           1
# 2          1         3            8           2
# 3          1         5            6           3
# 4          1         7            4           4
# 5          1         9            2           5
# 6          2         4           18         1,2
# 7          2         8           14         2,3
# 8          2        12           10         3,4
# 9          2        16            6         4,5
# 10         3         9           24       1,2,3
# 11         3        15           18       2,3,4
# 12         3        21           12       3,4,5
# 13         4        16           28     1,2,3,4
# 14         4        24           20     2,3,4,5
# 15         5        25           30   1,2,3,4,5

# choose from res according to your criteria

As the variable indexes are already sorted, for speed we can simply use:

good <- sapply(psr, function(y) !any(diff(y) != 1L))
minem
  • @iod so we do not get a list of lists; there are probably different ways to achieve this. – minem Dec 04 '18 at 14:51
  • You've impressively increased speed about 500 times. However the output is not right. The code is counting NAs in columns and therefore accepts rows with NAs, but it shouldn't. – jay.sf Dec 04 '18 at 15:04
  • To be sure, the output of `seqRank2()` is perfectly right, only the non-continuous sequences have to be dropped from the output. In your new code the `non_na_count` output should be 10 for 1-2-3-4-5, 16 for 1-2-3-4, 18 for 1-2-3, and 12 for 2-3-4. – jay.sf Dec 04 '18 at 15:11

This takes less than one second on the huge data:

l1 = combn(2:length(d), 2, function(x) d[x[1]:x[2]], simplify = FALSE)
# If you also need "combinations" of only single columns, then uncomment the next line
# l1 = c(d[-1], l1)
l2 = sapply(l1, function(x) sum(complete.cases(x)))

score = sapply(1:length(l1), function(i) NCOL(l1[[i]]) * l2[i])
best_score = which.max(score)
best = l1[[best_score]]
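
On the toy data d, for instance, this should pick the X1:X3 block, since 3 columns times 6 complete rows gives the top score of 18:

names(best)
# [1] "X1" "X2" "X3"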

The question was unclear about how to rank the various combinations. We can use different scoring formulae to generate different preferences. For example, to weight the number of rows versus columns separately, we can do

col_weight = 2
row_weight = 1
score = sapply(1:length(l1), function(i) col_weight * NCOL(l1[[i]]) + row_weight * l2[i])
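
(One reason the weights can feel hard to tune is that the two terms live on very different scales: column counts up to about 17 versus complete-row counts up to about 34,000. A hedged variant that normalises both terms to [0, 1] before weighting:)

n_cols <- sapply(l1, NCOL)
score  <- col_weight * n_cols / max(n_cols) + row_weight * l2 / max(l2)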
dww
  • Thx! I've tested the code on real data and it seemingly does what I need in the blink of an eye, though I'm not sure: (1) I'm confused about the two sapply calls you wrote, the first is `NCOL * l2`, the second is `NCOL + l2`. (2) I played with the weights and they switch to favour rows or columns quite late. More precisely I found a (somewhat funny) threshold around row_weight:1.27e-1 / col_weight:1.780e1, but it switches only from *some* columns and *many* rows to *all* columns and *few* rows. Tested with [this data](https://www.dropbox.com/s/jmy60ftn3ko20go/d.huge.csv) w/ code in the appendix edit. – jay.sf Dec 05 '18 at 11:27
  • Glad that this worked well for generating the options. I also showed you how to implement *some* functions for ranking the options, even though you did not specify the ranking method you want to use. It is really outside the scope of SO to advise you on that here. If you don't know what criterion you should use to select the best combinations, https://stats.stackexchange.com/ may be a more appropriate forum to ask about that. But it will depend heavily on your specific use case, and is likely something you may have to figure out yourself – dww Dec 05 '18 at 21:16
  • Probably you're right. Your code has definitely pushed things forward. Thank you for putting work into this. – jay.sf Dec 07 '18 at 00:19

Just to clarify, the seqsubm function from TraMineR has no problem at all with NAs, nor with sequences of different lengths. However, the function expects a state sequence object (to be created with seqdef) as input.

The function seqsubm is for computing substitution costs (i.e. dissimilarities) between states by means of different methods. You probably refer to the method ('TRATE') that derives the costs from the observed transition probabilities, namely as 2 - p(i|j) - p(j|i), where p(i|j) is the probability of being in state i at time t when we were in state j at time t-1. So, all we need are the transition probabilities, which can easily be estimated from a set of sequences of different lengths or with gaps within them.

I illustrate below using the ex1 data that ships with TraMineR. (Due to the high number of different states in your toy example, the resulting matrix of substitution costs would be too large (28 x 28) for this illustration.)

library(TraMineR)
data(ex1)
sum(is.na(ex1))

# [1] 38

sq <- seqdef(ex1[1:13])
sq

#    Sequence                 
# s1 *-*-*-A-A-A-A-A-A-A-A-A-A
# s2 D-D-D-B-B-B-B-B-B-B      
# s3 *-D-D-D-D-D-D-D-D-D-D    
# s4 A-A-*-*-B-B-B-B-D-D      
# s5 A-*-A-A-A-A-*-A-A-A      
# s6 *-*-*-C-C-C-C-C-C-C      
# s7 *-*-*-*-*-*-*-*-*-*-*-*-*

sm <- seqsubm(sq, method='TRATE')
round(sm,digits=3)

#      A-> B->   C-> D->
# A->   0 2.000   2 2.000
# B->   2 0.000   2 1.823
# C->   2 2.000   0 2.000
# D->   2 1.823   2 0.000
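
To see how these costs follow from the observed transition rates, here is a quick check; it is only a sketch, assuming seqtrate() returns the matrix of transition probabilities p(j at t | i at t-1):

tr <- seqtrate(sq)            # observed transition probabilities
sm.manual <- 2 - tr - t(tr)   # 2 - p(i|j) - p(j|i)
diag(sm.manual) <- 0          # no cost for substituting a state with itself
round(sm.manual, digits = 3)  # should reproduce seqsubm(sq, method = 'TRATE') above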

Now, it is not clear to me what you want to do with the state dissimilarities. Inputting them into a clustering algorithm, you would cluster the states. If you want to cluster the sequences, then you should first compute dissimilarities between sequences (using seqdist and possibly passing the matrix of substitution costs returned by seqsubm as the sm argument) and then input the resulting distance matrix into the clustering algorithm.
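
For the latter route, a rough sketch of the pipeline, assuming seqsubm() and seqdist() both accept with.missing = TRUE so that the missing state is included in the cost matrix (parameters are purely illustrative):

sm2     <- seqsubm(sq, method = "TRATE", with.missing = TRUE)
dist.om <- seqdist(sq, method = "OM", indel = 1, sm = sm2, with.missing = TRUE)

library(cluster)
cl <- agnes(as.dist(dist.om), method = "ward")
plot(cl, which.plots = 2)     # dendrogram of the clustered sequences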

Gilbert
  • Thanks for this answer @Gilbert. I just wanted to ask that question, you beat me to it. The problem occurs exactly in the next step when I want to cluster the sequences, according to your example with `dist.om <- seqdist(sq, method="OM", indel=1, sm=sm)`. Even when I add the option `with.missing=TRUE`, `seqdist()` wants a 5x5 matrix rather than the 4x4 matrix that `seqsubm()` has given. I am sure it is worth asking this as another question, which I have done here: https://stackoverflow.com/q/53701467/6574038 – jay.sf Dec 10 '18 at 08:01
  • So see my [answer](https://stackoverflow.com/a/53705905/1586731) to this new question. – Gilbert Dec 10 '18 at 13:11