2

Is there an easy way to determine if one vector is nested within another? In other words, in the example below, each value of bar is associated with one and only one value of foo, so bar is nested within foo.

data.frame(foo=rep(seq(4), each=4), bar=rep(seq(8), each=2))

To clarify, here is the desired result:

foo <- rep(seq(4), each=4)
bar <- rep(seq(8), each=2)
qux <- rep(seq(8), times=2)
# using a fake operator for illustration:
bar %is_nested_in% foo  # should return TRUE
qux %is_nested_in% foo  # should return FALSE
Zheyuan Li
drammock
  • Do you need `!any(duplicated(rle(bar)$values)) & all(foo %in% rle(bar)$values )` – akrun Dec 30 '16 at 17:58
  • @akrun the first part (`!any(duplicated(rle(bar)$values))`) is a stronger constraint than I want. If `foo` were `c(1,1,1,1,2,2,2,2)` and `bar` were `c(1,2,1,2,3,4,3,4)` then `bar` would still be nested within `foo` – drammock Dec 30 '16 at 18:02

2 Answers

8

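Suppose you have two factors f and g, and want to know whether g is nested in f. (If your data are stored as numeric vectors, call factor() on them first; Method 1 in particular gives the wrong answer otherwise.)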

Method 1: For people who love linear algebra

Consider the design matrix for two factors:

Xf <- model.matrix(~ f + 0)
Xg <- model.matrix(~ g + 0)

If g is nested in f, then every level of f is a union of levels of g, so each column of Xf is a sum of columns of Xg, and the column space of Xf must be a subspace of the column space of Xg. In other words, for any linear combination of Xf's columns, y = Xf %*% bf, the equation Xg %*% bg = y can be solved exactly.

y <- Xf %*% rnorm(ncol(Xf))  ## a random linear combination of `Xf`'s columns
c(crossprod(round(.lm.fit(Xg, y)$residuals, 8)))  ## residual sum of squares, after rounding away floating-point noise
## if this is 0, you have nesting.
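As a rough illustration, the steps above can be wrapped in a small function and run on the vectors from the question. The name `nested_by_lm` and its `digits` argument are hypothetical, introduced only for this sketch; the inputs are converted to factors first, since `model.matrix` would otherwise treat a numeric vector as a single numeric column.

```r
# Data from the question, converted to factors
f <- factor(rep(seq(4), each = 4))
g <- factor(rep(seq(8), each = 2))    # nested in f
h <- factor(rep(seq(8), times = 2))   # not nested in f

# Hypothetical helper: is `g` nested in `f`?
nested_by_lm <- function(g, f, digits = 8) {
  Xf <- model.matrix(~ f + 0)            # one indicator column per level of f
  Xg <- model.matrix(~ g + 0)            # one indicator column per level of g
  y <- Xf %*% rnorm(ncol(Xf))            # random point in the column space of Xf
  res <- .lm.fit(Xg, y)$residuals        # least squares residuals of y on Xg
  c(crossprod(round(res, digits))) == 0  # zero residual sum of squares => nesting
}

set.seed(1)          # only to make the random draw reproducible
nested_by_lm(g, f)   # TRUE
nested_by_lm(h, f)   # FALSE
```

Because `y` is random, a non-nested pair could in principle produce a zero residual by coincidence, but that happens with probability zero for a continuous draw.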

Method 2: For people who love statistics

Check the contingency table:

M <- table(f, g)

If every column has exactly one non-zero entry, then each level of g occurs with only one level of f, so g is nested in f. In other words:

all(colSums(M > 0L) == 1L)
## `TRUE` if you have nesting

Comment: With either method, you can easily squeeze the code into one line.
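For instance, a one-line version of Method 2 might be bound to the `%is_nested_in%` operator that the question uses for illustration. This is only a sketch, not part of the original answer; it works directly on numeric vectors because `table` treats them as categorical.

```r
# Sketch: Method 2 as a binary operator (name taken from the question)
`%is_nested_in%` <- function(g, f) all(colSums(table(f, g) > 0L) == 1L)

foo <- rep(seq(4), each = 4)
bar <- rep(seq(8), each = 2)
qux <- rep(seq(8), times = 2)

bar %is_nested_in% foo  # TRUE
qux %is_nested_in% foo  # FALSE

# Non-contiguous case from the comments on the question: still nested
c(1, 2, 1, 2, 3, 4, 3, 4) %is_nested_in% c(1, 1, 1, 1, 2, 2, 2, 2)  # TRUE
```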

Zheyuan Li
  • interesting approach, and I can see why it would work, though it seems like a bit of overkill for this specific problem. – drammock Dec 30 '16 at 18:27
  • your edited answer is much improved, and does a better job of explaining why it works. I still feel that the linear-algebra-based solution is overkill (i.e., the intermediate steps of generating a random `y` and solving a linear model), and it's easy to get the wrong answer if your data are numeric and you forget to call `factor` before generating the model matrix. But the contingency table approach you've added is nice and succinct and works with numeric vectors. – drammock Jan 06 '17 at 00:28
  • For factors with many levels, the contingency table is large and maybe sparse. So `M <- xtabs(~ f + g, sparse = TRUE); all(Matrix::colSums(M > 0) == 1L)` should be used. – Zheyuan Li Sep 12 '18 at 19:56
1

I think this will work:

nested_in <- function(b, a) {
    df <- data.frame(a, b)
    all(sapply(split(df, df$b), function(i) length(unique(i$a)) < 2))
}

foo <- rep(seq(4), each=4)
bar <- rep(seq(8), each=2)
qux <- rep(seq(8), times=2)    

nested_in(bar, foo)  # TRUE
nested_in(qux, foo)  # FALSE
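
The same idea can probably be written without building an intermediate data frame, by splitting `a` directly on `b`. The following is a sketch of that variant, not the code above; the name `nested_in2` is hypothetical, and `drop = TRUE` is an added choice so that unused factor levels are ignored.

```r
# Hypothetical variant: is `b` nested in `a`?
nested_in2 <- function(b, a) {
  groups <- split(a, b, drop = TRUE)  # values of `a` observed for each value of `b`
  # nested if every group contains exactly one distinct value of `a`
  all(vapply(groups, function(x) length(unique(x)) == 1L, logical(1)))
}

foo <- rep(seq(4), each = 4)
bar <- rep(seq(8), each = 2)
qux <- rep(seq(8), times = 2)

nested_in2(bar, foo)  # TRUE
nested_in2(qux, foo)  # FALSE
```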
drammock