
Context

I asked this question recently:

Comparing partitions from split() using a nested for loop containing an if statement

where I needed to compare partitions generated by split() from a distance matrix using the code fix provided by @robertdj

set.seed(1234) # set random seed for reproducibility

# generate random normal variates
x <- rnorm(5)
y <- rnorm(5)

df <- data.frame(x, y) # merge vectors into dataframe
d <- dist(x) # generate distance matrix

splt <- split(d, 1:5) # split data with 5 values in each partition


for (i in 1:length(splt)) {
    for (j in 1:length(splt)) {
        if (i != j) {
            a <- length(which(splt[[i]] >= min(splt[[j]]))) / length(splt[[i]])
            b <- length(which(splt[[j]] <= max(splt[[i]]))) / length(splt[[j]])
        }
    }
}

I generated an MWE where each split contained the same number of elements. I did this purely for illustrative purposes, knowing full well that this would not necessarily hold for real data.

As per @Robert Hacken's comment, if I instead do

d <- na.omit(d[lower.tri(d)])

I get partitions of unequal length.
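To make that point concrete (a minimal sketch, not my real data): `d` is a `dist` object holding the 10 lower-triangle values as a vector, so `split(d, 1:5)` recycles the factor `1:5` over those 10 values and yields 2 values per partition, not 5:

```r
set.seed(1234)
x <- rnorm(5)
d <- dist(x)   # 'dist' object: the 10 lower-triangle distances as a vector

splt <- split(d, 1:5)
lengths(splt)  # 1:5 is recycled over the 10 values
# 1 2 3 4 5 
# 2 2 2 2 2 
```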

Real Data

However, my real data does not have this "same size" property: it contains many more partitions than the 5 in my MWE, and the partitions differ in length.

Here is my code

splt <- split(dist_matrix, sub("(?:(.*)\\|){2}(\\w+)\\|(\\w+)\\|.*?$", "\\1-\\2", colnames(dist_matrix))) 

The distance matrix dist_matrix contains FASTA headers from which I extract the species names.

I then use splt above in the doubly nested loop.

For instance, splt[[4]] contains 5 values, whereas splt[[10]] contains 9.

splt[[4]]
[1] 0.1316667 0.1383333 0.1166667 0.1333333 0.1216667

splt[[10]]
 [1] 0.1450000 0.1483333 0.1316667 0.1316667 0.1333333 0.1333333 0.1166667 0.1166667 0.1200000 

Expected Output

For my real problem, each partition corresponds to the distances from a single species to all other unique species. So, if Species X is represented by two DNA sequences and there are 10 species in total, the partition for Species X should contain 20 distances. However, I don't want the partition to include the distance between the two sequences for Species X itself.

splt would thus contain 10 partitions (each not necessarily of the same length), one for each species.

The expected output for a and b is a number between 0 and 1 inclusive. I think these values should be small in my real example, but they are large when I run my code, which I believe is a consequence of the warning below.

What I've Done

I've read on SO that %in% is typically used to resolve the warning

In splt[[i]] == splt[[j]] :
  longer object length is not a multiple of shorter object length

except in my case, I believe I would need `` `%notin%` <- Negate(`%in%`) ``.

However, %notin% gives the error in my original post

the condition has length > 1
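To illustrate with a toy example (made-up vectors, not my real data): `%notin%` is just a vectorised negation of `%in%`, so handing its result straight to `if()` is exactly what triggers that error; collapsing it with `all()` (or `any()`) avoids it:

```r
`%notin%` <- Negate(`%in%`)

c(1, 2, 3) %notin% c(2, 4)    # element-wise, length 3
# [1]  TRUE FALSE  TRUE

# if (c(1, 2, 3) %notin% c(2, 4)) { ... }  # error: the condition has length > 1
all(c(1, 3) %notin% c(2, 4))  # collapsing to a single logical works in if()
# [1] TRUE
```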

Question

How can my nested loop be altered to remove the warning?

compbiostats
    I'm confused by what you're eventually attempting to achieve. You create/overwrite the values of `a` and `b` when the `if` condition is true ... _every time_ it is true ... and you do nothing with it. This means a couple of things: (1) if the conditional is never true, then `a` and `b` are never defined, which is likely an error; (2) if it is true more than once, you don't know this, and you only know that the values of those variables were valid at some point during your comparisons. – r2evans Nov 25 '22 at 16:59
    It seems like the conditional statement `!(all(splt[[i]] == splt[[j]]))` is simply checking to ensure that `splt[[i]]` and `splt[[j]]` are not the same element of `splt`. You could replace this with `i != j`. – jdobres Nov 25 '22 at 16:59
    Regardless, it seems as if your assumption that your vector of values will always split into equal-length vectors is not safe. Logically, when comparing `splt[[4]]` and `splt[[10]]`, what is your expected output? Perhaps we need an outer comparison to be more resilient to different-length vectors. Please provide "expected output" given the process you hope to achieve and the previously-defined sample data. (Also, questions should generally be self-sufficient, please copy the generation of sample data from your previous question to this one.) Thanks! – r2evans Nov 25 '22 at 17:02
  • @r2evans Thanks. On it! Stay tuned. – compbiostats Nov 25 '22 at 17:08
  • @r2evans I've added more details and am happy to clarify further if needed. – compbiostats Nov 25 '22 at 17:32
    @compbiostats Are you sure what you are getting from `split` is what you want in the first place? Your comment speaks about 5 values in each partition in `splt` but there are only 2 values in each of the 5 elements of `splt`. This is because `d` is not a matrix but a `dist` object (the lower triangle of the distance matrix) and so it is treated as a vector of length 10 by `split`. – Robert Hacken Nov 25 '22 at 17:52
  • @RobertHacken Good point! If I do `d <- na.omit(d[lower.tri(d)])` I now get partitions of differing lengths, which is closer in structure to my real data – compbiostats Nov 25 '22 at 18:03
  • This has nothing to do with the function `split` – Onyambu Nov 25 '22 at 18:03
    @compbiostats Please, share at least part of your real data with us and the exact code you use with it. I am afraid you might be actually doing something else than you want to. – Robert Hacken Nov 25 '22 at 18:08
    *Again*: Please provide "expected output" given the process you hope to achieve and the previously-defined sample data. By this, I mean some _explicit object_, such as `c(0, 0.2, 0.5, 0.23412, 1)` or a matrix or something else. Base this on the data we have, the random data. This likely means you need to manually calculate what each of those values should be; even if you calculate half of them and provide it in the correct structure (and clearly identify which values are real, which are placeholders), it's a start. – r2evans Nov 25 '22 at 18:50

1 Answer


I'm going to go out on a limb by interpreting parts of what you say, discarding your code, and seeing what I can come up with. If nothing else, it may spark conversation about which of my interpretations are correct (and which are not).

Starting with the splt as generated by the random data, then replacing elements 4 and 5 with longer vectors,

set.seed(1234)
x <- rnorm(5)
y <- rnorm(5)
df <- data.frame(x, y)
d <- dist(x)
splt <- split(d, 1:5)
splt[[4]] <- rnorm(4)
splt[[5]] <- rnorm(10)

We have:

splt <- list("1" = c(1.48449499149608, 2.62312694474001), "2" = c(2.29150692606848, 0.15169544670039), "3" = c(1.13863195324393, 3.43013887931241), "4" = c(-0.477192699753547, -0.998386444859704, -0.77625389463799, 0.0644588172762693), "5" = c(-0.693720246937475, -1.44820491038647, 0.574755720900728, -1.02365572296388, -0.0151383003641817, -0.935948601168394, 1.10229754620026, -0.475593078869057, -0.709440037512506, -0.501258060594761))
splt
# $`1`
# [1] 1.484495 2.623127
# $`2`
# [1] 2.2915069 0.1516954
# $`3`
# [1] 1.138632 3.430139
# $`4`
# [1] -0.47719270 -0.99838644 -0.77625389  0.06445882
# $`5`
#  [1] -0.6937202 -1.4482049  0.5747557 -1.0236557 -0.0151383 -0.9359486  1.1022975 -0.4755931 -0.7094400 -0.5012581

You reference expressions like which(splt[[i]] >= min(splt[[j]])), which I'm interpreting to mean *"what is the ratio of splt[[i]] that is above the max value in splt[[j]]"*. Since we're comparing (for example) splt[[1]] with all of splt[[2]] through splt[[5]] here, and likewise for the others, we're going to have a square matrix where the diagonal is splt[[i]]-vs-splt[[i]] (likely not interesting).

Some quick math so we know what we should end up with:

splt[[1]]
# [1] 1.484495 2.623127
range(splt[[2]])
# [1] 0.1516954 2.2915069

Since 1 of [[1]]'s 2 values is greater than [[2]]'s max of 2.29, we expect 0.5 in a comparison between the two (for >= max(.)); similarly, none of [[1]] is below [[2]]'s min of 0.15, so we expect a 0 there.
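In code (s1 and s2 copied from splt[[1]] and splt[[2]] above so the chunk stands alone):

```r
s1 <- c(1.484495, 2.623127)    # splt[[1]]
s2 <- c(2.2915069, 0.1516954)  # splt[[2]]

sum(s1 >= max(s2)) / length(s1)  # one of two values exceeds 2.2915
# [1] 0.5
sum(s1 <= min(s2)) / length(s1)  # neither value is below 0.1517
# [1] 0
```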

Similarly, [[5]] over [[4]]:

splt[[5]]
#  [1] -0.6937202 -1.4482049  0.5747557 -1.0236557 -0.0151383 -0.9359486  1.1022975 -0.4755931 -0.7094400 -0.5012581
range(splt[[4]])
# [1] -0.99838644  0.06445882

### 2 of 10 are greater than the max
sum(splt[[5]] >= max(splt[[4]])) / length(splt[[5]])
# [1] 0.2

### 2 of 10 are less than or equal to the min
sum(splt[[5]] <= min(splt[[4]])) / length(splt[[5]])
# [1] 0.2

We can use outer, but sometimes that can be confusing, especially since in this case we'd need to Vectorize the anonymous function passed to it. I'll adapt your double-for-loop premise into nested sapply calls.
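For completeness, here is a rough sketch of the outer route (same splt as above, reproduced so the chunk stands alone; note the Vectorize wrapper and the argument order needed so rows index the list whose max we take and columns index the list being tested):

```r
splt <- list(
  "1" = c(1.48449499149608, 2.62312694474001),
  "2" = c(2.29150692606848, 0.15169544670039),
  "3" = c(1.13863195324393, 3.43013887931241),
  "4" = c(-0.477192699753547, -0.998386444859704, -0.77625389463799,
          0.0644588172762693),
  "5" = c(-0.693720246937475, -1.44820491038647, 0.574755720900728,
          -1.02365572296388, -0.0151383003641817, -0.935948601168394,
          1.10229754620026, -0.475593078869057, -0.709440037512506,
          -0.501258060594761)
)

# outer() needs a vectorized function; zi indexes the "max" list (rows),
# yi indexes the list being tested (columns)
gtmax <- outer(
  seq_along(splt), seq_along(splt),
  Vectorize(function(zi, yi) sum(splt[[yi]] >= max(splt[[zi]])) / length(splt[[yi]]))
)
gtmax[4, 5]  # [[5]] against the max of [[4]]
# [1] 0.2
```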

Greater than the other's max

sapply(splt, function(y) sapply(setNames(splt, paste0("max", seq_along(splt))), function(z) sum(y >= max(z)) / length(y)))
#        1   2   3    4   5
# max1 0.5 0.0 0.5 0.00 0.0
# max2 0.5 0.5 0.5 0.00 0.0
# max3 0.0 0.0 0.5 0.00 0.0
# max4 1.0 1.0 1.0 0.25 0.2
# max5 1.0 0.5 1.0 0.00 0.1

Interpretation and subset validation:

  • 1 with max of 2: comparing [[1]] (first column) with the max value from [[2]] (second row), half of 1's values are greater, so we have 0.5 (as expected).

  • 5 with max of 4: comparing [[5]] (fifth column) with the max value from [[4]] (fourth row), 0.2 meet the condition.

Less than the other's min

sapply(splt, function(y) sapply(setNames(splt, paste0("min", seq_along(splt))), function(z) sum(y <= min(z)) / length(y)))
#        1   2   3    4   5
# min1 0.5 0.5 0.5 1.00 1.0
# min2 0.0 0.5 0.0 1.00 0.8
# min3 0.0 0.5 0.5 1.00 1.0
# min4 0.0 0.0 0.0 0.25 0.2
# min5 0.0 0.0 0.0 0.00 0.1

Same two pairs:

  • 1 with min of 2 (row 2, column 1) is 0, as expected
  • 5 with min of 4 (row 4, column 5) is 0.2, as expected

Edit: @compbiostats pointed out that while sum(..) should produce the same results as length(which(..)), the latter may be more robust to missing data (e.g., NA values; cf. Difference between sum(), length(which()), and nrow() in R). For sum(..) to share that resilience, we should add na.rm=TRUE to both sum(.) and min(.) in the above calls. Thanks @compbiostats!
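A tiny sketch of that difference, on made-up vectors containing NA (hypothetical data, not from the question):

```r
y <- c(1, NA, 3)
z <- c(0, 2, NA)

length(which(y >= max(z, na.rm = TRUE))) / length(y)      # which() silently drops the NA
# [1] 0.3333333
sum(y >= max(z, na.rm = TRUE), na.rm = TRUE) / length(y)  # sum() needs na.rm=TRUE to match
# [1] 0.3333333
```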

r2evans
  • Thanks! Just seeing this now, so I will take some time to digest it. Hopefully it generates some conversation. – compbiostats Nov 26 '22 at 02:51
  • To be candid, the next bit of the conversation needs to come from you now. What in my assumptions is incorrect? Are you able to provide concrete expected output? – r2evans Nov 26 '22 at 03:21
  • This looks fine to me. I'll apply it to my own data and integrate into the pipeline I am currently working on with a colleague. Basically, these results will be outputted to a text file for downstream analysis. Once I do this, then I think I'll have a better idea regarding the desired output. – compbiostats Nov 27 '22 at 01:16
  • One small correction in your code: the anonymous functions should be `function(z) sum(y >= min(z)) / length(y)` and `function(z) sum(y <= max(z)) / length(y)`. You just had the min and max switched. – compbiostats Nov 28 '22 at 20:03
  • Yeah, missed that ... I think I projected an assumption from another topic I was working at the same time, thanks for seeing that and reporting back. Glad it helped get to a resolution! – r2evans Nov 28 '22 at 20:08
  • Thanks again. Is there any reason why you used `sum(...)` as opposed to `length(which(...))` in the `sapply` call? I don't see the equivalence. – compbiostats Dec 01 '22 at 19:28
  • They are functionally the same, I don't know when one will work when the other does not. Having said that, `length(which(..))` is two calls, `sum(..)` is just one call (with the same input); with "small" data (length < 1000), the latter is 3x faster; with larger data, the difference is less extreme but `sum` is still clearly faster. – r2evans Dec 01 '22 at 19:33
  • 1
    To add to my last comment, I found this SO post https://stackoverflow.com/questions/25065501/difference-between-sum-lengthwhich-and-nrow-in-r, which answers my question. – compbiostats Dec 01 '22 at 19:53