Subset R list by partial string match of sublist element against character vector, using base R

Question

My actual case is a list whose elements are sub-lists, each combining a header string with its corresponding data. I wish to subset the list, keeping the same structure, so that it contains only the sub-lists whose header string contains at least one of the strings in a character vector.

Test Data:

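lets <- letters
x <- c(1, 4, 8, 11, 13, 14, 18, 22, 24)
# rnd was not defined in the original post; 1:9 is assumed here, which is
# consistent with the output shown in the second answer
rnd <- 1:9

ls <- list()
for (i in 1:9) {
  ls[[i]] <- list(hdr = paste(lets[x[i]:(x[i]+2)], collapse=""),
                  data = seq(1, rnd[i]))
}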

filt <- c("bc", "lm", "rs", "xy")

To produce a result list, as returned by:

logical_match <- c(TRUE, FALSE, FALSE, TRUE, FALSE, FALSE, TRUE, FALSE, TRUE)
ls_result <- ls[logical_match]

So the function I seek is: ls_result <- fn(ls, filt)

I've looked at: subset list by dataframe; partial match with %in%; nested sublist by condition; subset list by logical condition; and, my favorite, extract sublist elements to array, which uses some neat purrr and dplyr solutions. Unfortunately these aren't viable, as I'm looking for a base R solution to make deployment more straightforward (I'd welcome solutions using extension packages, for interest, of course).

I'm guessing some variation of logical_match <- lapply(ls, fn, '$hdr', filt) is where I'm heading; I started with pmatch(), and wondered how to incorporate grep, but I'm struggling to see how to generate the logical_match vector.

Can someone set me on the right track, please?

EDIT: when agrepl() is applied to the real data, this becomes trickier. The header string, hdr, is typically about 255 characters long, whilst a string element of the filter vector, filt, is of the order of 16 characters. The default agrepl() max.distance argument of 0.1 needs to be adjusted to somewhere between 0.94 and 0.96 for the example below, which is a pretty tight range. Even if I use the lower end of this range and apply it to the ~360 list elements, the function returns a load of total non-matches.

> hdr <- "#CCHANNELSDI12-PTx|*|CCHANNELNO2|*|CDASA1570|*|CDASANAMEShenachieBU_1570|*|CTAGSODATSID|*|CTAGKEYWISKI_LIVE,ShenachieBU_1570,SDI12-PTx,Highres|*|LAYOUT(timestamp,value)|*|RINVAL-777|*|RSTATEW6|*|RTIMELVLhigh-resolution|*|TZEtc/GMT|*|ZDATE20210110130805|*|"

> filt <- c("ShenachieBU_1570", "Pitlochry_4056")

> agrepl(hdr, filt, max.distance = 0.94)
[1]  TRUE FALSE

score 2 · Accepted Answer · answered Jan 13 '21 at 19:58

You could do:

Filter(function(x)any(agrepl(x$hdr,filt)), ls)

You could reduce the code to:

Filter(function(x)grepl(paste0(filt, collapse = "|"), x$hdr), ls)
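For the fn(ls, filt) form asked for in the question, the same idea can be wrapped in a small base R function. This is a sketch rather than a definitive implementation: the name subset_by_hdr is hypothetical, the test list is rebuilt with data = seq_len(i) as an assumption, and fixed = TRUE assumes the filter strings should be matched literally rather than as regular expressions.

```r
# Rebuild the question's test data (data = seq_len(i) is an assumption)
x <- c(1, 4, 8, 11, 13, 14, 18, 22, 24)
ls <- lapply(seq_along(x), function(i) {
  list(hdr = paste(letters[x[i]:(x[i] + 2)], collapse = ""),
       data = seq_len(i))
})
filt <- c("bc", "lm", "rs", "xy")

# Hypothetical helper: keep sub-lists whose hdr contains any filter string
subset_by_hdr <- function(lst, patterns) {
  logical_match <- vapply(lst, function(el) {
    # One literal substring test per filter string, combined with any()
    any(vapply(patterns, function(p) grepl(p, el$hdr, fixed = TRUE),
               logical(1)))
  }, logical(1))
  lst[logical_match]
}

ls_result <- subset_by_hdr(ls, filt)
# Headers of the retained sub-lists
vapply(ls_result, function(el) el$hdr, character(1))
```

Looping over the filter strings, rather than collapsing them with "|", avoids surprises if a filter string ever contains a regular-expression metacharacter.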

answered Jan 13 '21 at 19:58

Onyambu


Neat! _agrepl()_ is pretty arcane. So why would that be `agrepl(x$hdr,filt)` rather than `agrepl(filt, x$hdr)`? _filt_ is the pattern to match, and _x$hdr_ is the 'character vector where matches are sought'. If I reverse the arguments, which would appear to follow the documentation, it generates the error: _argument 'pattern' has length > 1 and only the first element will be used_ – jack_sprat Jan 13 '21 at 21:30
re. `Filter(function(x)any(agrepl(x$hdr,filt)), ls)` so _any()_ is like a vector **or**; Semantically: Filter _ls_ by any _hdr_ value fuzzy match with _filt_ – jack_sprat Jan 13 '21 at 21:39
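On the argument order raised in these comments: agrepl() and grepl() take a single pattern as their first argument, so one way to keep each _filt_ string as the pattern is to loop over the filter vector and combine the results with any(). A minimal base R sketch, assuming exact substring matching is acceptable (grepl() with fixed = TRUE instead of fuzzy matching); the header below is a shortened, illustrative version of the one in the question's EDIT:

```r
# Shortened, illustrative header (not the full string from the question)
hdr <- "#CCHANNELSDI12-PTx|*|CDASANAMEShenachieBU_1570|*|TZEtc/GMT|*|"
filt <- c("ShenachieBU_1570", "Pitlochry_4056")

# One test per filter string: each filter is the pattern, hdr is the text searched
hits <- vapply(filt, function(p) grepl(p, hdr, fixed = TRUE), logical(1))
hits

# any() collapses the per-filter results into a single TRUE/FALSE for this header
any(hits)
```

Per the agrepl() documentation, a fractional max.distance is taken relative to the pattern length. With `agrepl(hdr, filt)` the long header is the pattern, which may be why values near 0.94 were needed and why unrelated headers then matched as well.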

score 0 · Answer 2 · answered Jan 13 '21 at 20:01

We can also do

library(purrr)
library(stringr)
keep(ls, ~ str_detect(.x$hdr, str_c(filt, collapse = "|")))

-output

#[[1]]
#[[1]]$hdr
#[1] "abc"

#[[1]]$data
#[1] 1


#[[2]]
#[[2]]$hdr
#[1] "klm"

#[[2]]$data
#[1] 1 2 3 4


#[[3]]
#[[3]]$hdr
#[1] "rst"

#[[3]]$data
#[1] 1 2 3 4 5 6 7


#[[4]]
#[[4]]$hdr
#[1] "xyz"

#[[4]]$data
#[1] 1 2 3 4 5 6 7 8 9

answered Jan 13 '21 at 20:01

akrun


Tidy! "str_detect(string, pattern, negate = FALSE) : Vectorised over string and pattern"; so why do we need to collapse _filt_ vector to a single string? – jack_sprat Jan 13 '21 at 21:45
(from doc'n) `keep()` is similar to `Filter()`, but the argument order is more convenient, and the evaluation of the predicate function .p is stricter. I like this solution, but it's not my chosen solution, just because it uses extension libraries beyond base R. Thank you. – jack_sprat Jan 13 '21 at 21:48
@jack_sprat regarding your query, the comparison is element-wise, so string and pattern are expected to have the same length (a shorter pattern vector is recycled) – akrun Jan 13 '21 at 21:57
re. equal string & pattern length: thanks - that isn't apparent to me from the documentation! – jack_sprat Jan 13 '21 at 22:01
@jack_sprat you can try `v1 <- c("st1", "st2", "r12", "st1"); pat <- c("st", "r1"); str_detect(v1, pat)#[1] TRUE FALSE FALSE FALSE` whereas if it is pasted, i.e. `str_detect(v1, str_c(pat, collapse="|"))#[1] TRUE TRUE TRUE TRUE`. In the first case it is an element-wise comparison, so the shorter pattern vector has to be recycled, i.e. the 1st pattern is compared to the 1st element of v1, the 2nd pattern to the 2nd element, then the 1st pattern again to the 3rd element and the 2nd pattern to the 4th – akrun Jan 13 '21 at 22:04
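The recycling behaviour described in this comment can be reproduced in base R, which may make the difference easier to see. A sketch, assuming mapply() as a stand-in for the element-wise comparison (it is not claimed to be what str_detect() does internally):

```r
v1 <- c("st1", "st2", "r12", "st1")
pat <- c("st", "r1")

# Element-wise: pat is recycled to c("st", "r1", "st", "r1") and paired with v1
elementwise <- mapply(grepl, pat, v1, USE.NAMES = FALSE)
elementwise

# Collapsed: the single pattern "st|r1" is tested against every element of v1
collapsed <- grepl(paste(pat, collapse = "|"), v1)
collapsed
```

Only the collapsed form asks "does this element contain any of the patterns?", which is the question being put to each header.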
