
Goal: I want to find a way to group a character vector like:

x <- c("a800k blue 5", "a800j", "bb-blah5", "a800 7", "bb-blah2", "bb-blah3")

into groups defined by "lead matches": the smallest set of leading substrings such that a grep search on each one returns every element of its group. So the solution to the toy example above would be:

solution <- c("a800", "bb-blah")

because a grep search of x using the pattern "a800" would yield all 3 elements that start with "a800."

Note: I can make very few assumptions about the character strings in the vector. Their lengths will vary from just a few characters to quite long (possibly 10 or more), and they contain combinations of numbers, letters, spaces, and some special characters that make life very difficult.

So I would love a function that works something like intersect, maybe, but on each individual string. Any thoughts?

oguz ismail
  • Perhaps you find something useful here, at least to get you started: [Find common substrings between two character variables](https://stackoverflow.com/questions/16196327/find-common-substrings-between-two-character-variables), [longest common substring in R finding non-contiguous matches between the two strings](https://stackoverflow.com/questions/28261825/longest-common-substring-in-r-finding-non-contiguous-matches-between-the-two-str) – Henrik Dec 07 '17 at 21:44
  • [R implementation for Finding the longest common starting substrings in a set of strings](https://stackoverflow.com/questions/28273716/r-implementation-for-finding-the-longest-common-starting-substrings-in-a-set-of) – Henrik Dec 07 '17 at 21:46

1 Answer


There may be more elegant (efficient) ways, but this works on your sample strings. The longest common prefix is calculated (Biobase::lcPrefixC*) on pairwise combinations (combn) of strings.

# One-time install of Biobase from Bioconductor
source("https://bioconductor.org/biocLite.R")
biocLite("Biobase")
library(Biobase)  # provides lcPrefixC()

unique(combn(seq_along(x), 2, FUN = function(cm) lcPrefixC(x[cm])))
# [1] "a800"    ""        "bb-blah"
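If Biobase is not an option, the same pairwise idea can be sketched in base R. This is only a rough sketch, not a tested replacement for `lcPrefixC`: the helper name `lcp2` is made up for illustration, and it compares strings character by character, so it assumes the strings are plain (non-`NA`) character values.

```r
# Hypothetical helper (not from any package): longest common prefix of two strings
lcp2 <- function(a, b) {
  n <- min(nchar(a), nchar(b))
  if (n == 0) return("")
  # split the first n characters of each string into single characters
  ca <- substring(a, 1:n, 1:n)
  cb <- substring(b, 1:n, 1:n)
  mismatch <- which(ca != cb)
  # the prefix ends just before the first mismatch, or spans all n characters
  k <- if (length(mismatch)) mismatch[1] - 1 else n
  substr(a, 1, k)
}

x <- c("a800k blue 5", "a800j", "bb-blah5", "a800 7", "bb-blah2", "bb-blah3")

# longest common prefix of every pair, then drop duplicates and empty prefixes
prefixes <- unique(combn(seq_along(x), 2,
                         FUN = function(cm) lcp2(x[cm[1]], x[cm[2]])))
prefixes <- prefixes[nzchar(prefixes)]
prefixes
# [1] "a800"    "bb-blah"
```

Dropping `""` removes the pairs that share nothing. Note that `combn` compares every pair, so the cost grows quadratically with the length of `x`.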

It's unclear if you are trading between length of prefix and size of groups. For example:

x <- c("1234", "1235", "1245", "1246")

Would you rather have one larger group with a shorter prefix ("12"), or two smaller groups with longer prefixes ("123" and "124")?
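One way to make that choice explicit is a minimum prefix length. The sketch below is an assumption about how you might want to resolve the trade-off, not a definitive method: `lcp2`, `group_prefixes` and `min_len` are names invented here. It keeps pairwise prefixes of at least `min_len` characters and then drops any candidate that merely extends another, shorter candidate.

```r
# Hypothetical helper: longest common prefix of two strings (base R only)
lcp2 <- function(a, b) {
  n <- min(nchar(a), nchar(b))
  if (n == 0) return("")
  mismatch <- which(substring(a, 1:n, 1:n) != substring(b, 1:n, 1:n))
  substr(a, 1, if (length(mismatch)) mismatch[1] - 1 else n)
}

# Hypothetical wrapper: min_len is an assumed tuning knob, not part of the question
group_prefixes <- function(x, min_len = 1) {
  p <- unique(combn(seq_along(x), 2,
                    FUN = function(cm) lcp2(x[cm[1]], x[cm[2]])))
  p <- p[nchar(p) >= min_len]
  # a candidate is redundant if some other candidate is a prefix of it
  redundant <- vapply(p, function(q) {
    any(p != q & substring(q, 1, nchar(p)) == p)
  }, logical(1))
  p[!redundant]
}

y <- c("1234", "1235", "1245", "1246")
group_prefixes(y)               # one large group
# [1] "12"
group_prefixes(y, min_len = 3)  # two smaller groups
# [1] "123" "124"
```

With the default `min_len = 1` the shortest shared prefix wins; raising `min_len` splits the data into smaller groups with longer prefixes. Strings that share no qualifying prefix with any other string end up in no group.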


*"lcPrefixC is a faster implementation in C. [But note that] It only handles ascii characters". See also lcPrefix.


Henrik
  • I should mention that I'm under the constraint of using R 3.2.0, and cannot use most packages written since. Ideally, I would use base R. – D. Scedastic Dec 08 '17 at 13:16
  • OK, I see! So you didn't manage to run the code I proposed at all? Cheers. – Henrik Dec 08 '17 at 13:23