Grouping elements of a character vector by the first n letters that match each other

Question

Goal: I want to find a way to group a character vector like:

x <- c("a800k blue 5", "a800j", "bb-blah5", "a800 7", "bb-blah2", "bb-blah3")

into groups defined by "lead matches": the smallest set of leading substrings such that a grep search on each one returns every element of its group. So the solution to the toy example above would be:

solution <- c("a800", "bb-blah")

because a grep search of x using the pattern "a800" would return all 3 elements that start with "a800".
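As a minimal sketch of the search described above: the `fixed = TRUE` argument is an assumption added here so that any special characters in a pattern are matched literally rather than interpreted as a regular expression.

```r
x <- c("a800k blue 5", "a800j", "bb-blah5", "a800 7", "bb-blah2", "bb-blah3")

# fixed = TRUE treats the pattern as a literal string, not a regular expression
hits <- grep("a800", x, fixed = TRUE, value = TRUE)
hits
```

With the toy vector, `hits` holds the three elements that contain "a800".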

Note: I can make very few assumptions about the character strings in the vector. Their lengths will vary from just a few characters to quite long (possibly 10 or more), and they will contain combinations of numbers, letters, spaces, and some special characters that make life very difficult.

So I would love a function that works something like intersect, maybe, but on each individual string. Any thoughts?

Perhaps you find something useful here, at least to get you started: [Find common substrings between two character variables](https://stackoverflow.com/questions/16196327/find-common-substrings-between-two-character-variables), [longest common substring in R finding non-contiguous matches between the two strings](https://stackoverflow.com/questions/28261825/longest-common-substring-in-r-finding-non-contiguous-matches-between-the-two-str) — Henrik, Dec 07 '17 at 21:44
[R implementation for Finding the longest common starting substrings in a set of strings](https://stackoverflow.com/questions/28273716/r-implementation-for-finding-the-longest-common-starting-substrings-in-a-set-of) — Henrik, Dec 07 '17 at 21:46

Henrik · Answer 1 · 2017-12-07T23:55:22.610

There may be more elegant (or more efficient) ways, but this works on your sample strings. The longest common prefix (Biobase::lcPrefixC*) is calculated for each pairwise combination (combn) of strings. Pairs with no common prefix return "", which you can drop from the result.

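# Install Biobase from Bioconductor (on R >= 3.5, use BiocManager::install("Biobase") instead)
source("https://bioconductor.org/biocLite.R")
biocLite("Biobase")
library(Biobase)  # makes lcPrefixC() available

# Longest common prefix of every pair of strings, de-duplicated
unique(combn(seq_along(x), 2, FUN = function(cm) lcPrefixC(x[cm])))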
# [1] "a800"    ""        "bb-blah"
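If Biobase is not an option (for example, when restricted to base R), the same pairwise idea can be sketched without any packages. The helper name `lcp_pair` is hypothetical and not part of Biobase; this is an illustration under that assumption, not a drop-in replacement for lcPrefixC.

```r
# Hypothetical base-R helper: longest common prefix of two strings.
lcp_pair <- function(a, b) {
  n <- min(nchar(a), nchar(b))
  if (n == 0) return("")
  # Compare the prefixes of length 1..n; once they differ they stay different,
  # so the number of TRUE values equals the length of the common prefix.
  same <- substring(a, 1, seq_len(n)) == substring(b, 1, seq_len(n))
  substr(a, 1, sum(same))
}

x <- c("a800k blue 5", "a800j", "bb-blah5", "a800 7", "bb-blah2", "bb-blah3")

# Longest common prefix of every pair, de-duplicated
prefixes <- unique(combn(seq_along(x), 2,
                         FUN = function(cm) lcp_pair(x[cm[1]], x[cm[2]])))

# Drop the empty prefix produced by pairs that share nothing
prefixes <- prefixes[nzchar(prefixes)]
prefixes
```

On the sample vector this leaves "a800" and "bb-blah". Like the combn approach above, it compares every pair, so the work grows quadratically with the length of x.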

It's unclear how you want to trade off prefix length against group size. For example:

x <- c("1234", "1235", "1245", "1246")

Would you rather have one larger group with a shorter prefix ("12"), or two smaller groups with longer prefixes ("123" and "124")?
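To make the trade-off concrete, here is a small base-R sketch that lists the pairwise prefixes for this vector together with the number of strings each one covers. The helper `lcp_pair` is hypothetical and used only for illustration.

```r
# Hypothetical base-R helper: longest common prefix of two strings.
lcp_pair <- function(a, b) {
  n <- min(nchar(a), nchar(b))
  if (n == 0) return("")
  same <- substring(a, 1, seq_len(n)) == substring(b, 1, seq_len(n))
  substr(a, 1, sum(same))
}

x <- c("1234", "1235", "1245", "1246")

# All distinct pairwise prefixes
prefixes <- unique(combn(seq_along(x), 2,
                         FUN = function(cm) lcp_pair(x[cm[1]], x[cm[2]])))

# Number of strings in x that start with each candidate prefix
group_size <- sapply(prefixes, function(p) sum(substr(x, 1, nchar(p)) == p))
group_size
```

Here the pairwise approach returns all three candidates: "12" covers all four strings, while "123" and "124" cover two each. Which of them to keep depends on the rule you choose.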

*"lcPrefixC is a faster implementation in C. [But note that] It only handles ascii characters". See also lcPrefix.

edited Dec 07 '17 at 23:55

answered Dec 07 '17 at 23:50

Henrik


I should mention that I'm under the constraint of using R 3.2.0, and cannot use most packages written since. Ideally, I would use base R. – D. Scedastic Dec 08 '17 at 13:16
OK, I see! So you didn't manage to run the code I proposed at all? Cheers. – Henrik Dec 08 '17 at 13:23
