How can I find common phrases within a character vector? For example, I have the character vector below, and I want to determine that `Foo Bar 1` is common to elements 2 and 3 and `Foo Bar 2` is common to elements 4 and 5, without knowing upfront that I'm looking for `Foo Bar 1` and `Foo Bar 2`.
Sample input is
a <- c("index", "bla Foo Bar 1", "blah Foo Bar 1",
"blaa Foo Bar 2", "blahh Foo Bar 2")
and the desired output is something like
output <- list(`Foo Bar 1` = c(2, 3), `Foo Bar 2` = c(4, 5))
The output format can vary, but I'm looking for the common phrase and corresponding location in the original vector.
I'd like to match the longest common phrase (order matters), so in this case matching on `Foo Bar` alone is not desirable. Leading and trailing spaces can also be returned (I can strip them off later if necessary). In this example, I also would not want to match `Foo Bar 1a`, so we should assume that words are separated by spaces.
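To make the rule concrete, here is a rough brute-force sketch of the behaviour I'm after; it is only an illustration of the matching rule, not necessarily a good or efficient solution. The helper names `word_ngrams` and `common_phrases` are made up for this example. It assumes a non-empty vector, words separated by whitespace, and that "common" means "appears in at least two elements".

```r
a <- c("index", "bla Foo Bar 1", "blah Foo Bar 1",
       "blaa Foo Bar 2", "blahh Foo Bar 2")

# Hypothetical helper: all contiguous runs of whole words in one string,
# each returned as a space-joined phrase.
word_ngrams <- function(x) {
  w <- strsplit(trimws(x), "\\s+")[[1]]
  n <- length(w)
  out <- character(0)
  for (i in seq_len(n)) {
    for (j in i:n) {
      out <- c(out, paste(w[i:j], collapse = " "))
    }
  }
  unique(out)
}

# Hypothetical helper: phrases shared by at least two elements, keeping only
# those that are not a whole-word sub-phrase of a longer shared phrase.
common_phrases <- function(x) {
  grams <- lapply(x, word_ngrams)
  # Map each phrase to the indices of the elements that contain it.
  idx <- split(rep(seq_along(x), lengths(grams)), unlist(grams))
  idx <- idx[lengths(idx) >= 2]
  phrases <- names(idx)
  # Pad with spaces so that containment is checked on word boundaries.
  padded <- paste0(" ", phrases, " ")
  is_sub <- vapply(seq_along(phrases), function(i) {
    any(nchar(phrases) > nchar(phrases[i]) &
          grepl(padded[i], padded, fixed = TRUE))
  }, logical(1))
  idx[!is_sub]
}

res <- common_phrases(a)
```

Because the comparison is done on whole words, `Foo Bar 1` would not be matched against an element containing `Foo Bar 1a`. The number of phrases grows quadratically with the number of words per element, so this is only meant for short strings.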
My question is similar to this one asked earlier, although in my situation I have a single vector and want to match on complete words instead of characters.