1

How can I find a common phrase(s) within a character vector. For example, I have the character vector below, and I want to determine that Foo Bar 1 is common for 2 & 3 and Foo Bar 2 is common for 4 & 5, not knowing upfront that I'm looking for Foo Bar 1 and Foo Bar 2.

Sample input is

a <- c("index", "bla Foo Bar 1", "blah Foo Bar 1",
       "blaa Foo Bar 2", "blahh Foo Bar 2")

and the desired output is something like

output <- list(`Foo Bar 1` = c(2, 3), `Foo Bar 2` = c(4, 5))

The output format can vary, but I'm looking for the common phrase and corresponding location in the original vector.

I'd like to match the longest common phrase (order matters), so in this case matching on Foo Bar alone is not desirable. Beginning and ending spaces can also be returned (I can strip off later is necessary). In this example, I also would not want to match Foo Bar 1a, so we should assume that words are separated by spaces.

My question is similar to this one asked earlier, although in my situation I have a single vector and want to match on complete words instead of characters.

Community
  • 1
  • 1
user338714
  • 2,315
  • 5
  • 27
  • 36
  • You might want to provide some sample output. It's not clear to me what you're expecting for the output here. – Dason Aug 20 '14 at 19:51
  • I added more details, thanks for the suggestion. – user338714 Aug 20 '14 at 19:56
  • Be careful about `:` because in R this is a range generating function. Maybe you would rather have `c(2,3)` as common instead of `2:3`. – pbible Aug 20 '14 at 19:58
  • @user338714 And what is the logic that says `Foo Bar 1` should be a matching term but not `Foo Bar` which would match 2, 3, 4, 5. Why not "Foo" or "Bar" on their own? If there was something that contained "Foo Bar 10d" should that also match "Foo Bar 1" (Foo Bar 1 is a substring of Foo Bar 10d). Please add even more details. – Dason Aug 20 '14 at 20:00
  • Another good suggestion. – user338714 Aug 20 '14 at 20:00
  • @pbible Your suggestion didn't really matter here since those create the exact same vector... – Dason Aug 20 '14 at 20:01
  • @Dason, I guess he wants the longest match – David Arenburg Aug 20 '14 at 20:02
  • @Dason I know. But there is no guarantee that duplicates have to be consecutive right? I'm just making sure the OP knows the difference. – pbible Aug 20 '14 at 20:04
  • Your challenge is called the "Wiki: Longest common substring problem": http://stackoverflow.com/questions/16196327/find-common-substrings-between-two-character-variables – Vlo Aug 20 '14 at 20:07

1 Answers1

1

Here's something that could work. First, sample input

#sample input
a <- c("index", "bla Foo Bar 1", "blah Foo Bar 1",
       "blaa Foo Bar 2", "blahh Foo Bar 2")

then a function to break up string into phrases

allcomp <- function(x) {
    unname(do.call(c, lapply(1:length(x), 
        function(z) lapply(1:(length(x)-z+1), 
            function(i) x[i:(i+z-1)]))))
}

getPhrases <- function(a) {
    parts <- Map(allcomp, strsplit(a, " "))
    vals <- sapply(parts, sapply, paste0, collapse=" ")
    len <- sapply(parts, sapply, length)
    do.call(rbind, Map(cbind.data.frame, i=seq_along(a), v=vals, l=len))
}

Now we can tabulate

pp<-getPhrases(a)

tbl<-Map(function(x) list(ids=unique(x$i), len=max(x$l), 
    cnt=length(unique(x$i))), split(pp, pp$v))

maxlen<-max(sapply(tbl, `[[`,  "len") * as.numeric(sapply(tbl, `[[`,  "cnt")>1))
maxcnt<-max(sapply(tbl[sapply(tbl, `[[`,  "len")==maxlen], `[[`, "cnt"))


Map(function(x) x$ids, tbl[sapply(tbl, `[[`,  "len")==maxlen & 
    sapply(tbl, `[[`,  "cnt")==maxcnt])

which returns

$`Foo Bar 1`
[1] 2 3

$`Foo Bar 2`
[1] 4 5
MrFlick
  • 195,160
  • 17
  • 277
  • 295
  • I'm running into an error with allcomp. Do I need to load a library or define another variable? Thanks. – user338714 Aug 20 '14 at 20:58
  • @user338714 Sorry, I forgot I was using that function. I've edited my post to include the definition. – MrFlick Aug 20 '14 at 21:05
  • This worked very well, thanks. I didn't realize it earlier, but I actually wanted to filter by max match count instead of max word count. Swapping the maxlen and maxcnt calcs did the trick. – user338714 Aug 21 '14 at 12:13