Splitting merged words (with mini-dictionary)

Question

I have a set of words, some of which are merged terms and others just simple words. I also have a separate list of words that I want to use as a dictionary against the first list in order to 'un-merge' certain words.

Here's an example:

ListA <- c("dopamine", "andthe", "lowerswim", "other", "different")
ListB <- c("do", "mine", "and", "the", "lower", "owe", "swim")

My general procedure would be something like this:

1. Search for two patterns from ListB that occur back to back in a word in ListA and together make up the whole word (no spare letters). For example, 'lowerswim' from ListA would match 'lower' and 'swim', not 'owe' and 'swim'.
2. For each word selected this way, check whether the word itself exists in ListB. If it does, keep it in ListA as it is. Otherwise, split it into the two matched words from ListB.

Does this sound sensible? And if so, how do I implement it in R? Maybe it sounds quite routine but at the moment I'm having trouble with:

- Searching for words inside words. I can match whole words between lists without any problem, but I'm not sure how to use grep or an equivalent to go further than that.
- Declaring that the words must be consecutive. I've been thinking about this for a while, but nothing I have tried has worked.

Can anyone please send me in the right direction?

Would some of the helper functions in `stringr` be of help? http://journal.r-project.org/archive/2010-2/RJournal_2010-2_Wickham.pdf I think a few of them would get you going pretty quickly. — hrbrmstr, Mar 09 '14 at 20:25
@hrbrmstr I wasn't aware of the `stringr` package - I shall investigate now! Thank you for your suggestion. — user1988898, Mar 09 '14 at 20:27

score 3 · Accepted Answer · answered Mar 09 '14 at 21:23

I think the first step would be to build all the combined pairs from ListB:

pairings <- expand.grid(ListB, ListB)
combos <- apply(pairings, 1, function(x) paste0(x[1], x[2]))
combos
#  [1] "dodo"       "minedo"     "anddo"      "thedo"      "lowerdo"    "owedo"      "swimdo"    
#  [8] "domine"     "minemine"   "andmine"    "themine"    "lowermine"  "owemine"    "swimmine"  
# [15] "doand"      "mineand"    "andand"     "theand"     "lowerand"   "oweand"     "swimand"   
# [22] "dothe"      "minethe"    "andthe"     "thethe"     "lowerthe"   "owethe"     "swimthe"   
# [29] "dolower"    "minelower"  "andlower"   "thelower"   "lowerlower" "owelower"   "swimlower" 
# [36] "doowe"      "mineowe"    "andowe"     "theowe"     "lowerowe"   "oweowe"     "swimowe"   
# [43] "doswim"     "mineswim"   "andswim"    "theswim"    "lowerswim"  "oweswim"    "swimswim"
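As a side note, the same combinations can probably be built without apply. The following is a minimal sketch, assuming the ListB from the question; the column names `first` and `second` are illustrative choices, not part of the original code. Setting `stringsAsFactors = FALSE` keeps the columns as plain character vectors, so `paste0` can work on them directly.

```r
ListB <- c("do", "mine", "and", "the", "lower", "owe", "swim")

# All ordered pairs of dictionary words; the first column varies fastest
pairings <- expand.grid(first = ListB, second = ListB,
                        stringsAsFactors = FALSE)

# paste0 is vectorised, so no row-wise apply is needed
combos <- paste0(pairings$first, pairings$second)

length(combos)
# [1] 49
combos[24]
# [1] "andthe"
```

The order of `combos` is the same as in the apply version above, because expand.grid is called with the same arguments.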

You can use str_extract from the stringr package to extract the element of combos that is contained within each element of ListA, if such an element exists:

library(stringr)
matches <- str_extract(ListA, paste(combos, collapse="|"))
matches
# [1] NA          "andthe"    "lowerswim" NA          NA
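Note that str_extract looks for a combination anywhere inside the word, so a word with spare letters around a combination would still produce a match. If the "no spare letters" requirement from the question should be enforced strictly, one option is to compare whole words against `combos` instead. This is a sketch in base R, not the approach used above; the word "handthe" is a made-up example and does not come from the question.

```r
ListA <- c("dopamine", "andthe", "lowerswim", "other", "different")
ListB <- c("do", "mine", "and", "the", "lower", "owe", "swim")

pairings <- expand.grid(ListB, ListB, stringsAsFactors = FALSE)
combos <- paste0(pairings$Var1, pairings$Var2)

# match() compares whole strings, so only exact concatenations count
hit <- match(ListA, combos)
exact_matches <- combos[hit]
exact_matches
# [1] NA          "andthe"    "lowerswim" NA          NA

# Hypothetical word with a spare leading letter:
# the regex search finds "andthe" inside it, the exact match does not
grepl(paste(combos, collapse = "|"), "handthe")
# [1] TRUE
match("handthe", combos)
# [1] NA
```

For the example data both approaches give the same result. They only differ for words that contain a combination plus extra letters.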

Finally, you want to split the words in ListA that matched a pair of elements from ListB, unless this word is already in ListB. I suppose there are lots of ways to do this, but I'll use lapply and unlist:

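# Walk over ListA: keep unmatched words and words already in ListB,
# otherwise replace the word with its two component words
newA <- unlist(lapply(seq_along(ListA), function(idx) {
  # || is the scalar, short-circuiting "or", which suits a length-one if() condition
  if (is.na(matches[idx]) || ListA[idx] %in% ListB) {
    return(ListA[idx])
  } else {
    # Look up the row of pairings whose concatenation produced the match
    return(as.vector(as.matrix(pairings[combos == matches[idx], ])))
  }
}))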
newA
# [1] "dopamine"  "and"       "the"       "lower"     "swim"      "other"     "different"
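If this needs to be reused, the steps could be wrapped into a single function. The sketch below is one possible way to do that in base R; the function name `split_merged` and its argument names are hypothetical. It assumes that a word should be split only when it is exactly the concatenation of two dictionary words, and if several pairs produce the same concatenation it takes the first one found.

```r
# Hypothetical helper: split words that are exact concatenations
# of two dictionary entries, leaving all other words untouched
split_merged <- function(words, dictionary) {
  pairings <- expand.grid(first = dictionary, second = dictionary,
                          stringsAsFactors = FALSE)
  combos <- paste0(pairings$first, pairings$second)

  # Row index of the matching pair, or NA if the word is not a combination
  hit <- match(words, combos)

  unlist(lapply(seq_along(words), function(i) {
    if (is.na(hit[i]) || words[i] %in% dictionary) {
      words[i]                                        # keep the word as is
    } else {
      c(pairings$first[hit[i]], pairings$second[hit[i]])  # split in two
    }
  }))
}

ListA <- c("dopamine", "andthe", "lowerswim", "other", "different")
ListB <- c("do", "mine", "and", "the", "lower", "owe", "swim")

result <- split_merged(ListA, ListB)
result
# [1] "dopamine"  "and"       "the"       "lower"     "swim"      "other"     "different"
```

The number of combinations grows with the square of the dictionary size, so this is probably only practical for a small dictionary like the one in the question.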

This is perfect. I spent a ridiculous amount of time working with `stringr` to try and get something to work and then I come back and you've produced this. — user1988898, Mar 10 '14 at 05:36
