I have a set of words: some of which are merged terms, and others that are just simple words. I also have a separate list of words that I am going to use to compare with my first list (as a dictionary) in order to 'un-merge' certain words.
Here's an example:
ListA <- c("dopamine", "andthe", "lowerswim", "other", "different")
ListB <- c("do", "mine", "and", "the", "lower", "owe", "swim")
My general procedure would be something like this:
- search for pattern from ListB that occurs twice in a word in ListA where the merged terms are consecutive (no spare letters in the word). So for example, from ListA 'lowerswim' would match with 'lower' and 'swim' not 'owe' and 'swim'.
- for each selected word, check if that word exists in ListB. If yes, then keep it in ListA. Otherwise, split the word into the two words matched with words from ListB
Does this sound sensible? And if so, how do I implement it in R? Maybe it sounds quite routine but at the moment I'm having trouble with:
- searching for words inside words. I can match words from lists no problem but I'm not sure how I use
grep
or equivalent to go further than this - declaring that the words must be consecutive. I've been thinking about this for a while but I can't get to seem to try anything that has worked
Can anyone please send me in the right direction?