I want to create a new column containing the words shared by two other string columns:

sometext1 <- c('this is a text entry','here is another text entry','something else')
sometext2 <- c('text entry','text entry','no match here')
texts <- data.frame(sometext1=sometext1, sometext2=sometext2,stringsAsFactors=F)

This is my attempt, which didn't produce any matches:

texts$common <- paste(Reduce(intersect, list(strsplit(texts$sometext1,' '), strsplit(texts$sometext2,' '))), sep=" ", collapse=" ")

texts$common should look something like this:

1     'text entry'
2     'text entry'
3     ''

Thanks!!

amunategui
  • Do you mean to find the longest common sequence of words? – Marat Talipov Feb 03 '15 at 18:42
  • BTW, you could avoid the need to convert `sometext1` and `sometext2` to character by using the argument `stringsAsFactors=F` in the `data.frame` command. – Marat Talipov Feb 03 '15 at 18:43
  • Also, did you check this link: http://stackoverflow.com/questions/16196327/find-common-substrings-between-two-character-variables ? – Marat Talipov Feb 03 '15 at 18:45
  • A three-step approach with base R would be: `x <- lapply(texts, strsplit, " "); x <- Map(intersect, x[[1]], x[[2]]); texts$common <- sapply(x, paste0, collapse = " ")` – talat Feb 03 '15 at 18:46
  • @docendo discimus, yes, that's what I was looking for. Can you get the results back into the data frame column texts$common and make it an answer so I can check it? – amunategui Feb 03 '15 at 18:51
  • @amunategui, it should already be back in the data.frame after running those three commands. Will post as answer – talat Feb 03 '15 at 18:55
  • thanks Marat, I'll update my post – amunategui Feb 03 '15 at 18:55
  • @docendodiscimus: your handle choice speaks eloquently to the summum bonum. Vinimus, vidicus, vcodeRus – lawyeR Feb 03 '15 at 21:14
  • @lawyeR, you got it! I should add that to my "about me" section :D – talat Feb 03 '15 at 21:17
  • @docendodiscimus: I wish I knew a fraction of what you know so I could help others with R like you do. And, the garbled Julius Caesar may be completely wrong, I should note, but it struck me as funny. Two years of Latin 48 years ago starts to wear off on the declensions. – lawyeR Feb 03 '15 at 21:20
  • @lawyeR, I started learning R little over a year ago and most of what I know by now is from following SO questions and trying to answer them. You are already active here so you'll quickly learn more functions, I'm sure :-) This is actually also what my user name is supposed to say - learning by "teaching" (=answering). – talat Feb 03 '15 at 21:26
  • @docendodiscimus: it is frustrating to see a question that I could tackle, but Bonded Dust, Richard Scriven, Mr. Flick, or akrun (or all four of them + you) have answered it already. Naturally, the OP wants an answer ASAP. I lack the discipline not to peek at answers. Perhaps SO needs a "Blank out the Answer and Let Me Think on My Own" button. It is seductive to get caught up in gaining rep. – lawyeR Feb 03 '15 at 21:48

1 Answer

Starting from this data.frame:

> texts
#                   sometext1     sometext2
#1       this is a text entry    text entry
#2 here is another text entry    text entry
#3             something else no match here

You could use the following approach. Start by splitting the entries in each column's rows on spaces, using lapply:

x <- lapply(texts, strsplit, " ")
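To see what this step produces, here is a small self-contained sketch. It rebuilds the example data, and `words` is only an illustrative name used to avoid clashing with `x`. Each column becomes a list holding one character vector of words per row:

```r
texts <- data.frame(
  sometext1 = c("this is a text entry", "here is another text entry", "something else"),
  sometext2 = c("text entry", "text entry", "no match here"),
  stringsAsFactors = FALSE
)

# lapply() iterates over the columns of the data.frame; strsplit() then
# splits every string in that column on single spaces
words <- lapply(texts, strsplit, " ")

names(words)          # one list element per column
words$sometext1[[1]]  # the words of row 1 in the first column
words$sometext2[[3]]  # the words of row 3 in the second column
```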

Then, use Map to apply intersect to the corresponding sub-elements of the first element in x (x[[1]]) - representing the first column in texts - and the second element in x (x[[2]]) - representing the second column in texts:

x <- Map(intersect, x[[1]], x[[2]])
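As a rough illustration of the pairing that Map performs, consider two short hand-written lists (`a`, `b`, and `common` are made-up names for this sketch only):

```r
a <- list(c("this", "is", "a", "text", "entry"), c("something", "else"))
b <- list(c("text", "entry"), c("no", "match", "here"))

# Map() calls intersect(a[[i]], b[[i]]) for each position i
# and returns the results as a list
common <- Map(intersect, a, b)

common[[1]]          # the words shared by the first pair
length(common[[2]])  # 0, because the second pair shares no words
```

By contrast, the Reduce(intersect, ...) call in the question intersects the two whole lists once instead of pairing them row by row, which is why it found no match.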

Finally, use sapply to run through the list and paste/collapse the elements together and write them into the new column:

texts$common <- sapply(x, paste0, collapse = " ")

The result is:

> texts
#                   sometext1     sometext2     common
#1       this is a text entry    text entry text entry
#2 here is another text entry    text entry text entry
#3             something else no match here           
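If you need this more than once, the three steps could be wrapped in a small helper. This is only a sketch: `common_words` is a hypothetical name, not a base R function, and it assumes both inputs are character vectors of equal length with words separated by single spaces.

```r
# common_words() is a hypothetical helper, not part of base R
common_words <- function(a, b) {
  a_words <- strsplit(a, " ", fixed = TRUE)   # split each string into words
  b_words <- strsplit(b, " ", fixed = TRUE)
  shared <- Map(intersect, a_words, b_words)  # row-wise intersection
  # collapse each set of shared words back into a single string;
  # rows without shared words become ""
  vapply(shared, paste, character(1), collapse = " ")
}

texts <- data.frame(
  sometext1 = c("this is a text entry", "here is another text entry", "something else"),
  sometext2 = c("text entry", "text entry", "no match here"),
  stringsAsFactors = FALSE
)

texts$common <- common_words(texts$sometext1, texts$sometext2)
texts$common
```

Using vapply rather than sapply here is a design choice: it guarantees a character vector with one string per row, even if the input has zero rows.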
talat