
I would like to use R to compare written text and extract sections which differ between the elements.

Consider two text paragraphs, `a` and `b`, where one is a modified version of the other:

a <- "This part is the same. This part is old."
b <- "This string is updated. This part is the same."

I want to compare the two strings and get, as output, the part that is unique to each of them, preferably separately for both input strings.

Expected output:

stringdiff <- list(a = " This part is old.", b = "This string is updated. ")

> stringdiff
$a
[1] " This part is old."

$b
[1] "This string is updated. "

I've tried a solution from "Extract characters that differ between two strings", but this only compares unique characters. The answer in "Simple Comparing of two texts in R" comes closer, but still only compares unique words.

Is there any way to get the expected output without too much of a hassle?

LAP

1 Answer


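We combine both strings into a vector, split at the whitespace after each `.` to create a list of sentences ('lst'), and get the unique elements from unlisting 'lst' ('un1'). With setdiff we then get, for each string, the elements of 'un1' that are not in that string, i.e. the sentences that occur only in the other string: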

lst <- strsplit(c(a= a, b = b), "(?<=[.])\\s", perl = TRUE)
un1 <- unique(unlist(lst))
lapply(lst, setdiff, x= un1)
akrun
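For reference, a minimal sketch of what the call above returns for the question's `a` and `b`, next to one possible variant that matches the orientation of the expected output. The pairwise `setdiff` variant and the names `res` and `own` are illustrative assumptions, not part of the answer. Splitting on whitespace also drops the leading and trailing spaces shown in the expected output.

```r
a <- "This part is the same. This part is old."
b <- "This string is updated. This part is the same."

# Split each string into sentences at the whitespace following a period
lst <- strsplit(c(a = a, b = b), "(?<=[.])\\s", perl = TRUE)
un1 <- unique(unlist(lst))

# setdiff(un1, element): pooled sentences that are missing from each string
res <- lapply(lst, setdiff, x = un1)
# res$a is "This string is updated.", res$b is "This part is old."

# Assumed variant: sentences each string has that the other one lacks
own <- list(a = setdiff(lst$a, lst$b), b = setdiff(lst$b, lst$a))
# own$a is "This part is old.", own$b is "This string is updated."
```

With only two inputs, `res` and `own` hold the same sentences under swapped names, so renaming the result would also work.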
  • Works perfectly. It even renders the problem of white spaces before and after sentences moot, which I thought could become a problem. Thanks! – LAP Nov 29 '17 at 09:16
  • Long time no speak. I have a question. This `x` argument is taking a vector. According to the CRAN help, `x` is typically a list. Would you be able to imagine any other situations that you would use this `x` argument (a list or a vector)? – jazzurro Nov 30 '17 at 10:33
  • Which `x` do you mean? The one in `lapply(lst, setdiff, x= un1)`? If so, the `x= un1` defines the `x` for `setdiff(x, y)`, while the `lapply` over `lst` provides the `y`. – LAP Nov 30 '17 at 10:37
  • At the moment I see `$a [1] "This string is updated."` and `$b [1] "This part is old."` from the lapply. Am I missing something? – jazzurro Nov 30 '17 at 10:42
  • @LAP Thanks for the reply. I was trying to get help from akrun. That explains what I was looking for. – jazzurro Nov 30 '17 at 10:43
  • @akrun Do you think there's a way to be able to identify if the exact same sentence occurs more often in one string than in the other? Just consider the situation above with `b <- "This string is updated. This part is the same. This part is the same."` – LAP Nov 30 '17 at 10:47
  • @LAP Perhaps you may need `duplicated` or so to identify those. – akrun Nov 30 '17 at 11:39
  • @jazzurro long time friend. Hope u r doing great. Yes, in this case it is a vector as LAP mentioned – akrun Nov 30 '17 at 11:39
  • Thanks, guess I'll figure it out if the need ever arises. – LAP Nov 30 '17 at 11:40
  • @akrun So we do not need to write an anonymous function then? – jazzurro Nov 30 '17 at 11:46
  • @jazzurro The anonymous function will make it more clear. But, if you look at the arguments of `setdiff(x, y)`, here we specified `x`, so it will figure out the y as the vector from `lst` element – akrun Nov 30 '17 at 11:47
  • @akrun that means you want to use `x` when you need to specify something in a function (e.g., setdiff, union and etc) that you want to use? – jazzurro Nov 30 '17 at 13:31
  • @jazzurro It is just that 'x' is the first argument in `?setdiff`. There is no way I can identify the 'y' with `lapply`. So, I specified 'x' and it automatically finds the second argument as 'y'. Similarly for `union` or `intersect` as all of these have 'x' and 'y' as arguments. – akrun Nov 30 '17 at 15:31
  • @akrun That means lapply is taking `x` as the second mandatory component in a function like setdiff, right? I am going a bit too far, but what if we use a function which requires three components? Would lapply consider `x` as the 2nd component? – jazzurro Dec 01 '17 at 02:08
  • @jazzurro If there are 3 arguments, let's say x, y, z, and your 'x' is something like 'un1', specify that; the vector from lapply then becomes 'y', and 'z' is another component: `lapply(lst, yourfunc, x = un1, z= un2)`. Note that this will make sure that 'y' is the vector from the `list`. But if we don't specify the 'z' argument by name, it could become confusing, as the arguments will then be matched by their order – akrun Dec 01 '17 at 03:38
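Two points from the comment thread can be sketched in code: how `lapply` passes each list element to the first argument not already matched by name, and one way to spot a sentence that occurs more often in one string than in the other. This is only a sketch under stated assumptions: `f3`, `b2`, `pool`, `counts` and `surplus_in_b` are hypothetical names, and the second part uses plain counting rather than the `duplicated` approach suggested above.

```r
a <- "This part is the same. This part is old."
b <- "This string is updated. This part is the same."
lst <- strsplit(c(a = a, b = b), "(?<=[.])\\s", perl = TRUE)
un1 <- unique(unlist(lst, use.names = FALSE))

# lapply() calls FUN(element, ...); `x` is matched by name, so the element
# falls into the first unmatched argument of setdiff(x, y), which is `y`
by_name <- lapply(lst, setdiff, x = un1)
by_anon <- lapply(lst, function(s) setdiff(un1, s))  # same call, spelled out

# Hypothetical three-argument function: `x` and `z` are named,
# so each list element becomes `y`
f3 <- function(x, y, z) paste(x, y, z)
out3 <- lapply(list("e1", "e2"), f3, x = "X", z = "Z")
# out3[[1]] is "X e1 Z"

# Repeated sentences: count how often each pooled sentence occurs per string
b2 <- "This string is updated. This part is the same. This part is the same."
lst2 <- strsplit(c(a = a, b = b2), "(?<=[.])\\s", perl = TRUE)
pool <- unique(unlist(lst2, use.names = FALSE))
counts <- sapply(lst2, function(s) {
  vapply(pool, function(p) sum(s == p), integer(1))
})

# Positive values: the sentence occurs more often in b2 than in a
surplus_in_b <- counts[, "b"] - counts[, "a"]
```

Here `counts` is a matrix with one row per distinct sentence and one column per input string, so a sentence that appears twice in `b2` and once in `a` shows up with a surplus of 1.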