You can split your string using a regex (with strsplit), then use setdiff to remove the elements that titles and the result of strsplit have in common.
# Titles to strip from the text
titles <- list("First Summary of Lorem Ipsum", "Second Summary of Lorem Ipsum")
s <- "First Summary of Lorem Ipsum

Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book.

Second Summary of Lorem Ipsum

It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum."
# Split on blank lines (two newlines, optionally padded with horizontal whitespace)
a <- unlist(strsplit(s, "\\h*\\R\\h*\\R\\h*", perl = TRUE))
# Keep only the elements that are not titles
setdiff(a, titles)
The above results in:
[1] "Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book."
[2] "It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum."
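One caveat: setdiff treats its arguments as sets, so if two sections happen to have identical body text, only one copy is returned. If that matters for your data, a minimal sketch of an alternative is to filter with %in% instead. The sample string and the names parts and bodies below are made up for illustration:

```r
titles <- c("Title A", "Title B")
s <- "Title A\n\nSame body text.\n\nTitle B\n\nSame body text."

parts <- unlist(strsplit(s, "\\h*\\R\\h*\\R\\h*", perl = TRUE))

# setdiff() de-duplicates, so the repeated body appears only once
setdiff(parts, titles)

# Filtering with %in% keeps every non-title element, in order
bodies <- parts[!parts %in% titles]
bodies
```

Both approaches give the same result when every body paragraph is unique, as in the example above.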
An explanation of the regex above, \\h*\\R\\h*\\R\\h*, follows. I have written the tokens below with single backslashes for simplicity's sake (the doubled backslash is only a string escape in R):
\h Matches horizontal whitespace
* Quantifies the previous token (in the above regex, \h) to match it zero or more times
\R Matches any Unicode newline sequence (\r\n or \r or \n)
The regex matches two newlines (with any amount of horizontal whitespace between or around them, just in case the input has something like \r\n\t\r\n).
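As a quick check of that behaviour, the sketch below runs the same pattern against a few hand-made inputs (the sample strings are assumptions chosen for illustration): a plain blank line, Windows line endings with a stray tab, padded whitespace, and a single newline, which should not split.

```r
pattern <- "\\h*\\R\\h*\\R\\h*"

# Two "\n" in a row: splits into "a" and "b"
r1 <- strsplit("a\n\nb", pattern, perl = TRUE)[[1]]

# "\r\n", a tab, then "\r\n": \R consumes each "\r\n" as one newline
r2 <- strsplit("a\r\n\t\r\nb", pattern, perl = TRUE)[[1]]

# Spaces before and after the blank line are consumed by \h*
r3 <- strsplit("a  \n\n  b", pattern, perl = TRUE)[[1]]

# A single newline does not match, so the string is left whole
r4 <- strsplit("a\nb", pattern, perl = TRUE)[[1]]

print(list(r1, r2, r3, r4))
```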
The non-Perl equivalent of this would be:
[ \\t]*(?:\\r\\n|[\\r\\n])[ \\t]*(?:\\r\\n|[\\r\\n])[ \\t]*