As per the title, I am trying to clean a large collection of short texts by removing sentences that start with certain words, but only when such a sentence is the last of more than one sentence in its text.
Suppose I want to cut out the last sentence if it begins with 'Jack is ...'.
Here is an example with varied cases:
test_strings <- c("Jack is the tallest person.",
"and Jack is the one who said, let there be fries.",
"There are mirrors. And Jack is there to be suave.",
"There are dogs. And jack is there to pat them. Very cool.",
"Jack is your lumberjack. Jack, is super awesome.",
"Whereas Jack is, for the whole summer, sound asleep. Zzzz",
"'Jack is so cool!' Jack is cool. Jack is also cold."
)
And here is the regex I currently have: "(?![A-Z'].+[\\.|'] )[Jj]ack,? is.+\\.$"
library(purrr)    # map_chr()
library(stringr)  # str_replace()
map_chr(test_strings, ~str_replace(.x, "(?![A-Z'].+[\\.|'] )[Jj]ack,? is.+\\.$", "[TRIM]"))
Producing these results:
[1] "[TRIM]"
[2] "and [TRIM]"
[3] "There are mirrors. And [TRIM]"
[4] "There are dogs. And [TRIM]"
[5] "Jack is your lumberjack. [TRIM]"
[6] "Whereas Jack is, for the whole summer, sound asleep. Zzzz"
[7] "'Jack is so cool!' Jack is cool. [TRIM]"
## Basically, my current regex still matches too much.
## No trimming should happen for the first 4 examples.
## Examples 5 to 7 are correct.
## Explanations:
# (1) Wrong. Only one sentence; do not trim, but current regex trims it.
# (2) Wrong. It is a sentence but does not start with 'Jack is'.
# (3) Wrong. Same situation as (2): the last sentence starts with 'And' instead of 'Jack is'.
# (4) Wrong. Same as (2) and (3), but this time testing lowercase `jack`.
# (5) Correct. Trim the second sentence as it is the last. The optional ',' after 'Jack' is tested here.
# (6) Correct. The last sentence ('Zzzz') does not start with 'Jack is', so nothing is trimmed.
# (7) Correct. Sometimes texts do not begin with a letter.
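To make the target concrete, here is a minimal sketch of the output I expect, plus a small checker for trying out candidate patterns. The names `expected` and `check_pattern` are made up for this post. The candidate pattern at the end is only an assumption that I have reasoned through for these seven strings; it is not a general sentence splitter and would likely break if the last sentence contains abbreviations or decimals.

```r
library(stringr)

test_strings <- c(
  "Jack is the tallest person.",
  "and Jack is the one who said, let there be fries.",
  "There are mirrors. And Jack is there to be suave.",
  "There are dogs. And jack is there to pat them. Very cool.",
  "Jack is your lumberjack. Jack, is super awesome.",
  "Whereas Jack is, for the whole summer, sound asleep. Zzzz",
  "'Jack is so cool!' Jack is cool. Jack is also cold."
)

# Desired output: only strings 5 and 7 change; 1 to 4 and 6 stay untouched.
expected <- test_strings
expected[5] <- "Jack is your lumberjack. [TRIM]"
expected[7] <- "'Jack is so cool!' Jack is cool. [TRIM]"

# Hypothetical helper: indices of the test strings that a pattern gets wrong.
check_pattern <- function(pattern, replacement = "[TRIM]") {
  which(str_replace(test_strings, pattern, replacement) != expected)
}

# The pattern from above; it flags strings 1 to 4, matching the results shown.
current <- "(?![A-Z'].+[\\.|'] )[Jj]ack,? is.+\\.$"
check_pattern(current)

# Candidate (an assumption, not a vetted fix): require a sentence boundary
# right before 'Jack', keep it via a capture group, and forbid further
# sentence-ending punctuation before the final period.
candidate <- "([.!?]'? )[Jj]ack,? is[^.!?]+\\.$"
check_pattern(candidate, "\\1[TRIM]")
```

If the candidate behaves the way I think it does, the second call should return an empty vector, but I would still like to know whether there is a more robust way to anchor the match to the start of the last sentence.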
Thanks for any help!