
As per the title, I am trying to clean a large compilation of short texts by removing sentences that start with certain words, but only when such a sentence is the last of two or more sentences in that text.

Suppose I want to cut out the last sentence if it begins with 'Jack is ...'.
Here is an example with varied cases:

test_strings <- c("Jack is the tallest person.", 
                  "and Jack is the one who said, let there be fries.", 
                  "There are mirrors. And Jack is there to be suave.", 
                  "There are dogs. And jack is there to pat them. Very cool.", 
                  "Jack is your lumberjack. Jack, is super awesome.",
                  "Whereas Jack is, for the whole summer, sound asleep. Zzzz", 
                  "'Jack is so cool!' Jack is cool. Jack is also cold."
                  )

And here is the regex I currently have: "(?![A-Z'].+[\\.|'] )[Jj]ack,? is.+\\.$"

library(purrr)    # map_chr()
library(stringr)  # str_replace()

map_chr(test_strings, ~str_replace(.x, "(?![A-Z'].+[\\.|'] )[Jj]ack,? is.+\\.$", "[TRIM]"))

Producing these results:

[1] "[TRIM]"                                                   
[2] "and [TRIM]"                                               
[3] "There are mirrors. And [TRIM]"                            
[4] "There are dogs. And [TRIM]"                               
[5] "Jack is your lumberjack. [TRIM]"                          
[6] "Whereas Jack is, for the whole summer, sound asleep. Zzzz"
[7] "'Jack is so cool!' Jack is cool. [TRIM]"  


## Basically my current regex is still too greedy. 
## No trimming should happen for the first 4 examples. 
## 5 - 7th examples are correct. 

## Explanations:
# (1) Wrong. Only one sentence; do not trim, but current regex trims it. 
# (2) Wrong. It is a sentence but does not start with 'Jack is'.
# (3) Wrong. Same situation as (2) -- the sentence starts with 'And' instead of 'Jack is'
# (4) Wrong. Same as (2) (3), but this time test with lowercase `jack`
# (5) Correct. Trim the second sentence as it is the last. Optional ',' removal is tested here.
# (6) Correct. Nothing is trimmed, because the last sentence does not start with 'Jack is'.
# (7) Correct. Sometimes texts do not begin with a letter.

Thanks for any help!

fluent

2 Answers

gsub("^(.*\\.)\\s*Jack,? is[^.]*\\.?$", "\\1 [TRIM]", test_strings, ignore.case = TRUE)
# [1] "Jack is the tallest person."                              
# [2] "and Jack is the one who said, let there be fries."        
# [3] "There are mirrors. And Jack is there to be suave."        
# [4] "There are dogs. And jack is there to pat them. Very cool."
# [5] "Jack is your lumberjack. [TRIM]"                          
# [6] "Whereas Jack is, for the whole summer, sound asleep. Zzzz"
# [7] "'Jack is so cool!' Jack is cool. [TRIM]"                  

Break-down:

  • ^(.*\\.)\\s*: since there must be at least one sentence before the one we trim, we require a preceding dot \\.; everything up to and including that dot is captured by the (...) group so that \\1 can put it back, and \\s* consumes any whitespace after it;
  • Jack,? is: from your regex; ignore.case = TRUE lets it match jack as well;
  • [^.]*\\.?$: zero or more non-dot characters, followed by an optional dot and end-of-string; to allow blank space after the last period, change this to [^.]*\\.?\\s*$ (this didn't seem necessary in your example).
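To make the break-down above concrete, here is a minimal base-R sketch that shows what the capture group holds and tries the trailing-whitespace variant of the pattern. The three sample strings and the names `pattern`, `kept`, and `trimmed` are illustrative assumptions, not part of the original answer.

```r
# Illustrative inputs (assumed): the second one has trailing spaces
test_strings <- c("Jack is your lumberjack. Jack, is super awesome.",
                  "Jack is your lumberjack. Jack, is super awesome.  ",
                  "Jack is the tallest person.")

# Variant of the pattern that also tolerates whitespace after the last period
pattern <- "^(.*\\.)\\s*Jack,? is[^.]*\\.?\\s*$"

# regexec() reports the full match followed by each capture group
m <- regmatches(test_strings,
                regexec(pattern, test_strings, ignore.case = TRUE))

# Element 2 is the text captured by (.*\\.), i.e. what \\1 refers to;
# strings that do not match yield NA
kept <- vapply(m,
               function(x) if (length(x) >= 2) x[2] else NA_character_,
               character(1))
print(kept)

# The replacement re-inserts the captured text and appends the marker
trimmed <- sub(pattern, "\\1 [TRIM]", test_strings, ignore.case = TRUE)
print(trimmed)
```

The single-sentence input stays untouched because there is no earlier dot for the group to end on.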
r2evans
  • Thanks, it works perfect! Just two questions: what does the \\1 do, and how can I invert this regex to use with str_extract to extract the trimmed sentences? – fluent Oct 31 '21 at 02:48
  • `\\1` is a backreference to the first `(...)` block in the pattern. See https://stackoverflow.com/a/22944075/3358272 for a good regex (re)fresher. If you mean to extract the `Jack is ...` sentence, then `str_extract(test_strings, "(?<=[.])\\s*Jack,? is [^.]*\\.?$")`; if you mean extract other than that, then `str_replace` is for you, which is functionally equivalent to `gsub` (though, with *this* test, 5x *slower*). – r2evans Oct 31 '21 at 10:19
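For anyone without stringr at hand, the extraction mentioned in the comment above could be sketched in base R roughly as follows. This assumes the same lookbehind pattern with `perl = TRUE`; the sample strings and the names `pattern`, `m`, and `extracted` are made up for illustration.

```r
# Illustrative inputs (assumed), taken from the question's cases
test_strings <- c("Jack is the tallest person.",
                  "Jack is your lumberjack. Jack, is super awesome.",
                  "'Jack is so cool!' Jack is cool. Jack is also cold.")

# Same pattern as in the comment; the lookbehind needs perl = TRUE in base R
pattern <- "(?<=[.])\\s*Jack,? is [^.]*\\.?$"

# regexpr() returns -1 where there is no match
m <- regexpr(pattern, test_strings, perl = TRUE)

# Keep one result per input: NA when nothing would be trimmed
extracted <- rep(NA_character_, length(test_strings))
extracted[m != -1] <- trimws(regmatches(test_strings, m))
print(extracted)
```

Like the pattern in the comment, this is case-sensitive, so a last sentence starting with lowercase `jack` would not be extracted.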

You can match a dot (or use a character class such as [.!?] to match more characters), and then match the last sentence, which starts with Jack and ends with a dot (or, again, the character class to match more characters):

\.\K\h*[Jj]ack,? is[^.\n]*\.$

The pattern matches:

  • \.\K Match a . and forget what is matched so far
  • \h*[Jj]ack,? is Match optional horizontal whitespace chars, then Jack or jack, and optional comma and is
  • [^.\n]*\. Optionally match any chars except a . or a newline, then match a .
  • $ End of string


Example code:

test_strings <- c("Jack is the tallest person.", 
                  "and Jack is the one who said, let there be fries.", 
                  "There are mirrors. And Jack is there to be suave.", 
                  "There are dogs. And jack is there to pat them. Very cool.", 
                  "Jack is your lumberjack. Jack, is super awesome.",
                  "Whereas Jack is, for the whole summer, sound asleep. Zzzz", 
                  "'Jack is so cool!' Jack is cool. Jack is also cold."
                  )

sub("\\.\\K\\h*[Jj]ack,? is[^.\\n]*\\.$", " [TRIM]", test_strings, perl=TRUE)

Output

[1] "Jack is the tallest person."                              
[2] "and Jack is the one who said, let there be fries."        
[3] "There are mirrors. And Jack is there to be suave."        
[4] "There are dogs. And jack is there to pat them. Very cool."
[5] "Jack is your lumberjack. [TRIM]"                          
[6] "Whereas Jack is, for the whole summer, sound asleep. Zzzz"
[7] "'Jack is so cool!' Jack is cool. [TRIM]"
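The character-class variant mentioned at the top of this answer could look roughly like the sketch below, which treats `.`, `!` and `?` as sentence terminators. The sample strings and the names `pattern_any` and `res` are assumptions for illustration only.

```r
# Illustrative inputs (assumed) with mixed sentence terminators
test_strings <- c("Is it late? Jack is asleep!",
                  "Jack is asleep!",
                  "'Jack is so cool!' Jack is cool. Jack is also cold.")

# [.!?] replaces the literal dot on both sides; \\K drops the terminator
# of the previous sentence from the reported match
pattern_any <- "[.!?]\\K\\h*[Jj]ack,? is[^.!?\\n]*[.!?]$"

res <- sub(pattern_any, " [TRIM]", test_strings, perl = TRUE)
print(res)
```

Note that a terminator followed by a closing quote (as in `cool!'`) does not count as a sentence boundary for this pattern, since `[Jj]ack` must follow the terminator and optional whitespace directly.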
The fourth bird
  • Off hand, do you know if there's a functional difference in the lookbehind using `.\K` versus `(?<=.)`? – r2evans Oct 31 '21 at 10:30
  • @r2evans Eventually they will get the same result, see https://regex101.com/r/W4mS5i/1 and https://regex101.com/r/4hGn7m/1 But the difference is that when using the lookbehind, the lookbehind is triggered on every step until it asserts a . to the left. – The fourth bird Oct 31 '21 at 10:39
  • 1
    I had verified that it produces the same results, and now see that it is (at least in R using `replicate(1000,test_strings)`) it is nearly 1/3 the runtime. Very interesting, I hadn't known that. Thanks! – r2evans Oct 31 '21 at 10:43
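The comparison discussed in these comments can be reproduced with a sketch along these lines. It only checks that both patterns give the same result and prints timings; the names `with_k`, `with_lookbehind`, and `big`, and the use of `rep()` to build the larger input, are assumptions, and the timings will vary by machine and R version.

```r
test_strings <- c("Jack is the tallest person.",
                  "and Jack is the one who said, let there be fries.",
                  "There are mirrors. And Jack is there to be suave.",
                  "There are dogs. And jack is there to pat them. Very cool.",
                  "Jack is your lumberjack. Jack, is super awesome.",
                  "Whereas Jack is, for the whole summer, sound asleep. Zzzz",
                  "'Jack is so cool!' Jack is cool. Jack is also cold.")

# The two patterns differ only in how the preceding dot is excluded
with_k          <- "\\.\\K\\h*[Jj]ack,? is[^.\\n]*\\.$"
with_lookbehind <- "(?<=\\.)\\h*[Jj]ack,? is[^.\\n]*\\.$"

res_k  <- sub(with_k, " [TRIM]", test_strings, perl = TRUE)
res_lb <- sub(with_lookbehind, " [TRIM]", test_strings, perl = TRUE)
print(identical(res_k, res_lb))

# Rough timing on a larger input; no particular ratio is guaranteed
big <- rep(test_strings, 1000)
print(system.time(sub(with_k, " [TRIM]", big, perl = TRUE)))
print(system.time(sub(with_lookbehind, " [TRIM]", big, perl = TRUE)))
```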