
I am writing a regex to extract only the first sentence of each paragraph. At the moment, I have an input vector like this:

text_insert <- c("hello, i am working through an r workbook. I am doing a regex expression.", "hi, how are you? I am great working through r")

My R code at the moment is:

gsub(pattern = "\\..*", replacement = ".", x = text_insert)

However, this does not recognize ? or ! as the end of a sentence.

How can I make it recognize ! and ? as the end of a sentence as well?

fmcgarry
  • Since your regex mentions neither `?` nor `!`, why do you expect it to find them? Your attempt is in some sense not a serious attempt at the problem. – John Coleman Nov 26 '19 at 15:06
  • How about `gsub(pattern = "([\\.\\?\\!]).*", replacement = "\\1", x = text_insert)`? – ThomasIsCoding Nov 26 '19 at 15:06
  • Does this answer your question? [Split character vector into sentences](https://stackoverflow.com/questions/46884556/split-character-vector-into-sentences) – camille Nov 26 '19 at 15:17
  • In what context is a period _not_ the end of a sentence? –  Nov 26 '19 at 17:06
  • @Frazer Bayliss, are you saying your first sentence is till `?` in your example that you posted up in the question? – JBone Nov 26 '19 at 19:09

2 Answers


You can use | to search for alternatives with a regular expression:

(\\.|!|\\?).*

Alternatively, you can use a character class ([…]) to look for “any one symbol inside the character class”:

[.!?].*

. does not need to be escaped when inside a character class.
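
Applied to the vector from the question, both patterns can be dropped into the original gsub call. This is only a sketch: the capture group and the `\\1` backreference follow the suggestion made in the comments, and the names `alt` and `cls` are illustrative, not part of the original answer.

```r
text_insert <- c(
  "hello, i am working through an r workbook. I am doing a regex expression.",
  "hi, how are you? I am great working through r"
)

# Alternation: capture the first terminator, match the rest of the string,
# and put only the captured terminator back.
alt <- gsub(pattern = "(\\.|!|\\?).*", replacement = "\\1", x = text_insert)

# Character class: same idea; . and ? need no escaping inside [...]
cls <- gsub(pattern = "([.!?]).*", replacement = "\\1", x = text_insert)

print(alt)
print(cls)
```

Both calls should keep each string up to and including its first ., ! or ?.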


Lastly, gsub is great for replacing text but what you’re actually doing is searching for text. There are better functions for that; it’s just that, in base R, they’re very inconvenient to use. However, we can use a package (e.g. stringr) to easily find matches.

Using this method means that you can describe much more directly what you’re searching for: a sequence of characters, finished by a punctuation mark:

〉stringr::str_match(text_insert, '.*?[.!?]')
     [,1]
[1,] "hello, i am working through an r workbook."
[2,] "hi, how are you?"

Note the .*?: *? is the same as *, except non-greedy (aka. “lazy”). This means that the match will stop as soon as the first instance of any of .!? is found.
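
For comparison, here is a rough base R version of the same search, using regexpr() to locate the first match and regmatches() to extract it. It is a sketch rather than a drop-in replacement: `perl = TRUE` is my choice to get PCRE's lazy quantifier, and the names `lazy` and `greedy` are illustrative. Note that regmatches() silently drops strings that contain no match, so the result can be shorter than the input.

```r
text_insert <- c(
  "hello, i am working through an r workbook. I am doing a regex expression.",
  "hi, how are you? I am great working through r"
)

# Lazy: stop at the first terminator in each string.
lazy <- regmatches(text_insert, regexpr(".*?[.!?]", text_insert, perl = TRUE))

# Greedy: .* runs on to the last terminator in each string.
greedy <- regmatches(text_insert, regexpr(".*[.!?]", text_insert, perl = TRUE))

print(lazy)
print(greedy)
```

With the greedy pattern, the first string is returned whole, because its last . is at the very end; the lazy pattern stops after "workbook.".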

Konrad Rudolph

According to the OP, the first sentence ends at ?. A bit odd, but this is the requirement from the question.

/^([^?!]*)/

captures the first sentence right up to, but not including, the ?

Explanation:

/^    -- beginning of the string, to capture the first sentence.
[^?!]*  -- move till you find either ? or !. Note that ^ inside a character class represents negation, meaning [NOT ? or !]
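
To try this pattern in R rather than on regex101, a sketch along these lines should work. The surrounding slashes are regex101 delimiters and are not part of an R pattern, so they are dropped here; the use of regexpr() and regmatches() and the name `up_to_mark` are my assumptions, not part of the answer.

```r
text_insert <- c(
  "hello, i am working through an r workbook. I am doing a regex expression.",
  "hi, how are you? I am great working through r"
)

# ^[^?!]* matches from the start of the string up to, but not including,
# the first ? or !. A period is not treated as a terminator by this pattern.
up_to_mark <- regmatches(text_insert, regexpr("^[^?!]*", text_insert))

print(up_to_mark)
```

On the question's data, the second string is cut just before the ?, while the first string, which contains neither ? nor !, comes back unchanged.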

here is the demonstration on regex101

JBone