8

Say I have a string such as the following:

x <- 'The world is at end. What do you think?   I am going crazy!    These people are too calm.'

I need to split only on the punctuation marks !, ? and . plus any whitespace that follows them, while keeping the punctuation attached to each sentence.

The following attempt removes the punctuation and leaves leading spaces in the split parts, though:

vec <- strsplit(x, '[!?.][:space:]*')

How can I split the string into sentences while keeping the punctuation?

hwnd
paulie.jvenuez

5 Answers

14

You can switch on PCRE with perl=TRUE and use a negative lookbehind assertion.

strsplit(x, '(?<![^!?.])\\s+', perl=TRUE)

Regular expression:

(?<!          look behind to see if there is not:
 [^!?.]       any character except: '!', '?', '.'
)             end of look-behind
\s+           whitespace (\n, \r, \t, \f, and " ") (1 or more times)
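As a quick check of the pattern explained above, here is a minimal sketch that applies it to the string from the question and flattens the result. The variable name `parts` is only illustrative and not part of the original answer.

```r
# The example string from the question.
x <- 'The world is at end. What do you think?   I am going crazy!    These people are too calm.'

# Split on runs of whitespace that are NOT preceded by a character other
# than '!', '?' or '.', i.e. whitespace that directly follows sentence
# punctuation. The punctuation itself is not part of the match, so it stays.
parts <- strsplit(x, '(?<![^!?.])\\s+', perl = TRUE)

# strsplit() returns a list with one element per input string;
# unlist() flattens it to a plain character vector.
parts <- unlist(parts)
print(parts)
```

Because `\\s+` consumes the whole run of whitespace, the pieces should come back without leading spaces.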


hwnd
6

The sentSplit function in the qdap package was created just for this task:

library(qdap)
sentSplit(data.frame(text = x), "text")

##   tot                       text
## 1 1.1       The world is at end.
## 2 2.2         What do you think?
## 3 3.3          I am going crazy!
## 4 4.4 These people are too calm.
Tyler Rinker
2

Take a look at this question. Character classes like [:space:] are only recognized inside a bracket expression, so you need to enclose them in another set of brackets. Try:

vec <- strsplit(x, '[!?.][[:space:]]*')
vec
# [[1]]
# [1] "The world is at end"       "What do you think"        
# [3] "I am going crazy"          "These people are too calm"

This gets rid of the leading spaces. To keep punctuation, use a positive lookbehind assertion with perl = TRUE:

vec <- strsplit(x, '(?<=[!?.])[[:space:]]*', perl = TRUE)
vec
# [[1]]
# [1] "The world is at end."       "What do you think?"        
# [3] "I am going crazy!"          "These people are too calm."
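To illustrate the bracket-expression point made at the start of this answer, the sketch below compares the two spellings with `grepl()`. It assumes R's default regex engine (no `perl = TRUE`); the variable names are made up for the example.

```r
# Assumption: default regex engine, i.e. perl = TRUE is NOT set.

# '[:space:]' on its own is an ordinary bracket expression containing the
# literal characters ':', 's', 'p', 'a', 'c' and 'e'.
set_matches_letters <- grepl('[:space:]', 'pace')
set_matches_space   <- grepl('[:space:]', ' ')

# '[[:space:]]' is a bracket expression that contains the POSIX class,
# so it matches whitespace characters instead.
class_matches_space   <- grepl('[[:space:]]', ' ')
class_matches_letters <- grepl('[[:space:]]', 'pace')

print(c(set_matches_letters, set_matches_space,
        class_matches_space, class_matches_letters))
```

This is why the pattern in the question left the leading spaces in place: its `[:space:]*` part never matched a space.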
Community
Blue Magister
  • He wanted the punctuation after the split also. – hwnd Nov 01 '13 at 03:56
  • Ah, got it. I'll edit - it'll look a lot like your answer, just with `[[:space:]]` instead of `\\s`. The overlap of answers isn't 100%, so I'm okay with it if you're okay with it. – Blue Magister Nov 01 '13 at 03:58
1

You could replace the spaces following punctuation marks with a placeholder string, e.g. zzzzz, and then split on that string.

x <- gsub("([!?.])[[:space:]]*","\\1zzzzz","The world is at end. What do you think?   I am going crazy!    These people are too calm.")
strsplit(x, "zzzzz")

Here, \1 in the replacement string refers to the text matched by the parenthesized subexpression of the pattern, so the punctuation mark is put back in front of the placeholder.
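The same placeholder idea can be wrapped in a small function. This is only a sketch: `split_on_marker` and its `marker` argument are hypothetical names, and it assumes a single input string and a marker that does not occur in the text and contains no backslashes.

```r
# Hypothetical helper (not from the original answer): mark sentence
# boundaries with a placeholder, then split on that placeholder.
split_on_marker <- function(text, marker = "zzzzz") {
  # "\\1" re-inserts the punctuation captured by ([!?.]); the whitespace
  # that follows it is replaced by the marker.
  marked <- gsub("([!?.])[[:space:]]*", paste0("\\1", marker), text)
  # fixed = TRUE treats the marker as a literal string, not as a regex.
  # [[1]] assumes `text` is a single string.
  strsplit(marked, marker, fixed = TRUE)[[1]]
}

x <- 'The world is at end. What do you think?   I am going crazy!    These people are too calm.'
sentences <- split_on_marker(x)
print(sentences)
```

Note that `[[:space:]]*` also matches zero characters, so a period inside a token such as 3.14 would be treated as a sentence boundary as well.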

hwnd
ndr
1

As of qdap version 1.1.0 you can use the sent_detect function as follows:

library(qdap)
sent_detect(x)

## [1] "The world is at end."       "What do you think?"        
## [3] "I am going crazy!"          "These people are too calm."
Tyler Rinker