R string removes punctuation on split

Question

Say I have a string, for example the following:

x <- 'The world is at end. What do you think?   I am going crazy!    These people are too calm.'

I need to split only on the punctuation marks !, ? and . together with any following whitespace, and keep the punctuation attached to each piece.

The following removes the punctuation and leaves leading spaces in the split parts, though:

vec <- strsplit(x, '[!?.][:space:]*')

How can I split sentences leaving the punctuation?

hwnd · Accepted Answer · 2016-12-06T02:35:39.470

score 14

You can switch on PCRE by using perl=TRUE and use a lookbehind assertion.

strsplit(x, '(?<![^!?.])\\s+', perl=TRUE)

Regular expression:

(?<!          look behind to see if there is not:
 [^!?.]       any character except: '!', '?', '.'
)             end of look-behind
\s+           whitespace (\n, \r, \t, \f, and " ") (1 or more times)
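
For reference, a minimal sketch of running this pattern on the string from the question. The variable name `parts` is only illustrative, and the sketch assumes base R's strsplit with perl = TRUE.

```r
# The example string from the question.
x <- 'The world is at end. What do you think?   I am going crazy!    These people are too calm.'

# Split on runs of whitespace that are NOT preceded by a non-punctuation
# character, i.e. only on whitespace that directly follows '!', '?' or '.'.
# strsplit returns a list with one element per input string, hence [[1]].
parts <- strsplit(x, '(?<![^!?.])\\s+', perl = TRUE)[[1]]
print(parts)
```

Each of the four pieces should keep its closing punctuation, and the spaces between words inside a sentence are left alone because they follow a letter rather than one of the three punctuation marks.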


edited Dec 06 '16 at 02:35

answered Nov 01 '13 at 03:12

hwnd


is the reason that the `'.'` does not need to be escaped because it is in a `[ ]` group or for some other reason? – Ricardo Saporta Nov 01 '13 at 03:28
For PCRE, and other so-called Perl-compatible flavors, escape `.^$*+?()[{\|` outside a character class and `^`, `-`, `]`, `\` inside a character class. – hwnd Nov 01 '13 at 03:31
So if I set `perl = TRUE` I can use different assertions? – paulie.jvenuez Nov 01 '13 at 03:39
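
A small sketch illustrating the two points raised in these comments: a dot inside a character class is literal, and perl = TRUE makes lookaround assertions available. The test strings are made up for illustration.

```r
# Outside a character class, '.' matches any single character.
any_char    <- grepl('a.c', 'abc', perl = TRUE)     # TRUE: '.' matches 'b'

# Inside [ ], '.' is a literal dot and needs no escaping.
class_dot   <- grepl('a[.]c', 'abc', perl = TRUE)   # FALSE: 'b' is not a dot
class_match <- grepl('a[.]c', 'a.c', perl = TRUE)   # TRUE

# Outside a class, a literal dot must be escaped (doubled backslash in R strings).
escaped_dot <- grepl('a\\.c', 'abc', perl = TRUE)   # FALSE

# With perl = TRUE, lookaround assertions such as a lookahead can be used.
lookahead   <- grepl('foo(?=bar)', 'foobar', perl = TRUE)  # TRUE
```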

score 6 · Answer 2 · answered Nov 01 '13 at 03:21

The sentSplit function in the qdap package was created just for this task:

library(qdap)
sentSplit(data.frame(text = x), "text")

##   tot                       text
## 1 1.1       The world is at end.
## 2 2.2         What do you think?
## 3 3.3          I am going crazy!
## 4 4.4 These people are too calm.

score 2 · Answer 3 · edited May 23 '17 at 12:08


Take a look at this question. Character classes like [:space:] are defined within bracket expressions, so you need to enclose it in a set of brackets. Try:

vec <- strsplit(x, '[!?.][[:space:]]*')
vec
# [[1]]
# [1] "The world is at end"       "What do you think"        
# [3] "I am going crazy"          "These people are too calm"

This gets rid of the leading spaces. To keep punctuation, use a positive lookbehind assertion with perl = TRUE:

vec <- strsplit(x, '(?<=[!?.])[[:space:]]*', perl = TRUE)
vec
# [[1]]
# [1] "The world is at end."       "What do you think?"        
# [3] "I am going crazy!"          "These people are too calm."
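
As a possible variation, the lookbehind can be wrapped in a small helper that uses `+` instead of `*`. Because `*` also matches an empty string, it can split right after a punctuation mark that has no whitespace behind it (for example inside a number such as 3.14); requiring at least one whitespace character avoids that. The function name `split_sentences` is hypothetical and not part of base R or any package mentioned here.

```r
# Hypothetical helper, shown only as a sketch of the lookbehind approach.
split_sentences <- function(text) {
  # (?<=[!?.])    the position must directly follow '!', '?' or '.'
  # [[:space:]]+  one or more whitespace characters, which are consumed
  strsplit(text, '(?<=[!?.])[[:space:]]+', perl = TRUE)
}

# strsplit is vectorised, so the result is a list with one element per input.
res <- split_sentences(c('Pi is 3.14 today. Really?', 'One sentence only.'))
print(res)
```

The first input should come back as two pieces with the decimal point in 3.14 left untouched, and the second as a single piece.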


Community


answered Nov 01 '13 at 03:46

Blue Magister


He wanted the punctuation after the split also. – hwnd Nov 01 '13 at 03:56
Ah, got it. I'll edit - it'll look a lot like your answer, just with `[[:space:]]` instead of `\\s`. The overlap of answers isn't 100%, so I'm okay with it if you're okay with it. – Blue Magister Nov 01 '13 at 03:58

score 1 · Answer 4 · edited Nov 01 '13 at 04:06


You could replace the spaces following punctuation marks with a placeholder string, e.g. zzzzz, and then split on that string.

x <- gsub("([!?.])[[:space:]]*","\\1zzzzz","The world is at end. What do you think?   I am going crazy!    These people are too calm.")
strsplit(x, "zzzzz")

Where \1 in the replacement string refers to the parenthesized sub-expression of the pattern.
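
A sketch of the same idea with a placeholder that is less likely to collide with real text than zzzzz. The choice of the control character `\001` and the names `delim`, `marked` and `parts` are assumptions made for illustration; any string known not to occur in the input would do.

```r
x <- 'The world is at end. What do you think?   I am going crazy!    These people are too calm.'

# Assumed delimiter: a control character that should not appear in ordinary text.
delim <- '\001'

# Keep the punctuation (group 1) and replace the trailing whitespace with the delimiter.
marked <- gsub('([!?.])[[:space:]]*', paste0('\\1', delim), x)

# fixed = TRUE treats the delimiter as a literal string, not a regular expression.
parts <- strsplit(marked, delim, fixed = TRUE)[[1]]
print(parts)
```

The delimiter that ends up at the very end of the string does not produce an extra empty element, since strsplit drops a match at the end of the input.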


hwnd


answered Nov 01 '13 at 03:59

ndr


score 1 · Answer 5 · answered Feb 26 '14 at 23:18


As of qdap version 1.1.0 you can use the sent_detect function as follows:

library(qdap)
sent_detect(x)

## [1] "The world is at end."       "What do you think?"        
## [3] "I am going crazy!"          "These people are too calm."


Tyler Rinker


Also, as of 2.2.1, sent_detect_nlp – demongolem Nov 02 '16 at 20:05
