Split character vector into sentences

Question

I have the following character vector:

"This is a very long character vector. Why is it so long? I want to split this vector into senteces by using e.g. strssplit. Can someone help me? That would be nice?"

I want to split it into sentences by using the following pattern (i.e. period - space - upper case letter):

"This is a very long character vector."
"Why is it so long? I want to split this vector into senteces by using e.g. strssplit."
"Can someone help me?"
"That would be nice?"

Hence, a period after an abbreviation should not start a new sentence. I want to do this using regular expressions in R.

Can someone help me?

What about question mark? You split after it in two different ways? — pogibas, Oct 23 '17 at 08:09

f.lechleitner · Accepted Answer · 2017-10-23T09:06:38.647

A solution using strsplit:

string <- "This is a very long character vector. Why is it so long? I think lng. is short for long. I want to split this vector into senteces by using e.g. strssplit. Can someone help me? That would be nice?"
unlist(strsplit(string, "(?<=[[:punct:]])\\s(?=[A-Z])", perl = TRUE))

Result:

[1] "This is a very long character vector."                             
[2] "Why is it so long?"                                                
[3] "I think lng. is short for long."                                   
[4] "I want to split this vector into senteces by using e.g. strssplit."
[5] "Can someone help me?"                                              
[6] "That would be nice?"

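The pattern matches a single whitespace character that is preceded by a punctuation character and followed by an uppercase letter. Both (?<=[[:punct:]]) and (?=[A-Z]) are zero-width lookarounds: they are checked but not consumed, so the punctuation stays at the end of the piece before the delimiter and the uppercase letter stays at the start of the piece after it. Only the whitespace itself is removed.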
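To see why the lookarounds matter, here is a minimal sketch that compares a pattern that consumes the period and the capital letter with the lookaround version. The sample string x and the variable names consumed and kept are made up for illustration.

```r
# Illustrative input (not from the question).
x <- "First sentence. Second sentence. Third one."

# Consuming pattern: the period, the space and the capital letter are all
# part of the delimiter, so strsplit() drops them.
consumed <- unlist(strsplit(x, "\\.\\s[A-Z]", perl = TRUE))

# Lookaround pattern: only the whitespace is the delimiter; the period and
# the capital letter are checked but left in place.
kept <- unlist(strsplit(x, "(?<=\\.)\\s(?=[A-Z])", perl = TRUE))

print(consumed)
print(kept)
```

With the consuming pattern the pieces lose their final period and their first letter; with the lookarounds the sentences come back intact.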

EDIT: I just saw you didn't split after a question mark in your desired output. If you only want to split after a "." you can use this:

unlist(strsplit(string, "(?<=\\.)\\s(?=[A-Z])", perl = TRUE))

which gives

[1] "This is a very long character vector."                             
[2] "Why is it so long? I think lng. is short for long."                
[3] "I want to split this vector into senteces by using e.g. strssplit."
[4] "Can someone help me? That would be nice?"
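Note that both patterns would still split after an abbreviation that happens to be followed by a capital letter (for example "e.g. R"). One possible way around that, sketched here under the assumption that you know the abbreviations in advance, is to add a negative lookbehind per abbreviation. The helper split_sentences and the example text are hypothetical, and only "e.g." and "i.e." are handled.

```r
# Hypothetical helper: split after a period followed by whitespace and a
# capital letter, unless the period ends "e.g." or "i.e.".
split_sentences <- function(text) {
  # Each (?<!...) is a fixed-length negative lookbehind for one abbreviation.
  pattern <- "(?<!e\\.g\\.)(?<!i\\.e\\.)(?<=\\.)\\s(?=[A-Z])"
  unlist(strsplit(text, pattern, perl = TRUE))
}

# Illustrative input where abbreviations are followed by capital letters.
example <- "I use several tools, e.g. R and Python. Both are fast, i.e. I rarely wait. Done."
parts <- split_sentences(example)
print(parts)
```

This only covers abbreviations you list explicitly; for arbitrary text a dedicated sentence tokenizer is probably the more robust choice.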

score 5 · Answer 2 · answered Oct 23 '17 at 08:29

You could use the package tokenizers for that:

library(tokenizers)
tokenize_sentences(x)

where x is your character vector. It returns a list with one element per element of x:

[[1]]
[1] "This is a very long character vector."

[[2]]
[1] "Why is it so long?"                                                
[2] "I want to split this vector into senteces by using e.g. strssplit."

[[3]]
[1] "Can someone help me?"

[[4]]
[1] "That would be nice?"

You could then use unlist to remove the list structure.
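As a small sketch of that last step, the list below is a hand-built stand-in with the same shape as a tokenizer result (one list element per input string); it is not actual tokenizers output, and the names sentences_by_input and flat are made up.

```r
# Stand-in for a per-input list of sentences (hand-built, for illustration).
sentences_by_input <- list(
  c("Why is it so long?", "Can someone help me?"),
  "That would be nice?"
)

# lengths() shows how many sentences each input produced.
print(lengths(sentences_by_input))

# unlist() flattens everything into one plain character vector.
flat <- unlist(sentences_by_input)
print(flat)
```

Keep the list form if you need to know which input each sentence came from; unlist discards that grouping.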
