I have the following character vector:

"This is a very long character vector. Why is it so long? I want to split this vector into senteces by using e.g. strssplit. Can someone help me? That would be nice?"

I want to split it into sentences wherever a period is followed by a space and an upper-case letter, so that the result is:

"This is a very long character vector."
"Why is it so long? I want to split this vector into senteces by using e.g. strssplit."
"Can someone help me?"
"That would be nice?"

Hence, a period after an abbreviation should not start a new sentence. I want to do this using regular expressions in R.

Can someone help me?

pogibas

2 Answers

A solution using strsplit:

string <- "This is a very long character vector. Why is it so long? I think lng. is short for long. I want to split this vector into senteces by using e.g. strssplit. Can someone help me? That would be nice?"
unlist(strsplit(string, "(?<=[[:punct:]])\\s(?=[A-Z])", perl = TRUE))

Result:

[1] "This is a very long character vector."                             
[2] "Why is it so long?"                                                
[3] "I think lng. is short for long."                                   
[4] "I want to split this vector into senteces by using e.g. strssplit."
[5] "Can someone help me?"                                              
[6] "That would be nice?" 

The pattern splits on a whitespace character that is preceded by a punctuation character and followed by an uppercase letter. Both (?<=[[:punct:]]) and (?=[A-Z]) are zero-width lookarounds: they have to match, but they are not consumed by the split, so the punctuation stays at the end of one sentence and the uppercase letter stays at the start of the next.
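
To see what the lookarounds buy you, here is a small sketch that compares the pattern with a plain, consuming version. The short string s is made up for illustration and is not part of the question.

```r
# Hypothetical short input, for illustration only
s <- "One. Two? Three"

# Lookbehind/lookahead are zero-width: only the whitespace is removed
with_lookarounds <- unlist(
  strsplit(s, "(?<=[[:punct:]])\\s(?=[A-Z])", perl = TRUE)
)

# Without lookarounds, the punctuation and the capital letter are
# part of the delimiter and get removed as well
without_lookarounds <- unlist(
  strsplit(s, "[[:punct:]]\\s[A-Z]", perl = TRUE)
)

print(with_lookarounds)     # "One."  "Two?"  "Three"
print(without_lookarounds)  # "One"   "wo"    "hree"
```

The second result shows why a plain pattern is not enough here: strsplit drops everything the delimiter matches.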

EDIT: I just saw you didn't split after a question mark in your desired output. If you only want to split after a "." you can use this:

unlist(strsplit(string, "(?<=\\.)\\s(?=[A-Z])", perl = TRUE))

which gives

[1] "This is a very long character vector."                             
[2] "Why is it so long? I think lng. is short for long."                
[3] "I want to split this vector into senteces by using e.g. strssplit."
[4] "Can someone help me? That would be nice?"  
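
If you want to switch between the two variants, one option is to wrap them in a small helper. This is only a sketch: split_sentences and its period_only argument are names I made up, and the test string is invented for illustration.

```r
# Hypothetical helper (not a base R function): split text into sentences.
# period_only = TRUE  -> split only after "."
# period_only = FALSE -> split after any punctuation character
split_sentences <- function(text, period_only = TRUE) {
  lookbehind <- if (period_only) "(?<=\\.)" else "(?<=[[:punct:]])"
  pattern <- paste0(lookbehind, "\\s(?=[A-Z])")
  unlist(strsplit(text, pattern, perl = TRUE))
}

# Made-up example input
txt <- "Is it long? Yes, e.g. this one. Done."

res_period <- split_sentences(txt)
res_punct  <- split_sentences(txt, period_only = FALSE)

print(res_period)  # "Is it long? Yes, e.g. this one."  "Done."
print(res_punct)   # "Is it long?"  "Yes, e.g. this one."  "Done."
```

Note that either variant still splits after an abbreviation that is followed by a capitalised word (for example "Dr. Smith"), since the pattern only looks at the next letter's case.
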
f.lechleitner

You could use the package tokenizers for that:

library(tokenizers)
tokenize_sentences(x)

where x is your character vector. It results in

[[1]]
[1] "This is a very long character vector."

[[2]]
[1] "Why is it so long?"                                                
[2] "I want to split this vector into senteces by using e.g. strssplit."

[[3]]
[1] "Can someone help me?"

[[4]]
[1] "That would be nice?"   

You could then use unlist to remove the list structure.

Karsten W.