I'm trying to split text from academic papers into sentences. A naive approach would simply be:

sentence = 'This is a sentence. This is another sentence'
separate = sentence.split('.')

# ['This is a sentence', ' This is another sentence']

However, this logic does not work if you have sentences such as:

This is a sentence is a paper with a citation (author et al., 2020a) and it contains more 
information. This is similar to the examples I have (author et al., 2020a).

How could I split sentences (like the sample above) so the output would look something like this:

['This is a sentence is a paper with a citation (author et al., 2020a) and it contains more information' , 'This is similar to the examples I have (author et al., 2020a)' ]

What is an easy solution to this problem? Appreciate the suggestions.

Landon G

1 Answer

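A simple solution is to split with re.split on the pattern r"\. (?=[A-Z])" (a dot and a space, followed by an uppercase letter). Note that str.split does not accept a regular expression, and the lookahead (?=...) keeps the uppercase letter at the start of the next sentence instead of consuming it:

import re

sentences = re.split(r"\. (?=[A-Z])", values)  # splits the sample into its 2 sentences
sentences = values.split(". ")  # more basic and generic: plain string split, no regex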
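Applied to the sample text from the question, the dot-space-uppercase idea can be sketched as below. This is a minimal sketch, not a general sentence tokenizer: the helper name split_sentences is hypothetical, and it assumes every sentence starts with an uppercase ASCII letter and that wrapped lines should be joined with single spaces first.

```python
import re

def split_sentences(text):
    # Hypothetical helper: collapse line breaks and repeated spaces so that
    # text wrapped across lines splits cleanly.
    text = " ".join(text.split())
    # Split on whitespace that follows a period and precedes a capital letter.
    # (?<=\.) leaves the period on the left part; (?=[A-Z]) leaves the capital
    # letter on the right part.
    parts = re.split(r"(?<=\.)\s+(?=[A-Z])", text)
    # Drop the trailing period to match the output format asked for in the question.
    return [p.rstrip(".") for p in parts if p]

sample = (
    "This is a sentence is a paper with a citation (author et al., 2020a) "
    "and it contains more information. This is similar to the examples "
    "I have (author et al., 2020a)."
)

sentences = split_sentences(sample)
print(sentences)
```

The citation survives because the period in "et al.," is followed by a comma rather than whitespace. The same rule would still split incorrectly after an abbreviation that is followed by a capitalized word (for example "Dr. Smith"), and it would miss sentences that begin with a digit or a lowercase letter.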

A more robust option is to use a dedicated library such as nltk; see the related question "Python split text on sentences".

azro