I am writing a script to split text into sentences with Python. However, I am quite bad at writing more complex regular expressions.

There are 5 rules according to which I wish to split the sentences. I want to split sentences if they:

* end with "!"  or
* end with "?"  or
* end with "..."  or
* end with "." and the full stop is not followed by a number  or
* end with "." and the full stop is followed by a whitespace

What would the regular expression for this be in Python?

hippietrail
Helena
  • Do you need to *retain* the ending characters? – Martijn Pieters Nov 08 '13 at 12:31
  • Showing your previous attempts would be a nice addition to the question :) – aIKid Nov 08 '13 at 12:36
  • So far I have some very basic code: import re; splitter = r"\.(?!\d)"; re.split(splitter, s). But it splits "U.S.A" into three sentences, and "Hey..." becomes four sentences. I don't need to retain the ending characters. – Helena Nov 08 '13 at 12:39
  • Is using a library an option for you? If you are doing this for natural language processing, I really advise you to take another approach. – Hossein Nov 08 '13 at 12:40
  • The task is to write a simple algorithm on your own, so a library is not an option – Helena Nov 08 '13 at 12:43
  • I guess this is an assignment or homework, then. Take a look at this, it may give you some ideas: http://stackoverflow.com/questions/6745592/sentence-segmentation-using-regex?rq=1 – Hossein Nov 08 '13 at 12:48

1 Answer

You can literally translate your five bullet points to a regular expression:

!|\?|\.{3}|\.\D|\.\s

Note that I'm simply creating an alternation consisting of five alternatives, each of which represents one of your bullet points:

  • !
  • \?
  • \.{3}
  • \.\D
  • \.\s

Since the dot (.) and the question mark (?) are special characters within a regular expression pattern, they need to be escaped by a backslash (\) to be treated as literals. The pipe (|) is the delimiting character between two alternatives.
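To make the escaping rule concrete, here is a minimal sketch (the sample string is made up for illustration) that compares an unescaped dot with an escaped one, and shows what happens when the question mark is left unescaped:

```python
import re

sample = "Really? Yes. OK!"  # hypothetical sample text

# Unescaped: "." matches any character except a newline.
unescaped = re.findall(r".", sample)

# Escaped: "\." matches only a literal full stop.
escaped = re.findall(r"\.", sample)

# "|" separates the alternatives; each match is one of the three marks.
marks = re.findall(r"!|\?|\.", sample)

# An unescaped "?" is a quantifier with nothing to repeat, so it fails to compile.
try:
    re.compile(r"?")
    bare_question_mark_compiles = True
except re.error:
    bare_question_mark_compiles = False

print(len(unescaped), escaped, marks, bare_question_mark_compiles)
```

The unescaped dot matches every character of the sample, while the escaped one matches only the single full stop.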

Using the above regular expression, you can then split your text into sentences using re.split.
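As a rough sketch of how this might look in practice, the snippet below passes the pattern to re.split. The helper name split_sentences and the sample strings are made up for illustration. One thing to be aware of: \.\D and \.\s consume the character after the full stop, and they cannot match a full stop at the very end of the text. The second pattern is a possible variant, not part of the answer above, that uses a negative lookahead so the following character is only inspected, not removed.

```python
import re

# The pattern from the answer, alternatives in the same order.
ANSWER_PATTERN = re.compile(r"!|\?|\.{3}|\.\D|\.\s")

# Variant (an assumption, not part of the answer): the negative lookahead
# checks the character after the full stop without consuming it.
LOOKAHEAD_PATTERN = re.compile(r"!|\?|\.{3}|\.(?!\d)")


def split_sentences(text, pattern=ANSWER_PATTERN):
    """Split text on the pattern and drop empty or whitespace-only pieces."""
    return [piece.strip() for piece in pattern.split(text) if piece.strip()]


text = "Is it 3.5? Yes! Wait... It is."

# The final "." has no following character, so \.\D and \.\s leave it in place.
print(split_sentences(text))

# With the lookahead, the final "." is treated as a sentence end as well.
print(split_sentences(text, LOOKAHEAD_PATTERN))

# \.\D swallows the "T"; the lookahead variant keeps it.
print(split_sentences("Hi.There"))
print(split_sentences("Hi.There", LOOKAHEAD_PATTERN))
```

Neither pattern handles abbreviations: "U.S.A" is still split at every full stop, as noted in the comments on the question, because the five rules do not distinguish abbreviations from sentence ends.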

Marius Schulz