
I have a text that is split across many lines in no particular format, so I call line.strip('\n') on each line. Then I want to split the text into sentences, using the period (.) as the sentence-end marker, with these rules:

  1. Split at a period (.) that is followed by whitespace (\s) and then by a capital letter [A-Z], possibly with a non-whitespace character (\S) such as " or ' in between.
  2. Do not split on [0-9]\.[A-Za-z], as in 1.stackoverflow real time solution.

My program only solves half of rule 1: a period (.) followed by \s and [A-Z]. Here is the code:

# -*- coding: utf-8 -*-
import re, sys

source = open(sys.argv[1], 'rb')
dest = open(sys.argv[2], 'wb')
sent = []
for line in source:
    line1 = line.strip('\n')
    k = re.sub(r'\.\s+([A-Z“])'.decode('utf8'), '.\n\g<1>', line1)
    sent.append(k)

for line in sent:
    dest.write(''.join(line))

Also, I'd like to know the best way to master regex. It seems confusing.

hippietrail
Iykeln

1 Answer


To include the single quote in the character class, escape it with a \. The regex should be:

\.\s+[A-Z"\']

That's really all you need. You only need to tell a regex what to match; you don't need to specify what you don't want to match. Anything that doesn't fit the pattern won't match.

This regex will match any period followed by whitespace followed by a capital letter or a quote. Since a period immediately preceded by a number and immediately followed by a letter doesn't meet those criteria, it won't match.
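
A quick way to check that claim is the minimal Python 3 sketch below. The sample strings are taken from the question and the answer, and the variable names are only illustrative.

```python
import re

# Period, one or more whitespace characters, then a capital or a straight quote.
pattern = re.compile(r'\.\s+[A-Z"\']')

# Digit-period-letter has no whitespace after the period, so nothing matches.
no_match = pattern.search("1.stackoverflow real time solution")

# Period-space-capital does match, and the match includes the capital letter.
match = pattern.search("I am Sam. Sam I am.")

print(no_match)        # None
print(match.group(0))  # ". S"
```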

This assumes that the regex you had was working to split on a period followed by whitespace followed by a capital, as you stated. Note, however, that this means I am Sam. Sam I am. would split into I am Sam and am I am. Is that really what you want? If not, use zero-width assertions to exclude from the match the parts you want to keep. Here are your options, ordered by how likely I think it is that you want each one.

1) Keep the period and the first letter or opening quote of the next sentence; lose the whitespace:

(?<=\.)\s+(?=[A-Z"\'])

This will split the example above into I am Sam. and Sam I am.
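
As a minimal Python 3 sketch of option 1, assuming the split is done with `re.split` from the standard library:

```python
import re

text = "I am Sam. Sam I am."

# The lookbehind keeps the period with the first sentence and the lookahead
# keeps the capital with the next one; only the whitespace is consumed.
parts = re.split(r'(?<=\.)\s+(?=[A-Z"\'])', text)

print(parts)  # ['I am Sam.', 'Sam I am.']
```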

2) Keep the first letter of the next sentence; lose the period and whitespace:

\.\s+(?=[A-Z"\'])

This will split into I am Sam and Sam I am. This presumes that more sentences follow; otherwise the period will stay with the second sentence, because it's not followed by whitespace and a capital letter or quote. If this is the option you want (the sentences without the periods), you might also want to match a period followed by the end of the string, with optional intervening whitespace, so that the final period and any trailing whitespace are dropped:

\.(?:\s+(?=[A-Z"\'])|\s*$)

Note the ?:. You need non-capturing parentheses, because if a split pattern contains capture groups, whatever each group captures is added as an element of the result (e.g. split('(\+)', 'a+b+c') gives you an array of a + b + c rather than just a b c).
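
The effect of capture groups, and option 2 with the end-of-string alternative, can be sketched in Python 3 as follows. This assumes `re.split`; filtering out the empty string is an addition of this sketch, needed because a match at the very end of the input leaves an empty trailing element.

```python
import re

# A capture group in the pattern puts the delimiters into the result.
with_group = re.split(r'(\+)', 'a+b+c')
without_group = re.split(r'\+', 'a+b+c')

# Option 2 with the end-of-string alternative in a non-capturing group.
pattern = r'\.(?:\s+(?=[A-Z"\'])|\s*$)'
raw_parts = re.split(pattern, "I am Sam. Sam I am.")

# The final period matches at the end of the string and leaves '' behind.
sentences = [part for part in raw_parts if part]

print(with_group)     # ['a', '+', 'b', '+', 'c']
print(without_group)  # ['a', 'b', 'c']
print(sentences)      # ['I am Sam', 'Sam I am']
```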

3) Keep everything; whitespace goes with the preceding sentence:

(?<=\.\s+)(?=[A-Z"\'])

This will give you I am Sam. and Sam I am.
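
One caveat for Python specifically: the standard `re` module only accepts fixed-width lookbehinds, so `(?<=\.\s+)` is rejected there. The sketch below shows that, plus one possible workaround that collects the sentences with `re.findall` instead of splitting. The `findall` pattern is a substitute technique chosen for this sketch, not part of the original answer.

```python
import re

text = "I am Sam.  Sam I am."  # two spaces after the first period

# Python's re module refuses variable-width lookbehinds at compile time.
try:
    re.compile(r'(?<=\.\s+)(?=[A-Z"\'])')
    lookbehind_ok = True
except re.error:
    lookbehind_ok = False

# Workaround: match each sentence together with its trailing whitespace,
# and let the second alternative pick up whatever remains at the end.
pieces = re.findall(r'.+?\.\s+(?=[A-Z"\'])|.+', text, flags=re.DOTALL)

print(lookbehind_ok)  # False
print(pieces)         # ['I am Sam.  ', 'Sam I am.']
```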

Regarding the last part of your question, the best resource for regex syntax I've seen is http://www.regular-expressions.info. Start with this summary: http://www.regular-expressions.info/reference.html Then go to the Tutorial page for more advanced details: http://www.regular-expressions.info/tutorial.html

Adi Inbar
  • If you use this expression to split the string then won't you lose the period and more importantly the first character from the next line in the split process? It might be better to use a look-ahead like `\.\s+(?=[A-Z"\'])`. – Ro Yo Mi Aug 05 '13 at 14:47
  • I'd think so - that would seem more useful to me and seems to be implied, but he said that his regex *was* working as he expected for "half of 1" (i.e. the whole thing, because #2 was a specific case of what he doesn't want to match, which doesn't need to be specified). The only thing missing was the escaped single quote. I figured if that wasn't what he wanted he could come back and say so, but that would contradict what he said about having solved splitting on a period followed by whitespace followed by a capital. But I'll edit the answer to include all the options for completeness. – Adi Inbar Aug 05 '13 at 18:30
  • Hi, the regex '\.\s+([A-Z“\'])' does the sentence splitting, but I notice the following errors: it splits on [0-9].[A-Z], and it does not split on a '.' that is followed by “ and then [A-Z]. For example: "Under the direction of His Father, He was the creator of the earth. “All things were made by him; and without him was not any thing made that was made” (John 1:3)." Sorry for the delay. Thanks @All. I tried the look-ahead expression and got: sre_constants.error: invalid group reference. – Iykeln Aug 05 '13 at 19:12
  • It's not clear what you mean by the example, because you didn't specify what happens. Can you add examples to your question and specify what result set you get? That regex absolutely should not be splitting on number-period-capital, because at least one whitespace character is *required* after the period. Also, note the warning I gave about using capture groups in splits. Why are you putting the character class in a capture group? It should be in a lookahead assertion. Try my #2 regex. – Adi Inbar Aug 05 '13 at 19:28
  • Oh, I just noticed that you're matching opening curly quotes, not straight quotes. Try changing the opening double quote to `\u201c` and the single quote to `\u2018` (just one backslash). But definitely also change the capture group into a lookahead assertion. – Adi Inbar Aug 05 '13 at 19:35
  • Regex #2 raises an error: `raise error, v # invalid expression` and sre_constants.error: invalid group reference. The '.' followed by “[A-Z], as in ...earth. “All..., isn't split; the text from “All... is supposed to start a new sentence, @Adi Inbar. You can try it on the text below. – Iykeln Aug 05 '13 at 22:13
  • As we commemorate the birth of Jesus Christ two millennia ago, we offer our testimony of the reality of His matchless life and the infinite virtue of His great atoning sacrifice. None other has had so profound an influence upon all who have lived and will yet live upon the earth. He was the Great Jehovah of the Old Testament, the Messiah of the New. Under the direction of His Father, He was the creator of the earth. “All things were made by him; and without him was not any thing made that was made” (John 1:3). – Iykeln Aug 05 '13 at 22:21
  • But which regex are you using? The one in your comment is no good, and the ones in my answer were for the straight quotes you had in your description, not the curly quotes you have in the examples in the comments. It should be `'\.\s+(?=[A-Z\u2018\u201c])'`, or if you also want to split on straight quotes, make that `'\.\s+(?=[A-Z\u2018\u201c\'"])'` – Adi Inbar Aug 05 '13 at 22:48
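
The `invalid group reference` error reported above is consistent with leaving `\g<1>` in the replacement string after the capture group was turned into a lookahead, since a lookahead captures nothing. That is an assumption, because the exact failing call was not posted. A minimal Python 3 sketch of the substitution with a lookahead and curly quotes, using a shortened version of the sample text:

```python
import re

text = ("He was the creator of the earth. \u201cAll things were made by him\u201d "
        "(John 1:3). None other has had so profound an influence.")

# Lookahead for a capital, an opening curly quote (U+2018, U+201C),
# or a straight quote. Python 3's re understands \uXXXX in patterns.
boundary = re.compile(r'\.\s+(?=[A-Z\u2018\u201c\'"])')

# Referring to group 1 fails, because the pattern has no capture group.
try:
    boundary.sub(r'.\n\g<1>', text)
    group_error = False
except re.error:
    group_error = True

# Without the group reference: put the period back and add a newline.
sentences = boundary.sub('.\n', text).split('\n')

print(group_error)     # True
print(len(sentences))  # 3
```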