
I'm writing a Python script that parses EPUB 2 files, and I'm trying to split words, sentences, and paragraphs into their own objects. Words and paragraphs work, but sentences are a problem, because sometimes a sentence ends with ". . ." as the delimiter. I'm parsing character by character, so as soon as I hit a ".", "!", or "?" my system counts it as the end of a sentence. I was thinking of writing some complex if statements that look at the previous character to see whether it was a space or a sentence delimiter, but nothing I've tried works. Any advice would be greatly appreciated. I should mention that I'm not using regex, nor will I, because it would not work with this system.

Here is the code I've been trying to use:

def add_until(self):

    char_list = []
    end_sentence = False

    for char in self.source:

        if isinstance(char, Character) or isinstance(char, EntityRef):
            char_list.append(char)

            if len(char_list) >= 2 and char_list[-2].is_whitespace or len(char_list) >= 2 and char_list[-2].split_sent and char.is_whitespace or char.split_sent: 
                  char_list.append(char)


            if len(char_list) >= 2 and char_list[-2].is_whitespace and char.split_sent == False and char.is_whitespace == False:
                 char_list.pop()  # pops the last space off because it should be part of the next sentence.
jsucsy
AlexW.H.B.

1 Answer


You need greedy string matching: at each position, try the longest stop token first. To do this sort of thing, I usually slice the string into chunks and iterate over them, shrinking each chunk if necessary. With your example:

source = """This is a sentence... This is a second sentence.
         Is this a sentence? Sure it is!!!"""

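stop = ('...', '.', '?', '!', '!!!')

i = 0
M = max(len(s) for s in stop)  # length of the longest stop token
L = len(source)

while i < L:
    m = M
    # Try the longest chunk first, then shrink it until it matches.
    while m > 0:
        chunk = source[i:i + m]
        if chunk in stop:
            print("end of sentence with: %s" % chunk)
            break
        m -= 1
    else:
        # No stop token starts here: advance by one character.
        m = 1
    i += m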

This outputs:

end of sentence with: ...
end of sentence with: .
end of sentence with: ?
end of sentence with: !!!

You may also want to check whether the first non-blank char after the "end-of-sentence" token is uppercase (or a digit).
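As a rough sketch of that check (not part of the original answer), the greedy matcher can be wrapped in a function that only closes a sentence when the text ends or the next non-blank character is uppercase or a digit. The name `split_sentences` and the `STOP` tuple are made up for illustration:

```python
STOP = ('...', '!!!', '.', '?', '!')  # hypothetical token list

def split_sentences(text, stop=STOP):
    """Split text after stop tokens, matching the longest token first."""
    longest = max(len(s) for s in stop)
    sentences = []
    start = 0  # index where the current sentence begins
    i = 0
    while i < len(text):
        # Greedy match: try the longest chunk that still fits.
        for m in range(min(longest, len(text) - i), 0, -1):
            if text[i:i + m] in stop:
                break
        else:
            i += 1  # no stop token starts here
            continue
        i += m
        # Only split if the next non-blank char is uppercase or a digit.
        rest = text[i:].lstrip()
        if not rest or rest[0].isupper() or rest[0].isdigit():
            sentences.append(text[start:i].strip())
            start = i
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences
```

With this, "Wait... what? Yes." stays as two sentences, because the ellipsis is followed by a lowercase letter. Abbreviations followed by a capitalised word (for example "Mr. Smith") would still be split, so this is only a heuristic.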

Edit

A sample preprocessor for stripping unneeded blanks:

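def read(source):
    """Yield characters, dropping blanks that precede '.', '?' or '!'."""
    had_blank = False
    for char in source:
        if char == ' ':
            # Remember the blank; emit it only if a word follows.
            had_blank = True
        else:
            if had_blank and char not in '.?!':
                # Collapse the pending run of blanks into one space.
                yield ' '
                yield char
                had_blank = False
            else:
                # Punctuation (or no pending blank): emit as-is.
                yield char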

Using it:

>>> source = "Sentence1  .. . word1    word2.    . .  word other_word  . .   ."
>>> ''.join(c for c in read(source))
'Sentence1... word1 word2... word other_word...'
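Since the question parses character by character, the normalised stream can also be consumed one character at a time. The following is a minimal sketch, assuming that any run of '.', '?' or '!' ends a sentence; the `sentences` function is hypothetical, and `read` is repeated here only so the block runs on its own:

```python
def read(source):
    # Same logic as the preprocessor above: drop blanks before
    # punctuation, collapse other runs of blanks into one space.
    had_blank = False
    for char in source:
        if char == ' ':
            had_blank = True
        elif had_blank and char not in '.?!':
            yield ' '
            yield char
            had_blank = False
        else:
            yield char

def sentences(source):
    """Yield sentences; a run of '.', '?' or '!' ends one (assumption)."""
    current = []
    in_stop_run = False
    for char in read(source):
        # The first non-stop char after a stop run starts a new sentence.
        if in_stop_run and char not in '.?!':
            yield ''.join(current).strip()
            current = []
        in_stop_run = char in '.?!'
        current.append(char)
    tail = ''.join(current).strip()
    if tail:
        yield tail
```

Because the blanks inside ". . ." are removed before the splitter sees them, the spaced-out ellipsis needs no special case. A decimal number such as "1.5" would be split by this sketch, so a real parser would need an extra check there.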
michaelmeyer
  • Excellent, thanks for your response. I like this solution, though there is one thing it might not handle: I've run into ellipses with a space between each period. Would it still work if I add that to the stop list? – AlexW.H.B. Jun 06 '13 at 19:58
  • Yes, but you might then end up with a lot of different "stop" tokens if you want to cover all the possible cases (like ". . .", ". ..", ".. .", etc.). An alternative would be to use a "reader" function that yields one character at a time and signals blanks. – michaelmeyer Jun 06 '13 at 20:13
  • That's true. How would that look in practice? Would you put the source into a generator so you can call somevar.next() and test whether it's a space, a character, or a sentence break? – AlexW.H.B. Jun 06 '13 at 20:23
  • You are my hero. No joke, I've been trying to think of this type of solution for over a day now. Thank you so much! – AlexW.H.B. Jun 06 '13 at 21:17