I'm writing a Python script that parses EPUB 2 files, and I'm trying to split words, sentences, and paragraphs into their own objects. Words and paragraphs work, but sentences are the problem, because sometimes a sentence ends with ". . ." as its delimiter. I'm parsing character by character, so as soon as I hit a ".", "!", or "?", my system counts it as the end of a sentence. I was thinking about writing some more complex if statements that look at the previous character to see whether it was a space or a sentence delimiter, but nothing I've tried has worked. Any advice would be greatly appreciated. One thing I should mention: I'm not using regex, nor will I, because it would not work with this system.
Here is the code I've been trying to use:
    def add_until(self):
        char_list = []
        end_sentence = False
        for char in self.source:
            if isinstance(char, (Character, EntityRef)):
                char_list.append(char)
                # Parentheses only make the original and/or precedence explicit.
                if ((len(char_list) >= 2 and char_list[-2].is_whitespace)
                        or (len(char_list) >= 2 and char_list[-2].split_sent and char.is_whitespace)
                        or char.split_sent):
                    char_list.append(char)
                if (len(char_list) >= 2 and char_list[-2].is_whitespace
                        and not char.split_sent and not char.is_whitespace):
                    char_list.pop()  # Pop the last space off because it should be part of the next sentence.
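To make the behavior I'm after concrete, here is a minimal standalone sketch of the look-back idea, written against a plain string instead of my Character/EntityRef objects. The names `split_sentences`, `TERMINATORS`, and `pending` are made up for this illustration and are not part of my parser. It assumes that whitespace following a terminator is held back until the next character shows whether the run continues (as in ". . .") or a new sentence starts, in which case the held whitespace goes to the next sentence. It does not try to handle abbreviations or decimals such as "3.14".

```python
TERMINATORS = {".", "!", "?"}  # hypothetical stand-in for char.split_sent


def split_sentences(text):
    """Split text into sentences one character at a time, without regex."""
    sentences = []
    current = []              # characters of the sentence being built
    pending = []              # whitespace seen after a terminator, not yet assigned
    after_terminator = False  # True once the current sentence has hit . ! or ?

    for ch in text:
        if ch in TERMINATORS:
            # A terminator after held whitespace (". . .") continues the same
            # run, so the held whitespace belongs to the current sentence.
            current.extend(pending)
            pending = []
            current.append(ch)
            after_terminator = True
        elif ch.isspace() and after_terminator:
            # Do not decide yet: hold the whitespace until the next character.
            pending.append(ch)
        elif after_terminator:
            # An ordinary character after a terminator run ends the sentence;
            # the held whitespace starts the next one.
            sentences.append("".join(current))
            current = pending + [ch]
            pending = []
            after_terminator = False
        else:
            current.append(ch)

    # Flush the last sentence; trailing held whitespace is dropped.
    if current:
        sentences.append("".join(current))
    return sentences
```

With this sketch, `split_sentences("Wait . . . What? Yes!")` gives `["Wait . . .", " What?", " Yes!"]`, so the spaced-out ellipsis stays in one sentence and each following sentence keeps its leading space.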