I have been testing and looking in books and forums for hours without finding the answer, so here is the tricky question.
I am parsing an HTML file, and BeautifulSoup gives me both a plain-text (txt) and an HTML version of a text.
Now I want to split the text into sentences (treating [?!. ]* as the end of a sentence), so I have:
sentences_txt = re.compile(r"[^?!.]+[?!. ]*").findall(txt)  # this works: returns a list of sentences
and I want to build a list with the same number of sentences, but for their HTML counterpart, like:
sentences_html = re.compile(r"[^?!.]+[?!. ]*").findall(html)  # this doesn't work
It doesn't work because, when there is markup, it splits in the middle of a tag as soon as it finds one of the characters [?!.].
==> How can I split an HTML text on [?!.] only when they are not inside a tag?
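To make the failure concrete, here is a minimal reproduction. The sample strings are made up for illustration (they are not from my real file), and the quantifier is written as a greedy `+`, since a lazy `+?` followed by the optional `[?!. ]*` would let each match stop after a single character.

```python
import re

# Hypothetical sample: the href value contains a '.', which is not a sentence end.
sample_txt = 'Hello there. How are you?'
sample_html = '<a href="page.html">Hello there.</a> How are you?'

# Same pattern for both versions (greedy '+' is an assumption, see above).
pattern = re.compile(r"[^?!.]+[?!. ]*")

sentences_txt = pattern.findall(sample_txt)
sentences_html = pattern.findall(sample_html)

print(sentences_txt)    # two sentences
print(sentences_html)   # three chunks: the first split happens inside the <a ...> tag
```

The text version gives two sentences, while the HTML version is cut at the `.` of `page.html`, so the two lists no longer line up.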
I tried a few things using (?...) constructs:
sentences_html = re.compile("(?:<.*>)*[^?!.]+?[?!. ]*").findall(html) # doesn't work
sentences_html = re.compile("(?<!<)[^?!.]+?(?!>)[?!. ]*").findall(html) # doesn't work
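One direction that might work, as a rough sketch rather than a robust solution: let the pattern consume a whole tag as a single unit, so that punctuation inside `<...>` is never seen as a sentence end. The names `TAG`, `CLOSING_TAG` and `sentence_re` and the sample string are hypothetical, and the sketch assumes that attribute values contain no literal `>` and that the input has no comments or scripts.

```python
import re

# Assumption: a tag is '<', anything except '>', then '>'.
TAG = r"<[^>]*>"
CLOSING_TAG = r"</[^>]*>"

# Sentence body: whole tags or ordinary characters (never '<', '?', '!' or '.').
# Sentence end: terminators, spaces, and any closing tags that trail them.
sentence_re = re.compile(rf"(?:{TAG}|[^<?!.])+(?:[?!. ]|{CLOSING_TAG})*")

# Hypothetical sample: the '.' inside the href must not split the sentence.
sample_html = '<a href="page.html">Hello there.</a> How are you?'
sample_txt = 'Hello there. How are you?'

sentences_html = sentence_re.findall(sample_html)
sentences_txt = sentence_re.findall(sample_txt)

# Sanity check: removing the tags from each HTML sentence should give the text sentence.
stripped = [re.sub(TAG, "", s) for s in sentences_html]

print(sentences_html)
print(sentences_txt)
print(stripped == sentences_txt)
```

On this sample both lists have two items, and stripping the tags from each HTML sentence gives back the corresponding text sentence. Opening tags stay with the sentence that follows them, and closing tags that trail a terminator stay with the sentence before. A regex is not an HTML parser, so real pages may well break this.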