Regex, Python and HTML: looking for dots outside markup?

Question

I have been testing and looking in books and forums for hours without finding the answer, so here is the tricky question.

I am parsing an HTML file, and BeautifulSoup gives me both a plain-text version (txt) and an HTML version (html) of a text.

Now I want to split the text into sentences (treating [?!. ]* as the end of a sentence), so I have:

sentences_txt   = re.compile("[^?!.]+?[?!. ]*").findall(txt) # this works: returns a list of sentences

and I want to build a list with the same number of sentences, but for their HTML counterparts, like:

sentences_html  = re.compile("[^?!.]+?[?!. ]*").findall(html) # this doesn't work

It doesn't work because, when there is markup, the pattern splits in the middle of a tag as soon as it finds one of the characters [?!.] inside it (for example, the dot in an href attribute).
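As a minimal illustration of the problem, the sketch below lists every terminator in a string and reports whether it falls inside a tag. The sample string and the helper name find_terminators are hypothetical, and the check assumes every "<" opens a tag that is later closed by ">".

```python
import re

def find_terminators(html):
    """Return (index, char, inside_tag) for every ?, ! or . in html."""
    found = []
    for m in re.finditer(r"[?!.]", html):
        before = html[:m.start()]
        # Inside a tag if the last '<' before this point has not been closed yet.
        inside_tag = before.rfind("<") > before.rfind(">")
        found.append((m.start(), m.group(), inside_tag))
    return found

# Hypothetical sample: the dot in the href sits inside a tag.
sample = '<a href="page.html">Hello world.</a> Bye!'
for item in find_terminators(sample):
    print(item)
```

Only the terminators reported with inside_tag set to False are real sentence ends; a pattern that looks at characters alone cannot tell the two kinds apart.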

==> How can I split an HTML text on [?!.] only when those characters are not inside a tag?

I tried some things using (?

sentences_html  = re.compile("(?:<.*>)*[^?!.]+?[?!. ]*").findall(html) # doesn't work 

sentences_html  = re.compile("(?<!<)[^?!.]+?(?!>)[?!. ]*").findall(html) # doesn't work
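One possible direction, sketched under assumptions rather than offered as a definitive answer: first tokenize the HTML into whole tags and text runs, then look for terminators only in the text runs, so a tag is never cut. The helper name split_html_sentences and the sample string are hypothetical. The sketch assumes no ">" appears inside attribute values and does not treat comments or script blocks specially.

```python
import re

# A whole tag, a run of non-tag text, or a stray '<' that never closes.
TAG_OR_TEXT = re.compile(r"<[^>]*>|[^<]+|<")
# One terminator followed by any further terminators or spaces.
SENTENCE_END = re.compile(r"[?!.][?!. ]*")
# Terminators/spaces that may continue a sentence end across a closing tag.
TRAILING = re.compile(r"[?!. ]*")

def split_html_sentences(html):
    """Split html after sentence terminators found outside tags."""
    sentences, current = [], ""
    for token in TAG_OR_TEXT.findall(html):
        # True right after a sentence was emitted and nothing new has started.
        at_boundary = bool(sentences) and not current
        if token.startswith("<"):
            if at_boundary and token.startswith("</"):
                sentences[-1] += token  # closing tag stays with its sentence
            else:
                current += token        # tags are copied whole, never split
            continue
        pos = 0
        if at_boundary:
            # Leading spaces/terminators still belong to the previous sentence.
            pos = TRAILING.match(token).end()
            sentences[-1] += token[:pos]
        for end in SENTENCE_END.finditer(token, pos):
            current += token[pos:end.end()]
            sentences.append(current)
            current = ""
            pos = end.end()
        current += token[pos:]
    if current:
        sentences.append(current)
    return sentences

# Hypothetical sample input.
html = '<p>Hello <a href="page.html">world</a>. How are <b>you?</b> Fine!</p>'
sentences_html = split_html_sentences(html)
```

Joining the pieces gives back the original string, and closing tags stay attached to the sentence they follow. The number of pieces is not guaranteed to match sentences_txt for every input, so the two lists should be compared before they are paired up.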

Required don't use regex comment http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags — Jakob Bowyer, Feb 17 '14 at 09:02
@user3318273 Could you give an example string and the output you're looking for please? — Jerry, Feb 17 '14 at 09:33
@JakobBowyer: That's my Pavlov reaction to the tag combination of [tag:html] and [tag:regex], but OP is doing the parsing with [tag:beautifulsoup]. — Chris Wesseling, Apr 02 '14 at 16:02
@user3318273 do you expect the members of `sentences_html` to be well-formed HTML? Do you expect `sentences_html[0]` to contain the ``? — Chris Wesseling, Apr 02 '14 at 16:20
