I have been testing and looking in books and forums for hours without finding the answer, so here is the tricky question.

I am parsing an HTML file, and BeautifulSoup gives me both a plain-text version (txt) and an HTML version (html) of the same content.

Now I want to split the text into sentences (treating [?!. ]* as the end of a sentence), so I have:

sentences_txt   = re.compile("[^?!.]+?[?!. ]*").findall(txt) # this works: returns a list of sentences

and I want to build a list with the same number of sentences for the HTML counterpart, like:

sentences_html  = re.compile("[^?!.]+?[?!. ]*").findall(html) # this doesn't work 

It doesn't work because, when there are tags, it splits in the middle of a tag as soon as it finds one of the characters [?!.].
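To make the failure concrete, here is a minimal sketch. The sample string is made up (the question does not include one), and the pattern uses a greedy `+` instead of `+?`, an assumption made here so that each match spans a whole sentence rather than stopping after the first character.

```python
import re

# Hypothetical sample; not taken from the real file being parsed.
html = '<a href="page.html">Read this.</a> Then stop!'

# Greedy variant of the sentence pattern (assumption, see lead-in).
pattern = re.compile(r"[^?!.]+[?!. ]*")

pieces = pattern.findall(html)
print(pieces)
# ['<a href="page.', 'html">Read this.', '</a> Then stop!']
```

The first piece ends at the `.` inside the `href` attribute, so the tag is cut in two and the list has three items instead of the two sentences in the text.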

==> How can I split HTML text on [?!.] only when those characters are not inside a tag?

I tried some things using (?

sentences_html  = re.compile("(?:<.*>)*[^?!.]+?[?!. ]*").findall(html) # doesn't work 

sentences_html  = re.compile("(?<!<)[^?!.]+?(?!>)[?!. ]*").findall(html) # doesn't work 
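One possible direction, sketched under assumptions rather than offered as a tested solution: make the pattern consume a whole tag as a single unit, so punctuation inside `<...>` never ends a sentence. The function name `split_html_sentences` and the sample string are hypothetical. The sketch assumes no `>` appears inside attribute values and ignores comments and `<script>` blocks.

```python
import re

# Body: whole tags (<...>) or any character that is neither sentence
# punctuation nor the start of a tag.
# Tail: sentence punctuation, spaces, and closing tags that directly follow.
HTML_SENTENCE = re.compile(r"(?:<[^>]*>|[^?!.<])+(?:[?!. ]|</[^>]*>)*")

def split_html_sentences(html):
    """Split HTML into sentence-like chunks without cutting through a tag."""
    return HTML_SENTENCE.findall(html)

# Hypothetical sample; not taken from the real file being parsed.
sample = '<a href="page.html">Read this.</a> Then stop!'
chunks = split_html_sentences(sample)
print(chunks)
# ['<a href="page.html">Read this.</a> ', 'Then stop!']
```

On this sample the result has two chunks, matching the two sentences in the plain text. The chunks are not guaranteed to be well-formed HTML on their own, since an element that spans several sentences is still opened in one chunk and closed in another.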
Chris Wesseling
Romain Jouin
  • Required don't use regex comment http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags – Jakob Bowyer Feb 17 '14 at 09:02
  • @user3318273 Could you give an example string and the output you're looking for please? – Jerry Feb 17 '14 at 09:33
  • @JakobBowyer: That's my Pavlov reaction to the tag combination of [tag:html] and [tag:regex], but OP is doing the parsing with [tag:beautifulsoup]. – Chris Wesseling Apr 02 '14 at 16:02
  • @user3318273 do you expect the members of `sentences_html` to be well-formed HTML? Do you expect `sentences_html[0]` to contain the ``? – Chris Wesseling Apr 02 '14 at 16:20

0 Answers