How to eliminate a specific part of an HTML file in Python

Question

I am working on an HTML file that contains Item 1, Item 2, and Item 3. I want to delete all the text that comes after the LAST Item 2. There may be more than one Item 2 in the file. I am using the code below, but it does not work:

>>> import re
>>> text = """<A href="#106">Item&nbsp;2. <B>Item&nbsp;2. Properties</B> this is an example this is an example"""
>>> a = re.search('(?<=<B>)Item&nbsp;2.', text)
>>> b = a.group(0)
>>> newText = text.partition(b)[0]
>>> newText
'<A href="#106">'

It cuts the text at the first Item 2, not at the second one.

Could you please show the string you expect in your question? — nio, Jul 27 '13 at 19:31
Please read the highest voted answer here: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags — Hyperboreus, Jul 27 '13 at 19:36

score 1 · Answer 1 · answered Jul 28 '13 at 07:25

I'd use BeautifulSoup to parse the HTML and modify it. You might want to use the decompose() or extract() method.

BeautifulSoup is nice because it's pretty good at parsing malformed HTML.

For your specific example:

>>> import bs4
>>> text = """<A href="#106">Item&nbsp;2. <B>Item&nbsp;2. Properties</B> this is an example this is an example"""
>>> soup = bs4.BeautifulSoup(text)
>>> soup.b.next_sibling.extract()
u' this is an example this is an example'
>>> soup
<html><body><a href="#106">Item 2. <b>Item 2. Properties</b></a></body></html>

If you really want to use regular expressions, a non-greedy regex would work for your example:

>>> import re
>>> text = """<A href="#106">Item&nbsp;2. <B>Item&nbsp;2. Properties</B> this is an example this is an example"""
>>> m = re.match(r".*?Item&nbsp;2\.", text)
>>> m.group(0)
'<A href="#106">Item&nbsp;2.'
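Note that the non-greedy match above stops at the first Item 2. If the goal is to keep everything up to and including the LAST Item 2, here is a rough sketch using only the standard library. The helper names truncate_after_last and truncate_after_last_re are made up for illustration, and the sketch assumes the marker appears in the raw HTML literally as "Item&nbsp;2.":

```python
import re

text = ('<A href="#106">Item&nbsp;2. <B>Item&nbsp;2. Properties</B> '
        'this is an example this is an example')

# Assumption: the marker is spelled exactly like this in the raw HTML.
MARKER = "Item&nbsp;2."


def truncate_after_last(s, marker):
    """Hypothetical helper: cut s right after the last occurrence of marker."""
    idx = s.rfind(marker)  # index of the last occurrence, or -1 if absent
    if idx == -1:
        return s  # marker not found: leave the text unchanged
    return s[:idx + len(marker)]


def truncate_after_last_re(s, marker):
    """Same idea with a greedy regex instead of str.rfind."""
    # Greedy ".*" consumes as much as possible and backtracks from the end,
    # so the match ends at the last marker. re.DOTALL lets "." cross
    # newlines, which a real HTML file will contain.
    m = re.match(".*" + re.escape(marker), s, re.DOTALL)
    return m.group(0) if m else s


print(truncate_after_last(text, MARKER))
# <A href="#106">Item&nbsp;2. <B>Item&nbsp;2.
```

Both helpers work on the raw string, so the result can contain unclosed tags (the <B> above is left open). If that matters, parsing the result with BeautifulSoup afterwards is probably the safer route.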
