
I am working on an HTML file that contains Item 1, Item 2, and Item 3. I want to delete all the text that comes after the LAST Item 2. There may be more than one Item 2 in the file. I am using this, but it does not work:

>>> import re
>>> text = """<A href="#106">Item&nbsp;2. <B>Item&nbsp;2. Properties</B> this is an example this is an example"""
>>> a = re.search('(?<=<B>)Item&nbsp;2.', text)
>>> b = a.group(0)
>>> newText = text.partition(b)[0]
>>> newText
'<A href="#106">'

This cuts the text at the first Item 2, not the second one.

mehrblue

1 Answer


I'd use BeautifulSoup to parse the HTML and modify it. You might want to use the decompose() or extract() method.

BeautifulSoup is nice because it's pretty good at parsing malformed HTML.

For your specific example:

>>> import bs4
>>> text = """<A href="#106">Item&nbsp;2. <B>Item&nbsp;2. Properties</B> this is an example this is an example"""
>>> soup = bs4.BeautifulSoup(text, "lxml")
>>> soup.b.next_sibling.extract()
u' this is an example this is an example'
>>> soup
<html><body><a href="#106">Item 2. <b>Item 2. Properties</b></a></body></html>
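
The session above removes only the text right after the first <b> tag. A rough sketch of how the same idea might extend to the LAST Item 2 follows. It assumes every Item 2 heading sits inside a <b> tag (as in the sample), uses the built-in "html.parser" backend, and the helper name truncate_after_last_item2 is made up for illustration:

```python
import re
from bs4 import BeautifulSoup

# Matches "Item 2." whether the gap is a plain or a non-breaking space.
ITEM2 = re.compile(r"Item\s*2\.")

def truncate_after_last_item2(html):
    """Hypothetical helper: drop everything after the last <b> holding Item 2."""
    soup = BeautifulSoup(html, "html.parser")
    hits = [b for b in soup.find_all("b") if ITEM2.search(b.get_text())]
    if not hits:
        return str(soup)  # no Item 2 heading found, leave the markup alone
    node = hits[-1]
    # Walk up from the last match; at each level remove the siblings that
    # follow, so the enclosing tags stay but all later content goes.
    while node is not None:
        for sibling in list(node.next_siblings):
            sibling.extract()
        node = node.parent
    return str(soup)

sample = ("<p><b>Item&nbsp;2. Properties</b> first</p>"
          "<p><b>Item&nbsp;2. Properties</b> second</p>"
          "<p>Item 3</p>")
result = truncate_after_last_item2(sample)
```

Here result keeps the first paragraph whole and the second heading, while " second" and the Item 3 paragraph are dropped. If the headings are not in <b> tags in the real file, the find_all call would need adjusting.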

If you really wanna use regular expressions, a non-greedy regex matches everything up to the first Item&nbsp;2. in your example (a greedy .* would run to the last one instead):

>>> import re
>>> text = """<A href="#106">Item&nbsp;2. <B>Item&nbsp;2. Properties</B> this is an example this is an example"""
>>> m = re.match(r".*?Item&nbsp;2\.", text)
>>> m.group(0)
'<A href="#106">Item&nbsp;2.'
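
Since the question asks for the LAST Item 2, here is a minimal sketch of two plain-string ways to keep everything up to and including the last occurrence. It assumes the marker always appears literally as Item&nbsp;2. in the raw HTML; the variable names are only illustrative:

```python
import re

text = ('<A href="#106">Item&nbsp;2. <B>Item&nbsp;2. Properties</B> '
        'this is an example this is an example')
marker = "Item&nbsp;2."

# Option 1: str.rpartition splits at the LAST occurrence of the marker.
head, sep, _tail = text.rpartition(marker)
up_to_last = head + sep if sep else text  # marker missing: keep text as-is

# Option 2: a greedy ".*" backtracks from the end, so the match also ends
# at the last occurrence. re.DOTALL lets "." cross newlines in a real file.
m = re.match(r".*Item&nbsp;2\.", text, flags=re.DOTALL)
greedy = m.group(0) if m else text
```

Both up_to_last and greedy end at the second Item&nbsp;2. here. Note that this leaves the <B> tag unclosed, which is one reason a parser-based approach may be safer for real files.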
dhui