I am learning regex by experimenting on HTML files, and I have a regex problem.

My text is:

text='12<a>1<a>2</a>3</a>13<a>4<a>5</a>6</a>14'    

The `<a><a></a></a>` expression is nested.

I want to write a regex that can handle nested expressions. For example, my output for the above text should be

Output: 121314

I use this regex:

re.sub('<a>(.+?)</a>', '', text, flags=re.DOTALL)

and I get this output:

'123</a>136</a>14'

This is because the regex cannot handle nested expressions: the lazy `(.+?)` stops at the first `</a>` it finds, so each outer closing tag is left behind.

Sam
  • `I am learning regex with experimenting on HTML files`, bad idea.. – Avinash Raj Oct 27 '15 at 08:59
  • How about extracting only digits `re.sub('(\d+)', '', text, flags=re.DOTALL)` – Tushar Oct 27 '15 at 08:59
  • Read http://stackoverflow.com/a/1732454/194635 – Stas Oct 27 '15 at 09:14
  • you should try lxml or pyquery – Mithril Oct 27 '15 at 09:17
  • That's so crazy each time there's regex and html in a post someone link to this. I agree parsing html with regex is a bad idea, but here the OP is asking to learn about regex, mainly a recursive regex in fact... – Tensibai Oct 27 '15 at 09:24
  • I can't see anything wrong with using regex to parse HTML, as long as the task is easy enough for regex to handle. I also can't see anything wrong with learning regex by experimenting on HTML. It stimulates so much creativity. – Khoi Oct 27 '15 at 09:45
  • @Khoi Actually, a recursive regex on brackets or anything opening/closing is doable; with HTML tags it gets harder by nature, as the "separators" are multiple characters long. – Tensibai Oct 27 '15 at 09:47
  • @Tensibai are you saying that it's hard because it's HTML? Again, solving hard problems brings greater reward. – Khoi Oct 27 '15 at 09:51
  • Python `re` regex module does not support recursive regex. You need to write your own parser here. With HTML strings, it means you should be using an HTML parser. – Wiktor Stribiżew Oct 27 '15 at 09:53
  • @Khoi just saying it's hard because HTML tags are not single char like brackets or parentheses, so avoiding them inside the tags is harder. That's all – Tensibai Oct 27 '15 at 09:57
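
To illustrate the parser suggestion in the comments above, here is a minimal sketch using `html.parser.HTMLParser` from the Python standard library. The names `OutsideAnchorText` and `strip_nested_a` are hypothetical, chosen only for this example. It assumes the goal is to keep only the text that sits outside every `<a>` element, which is what the expected output in the question suggests; it is not a full HTML cleaning solution.

```python
from html.parser import HTMLParser


class OutsideAnchorText(HTMLParser):
    """Collects text that is not inside any <a> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.depth = 0   # how many <a> elements are currently open
        self.parts = []  # text chunks seen while depth == 0

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.depth += 1

    def handle_endtag(self, tag):
        # Ignore a stray </a> that has no matching opening tag.
        if tag == "a" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth == 0:
            self.parts.append(data)


def strip_nested_a(text):
    parser = OutsideAnchorText()
    parser.feed(text)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.parts)


text = '12<a>1<a>2</a>3</a>13<a>4<a>5</a>6</a>14'
print(strip_nested_a(text))
```

Because the nesting depth is tracked with a counter, this sketch does not depend on how deeply the `<a>` tags are nested, and other tags inside an `<a>` element are skipped along with its text.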

2 Answers

1

How about this?

while re.search(r'<a>\d*</a>', text):
    text = re.sub(r'<a>\d*</a>', '', text)
Khoi
  • Thanks, your answer handles this scenario perfectly. The problem is that I simplified the tags considerably before posting. There can be digits, characters, even tags like between and . – Sam Oct 27 '15 at 10:08
  • @Sam In this case follow the previous advice: don't use regex but a parser. – Tensibai Oct 27 '15 at 10:13
  • Actually, I am trying to fetch keywords from a Wikipedia page that are bold, italic, hyperlinked, headings, and so on; at the same time I have to avoid the images, contents and tables from the Wikipedia page. I am not sure if this can be achieved by using a parser. – Sam Oct 27 '15 at 10:19
  • @Sam In that case you should have posted the full problem, albeit with a carefully arranged scenario. I hope that helped you to some extent. – Khoi Oct 28 '15 at 01:19
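
As a sketch of how the loop in this answer could be extended beyond digits, the pattern below matches only an innermost `<a>...</a>` pair, meaning one whose content contains no further `<a>`, and removes such pairs repeatedly until none are left. The names `INNERMOST_A` and `remove_nested_a` are hypothetical. It assumes the tags are the literal strings `<a>` and `</a>` with no attributes, as in the question; for real HTML a parser remains the safer choice.

```python
import re

# An <a>, then any characters that do not start another <a>, then </a>.
# The negative lookahead (?!<a>) is what restricts the match to innermost pairs.
INNERMOST_A = re.compile(r"<a>(?:(?!<a>).)*?</a>", re.DOTALL)


def remove_nested_a(text):
    while True:
        # subn returns the new string and the number of replacements made.
        text, count = INNERMOST_A.subn("", text)
        if count == 0:
            return text


text = '12<a>1<a>2</a>3</a>13<a>4<a>5</a>6</a>14'
print(remove_nested_a(text))
```

Each pass peels off one level of nesting, so the number of passes grows with the nesting depth rather than with the number of tags.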
-1
re.sub(r"\b\d{0,1}\b<\/?a>\b\d{0,1}\b", r"", text)
Mayur Koshti