Handling Nested Regular Expression

Question

I am learning regex by experimenting on HTML files, and I have a regex problem.

My text is:

text='12<a>1<a>2</a>3</a>13<a>4<a>5</a>6</a>14'

The expression <a><a></a></a> is nested.

I want to write a regex that can handle nested expressions. For example, my output for the above text should be:

Output: 121314

I use this regex:

re.sub('<a>(.+?)</a>', '', text, flags=re.DOTALL)

I get this output:

'123</a>136</a>14'

This is because the regex is unable to handle nested expressions.
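To see where the leftover `</a>` tags come from, it may help to print what the lazy pattern actually matches. This is only an illustrative sketch using the text from the question; the variable names `matches` and `remaining` are introduced here for the example.

```python
import re

text = '12<a>1<a>2</a>3</a>13<a>4<a>5</a>6</a>14'

# Each lazy match starts at an outer <a> and stops at the first </a>
# it can reach, which is the closing tag of the inner pair.
matches = re.findall(r'<a>.+?</a>', text, flags=re.DOTALL)
print(matches)    # ['<a>1<a>2</a>', '<a>4<a>5</a>']

# Removing those matches leaves the outer closing tags behind.
remaining = re.sub(r'<a>.+?</a>', '', text, flags=re.DOTALL)
print(remaining)  # 123</a>136</a>14
```

The pattern has no notion of depth, so the outer opening tag gets paired with the inner closing tag.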

`I am learning regex by experimenting on HTML files`, bad idea. – Avinash Raj, Oct 27 '15 at 08:59
How about extracting only digits `re.sub('(\d+)', '', text, flags=re.DOTALL)` – Tushar, Oct 27 '15 at 08:59
It's crazy: every time regex and HTML appear in a post, someone links to this. I agree parsing HTML with regex is a bad idea, but here the OP is asking to learn about regex, mainly a recursive regex in fact... – Tensibai, Oct 27 '15 at 09:24
I can't see anything wrong with using regex to parse HTML, so long as the task is easy enough for regex to handle. I also can't see anything wrong with learning regex by experimenting on HTML. It stimulates so much creativity. – Khoi, Oct 27 '15 at 09:45
@Khoi Actually, doing a recursive regex on brackets or anything opening/closing is doable; with HTML tags it gets harder by nature, as the "separators" are made of multiple chars. – Tensibai, Oct 27 '15 at 09:47
@Tensibai are you saying that it's hard because it's HTML? Again, solving hard problems brings greater rewards. – Khoi, Oct 27 '15 at 09:51
Python's `re` regex module does not support recursive regex. You need to write your own parser here. With HTML strings, it means you should be using an HTML parser. – Wiktor Stribiżew, Oct 27 '15 at 09:53
@Khoi just saying it's hard because HTML tags are not single chars like brackets or parentheses, so avoiding them inside the tags is harder. That's all. – Tensibai, Oct 27 '15 at 09:57

score 1 · Accepted Answer · answered Oct 27 '15 at 09:10

How about this?

import re
# Repeatedly strip innermost <a>digits</a> pairs until none are left.
while re.search(r'<a>\d*</a>', text):
    text = re.sub(r'<a>\d*</a>', '', text)
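The loop works from the inside out: each pass removes the pairs that contain no further tags, which exposes the next layer for the following pass. If the content between the tags is not limited to digits, the same idea could be sketched with a pattern whose body may be anything except another `<a>` or `</a>`. The names `INNERMOST_A` and `strip_nested_a` are made up for this example, and the sketch assumes a bare `<a>` tag with no attributes, as in the question.

```python
import re

# Matches an innermost <a>...</a> pair: the body may contain anything
# except another opening or closing <a> tag.
INNERMOST_A = re.compile(r'<a>(?:(?!</?a>).)*</a>', re.DOTALL)

def strip_nested_a(text):
    """Remove <a>...</a> blocks, including nested ones, innermost first."""
    while True:
        # subn returns the new string and the number of replacements made.
        text, count = INNERMOST_A.subn('', text)
        if count == 0:
            return text

print(strip_nested_a('12<a>1<a>2</a>3</a>13<a>4<a>5</a>6</a>14'))  # 121314
```

Using `subn` avoids scanning the string twice per pass, and the loop ends as soon as a pass removes nothing.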

Khoi

Thanks, your answer handles the scenario perfectly. The problem is that I simplified the tags a great deal before posting. There can be digits, characters, even tags like between and . – Sam Oct 27 '15 at 10:08
@Sam In this case follow the previous advice: don't use regex but a parser. – Tensibai Oct 27 '15 at 10:13
Actually, I am trying to fetch keywords from a Wikipedia page that are bold, italic, hyperlinked, headings, and so on; at the same time I have to avoid the images, contents and tables from the Wikipedia page. I am not sure if it can be achieved by using a parser. – Sam Oct 27 '15 at 10:19
@Sam In that case you should have posted the full problem, albeit with a carefully arranged scenario. Hope that helped you to some extent. – Khoi Oct 28 '15 at 01:19
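For reference, here is a rough sketch of what the parser route suggested in the comments might look like with the standard library's `html.parser`. It is not taken from any of the answers: the class `DropTagContent` and the helper `text_outside` are hypothetical names, and the sketch only covers the simplified case from the question (keep the text that lies outside every `<a>` element), not the full Wikipedia task.

```python
from html.parser import HTMLParser

class DropTagContent(HTMLParser):
    """Collect the text that lies outside every element named skip_tag."""

    def __init__(self, skip_tag='a'):
        super().__init__()
        self.skip_tag = skip_tag
        self.depth = 0   # number of skip_tag elements currently open
        self.kept = []   # text fragments seen while depth is 0

    def handle_starttag(self, tag, attrs):
        if tag == self.skip_tag:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == self.skip_tag and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth == 0:
            self.kept.append(data)

def text_outside(html, skip_tag='a'):
    parser = DropTagContent(skip_tag)
    parser.feed(html)
    parser.close()   # flush any text still buffered by the parser
    return ''.join(parser.kept)

print(text_outside('12<a>1<a>2</a>3</a>13<a>4<a>5</a>6</a>14'))  # 121314
```

The depth counter is what a plain `re` pattern lacks: it tracks how many `<a>` elements are open, so nesting and tags with attributes are handled without any pattern tricks.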

score -1 · Answer 2 · answered Oct 27 '15 at 09:12

re.sub(r"\b\d{0,1}\b<\/?a>\b\d{0,1}\b", r"", text)

Mayur Koshti
