Python multiline + multitag regex - need solution

Question

I need help extracting a multiline tag that contains other tags. For example:

<div class="box_update_userdetails_upate">50% discount 4 our members for the items that r put 4 sale.<br />
Send<br />
Join 4sale<br />
9219592195</div>

<div class="box_update_userdetails_upate">Big Offr 4 Our Grp MemBrs:<br />
Jst Add Ur 5 Frns and Gain a Recharge Of 20rs In ur Mob no.<br />
Details<br />
9496360235<br />
addfrn</div>

There may be many <br /> tags or newlines in the data. I need to extract everything written between <div class="box_update_userdetails_upate"> and </div>, including all the <br /> tags; excluding the <br /> tags would work too.

I tried using "<div class="box_update_userdetails_upate">(.+?)</div>", but that doesn't work for all cases. It only works if there is no newline or break tag in between.

Using regex for parsing HTML is evil; use [HTML parsers](http://stackoverflow.com/questions/328356/extracting-text-from-html-file-using-python) instead. – alecxe, Sep 12 '13 at 19:18
If it needs to work in the general case, you need an HTML/XML parser; otherwise: http://stackoverflow.com/a/1732454/2536029 – mnagel, Sep 12 '13 at 19:21
^Funny comment, but true. Regexes are annoying anyway, so why use them when more advanced tools exist for the job you are trying to do? – Shashank, Sep 12 '13 at 19:25
Okay, I understand. Though I need a kind of perfect answer. :) Thanks to @all – Lady, Sep 12 '13 at 19:29

score 0 · Answer 1 · answered Sep 12 '13 at 19:40

I think what you are looking for is this.

"<div class=\"box_update_userdetails_upate\">(.|\n)*</div>"

The group in the middle matches all characters between the two div tags. Your main problem was that . does not match newlines by default in Python regex. Note that if you have nested divs, for example <div>...<div>...</div>...</div>, or several divs in the same text, the * operator is greedy, so it will capture as much text as possible. In other words, it will run until the last </div> it can find.
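As a rough sketch of the two points above (newlines and greediness), one alternative is to pass re.DOTALL so that . also matches newlines, and to use the non-greedy *? so each match stops at the first </div>. This is not the pattern from the answer itself; the sample string and the helper name strip_breaks are made up for illustration, and Python 3 is assumed. Note also that a repeated group such as (.|\n)* keeps only its last iteration, so a single group around .*? is used here instead.

```python
import re

# Sample input modelled on the question's markup (shortened for illustration).
html = (
    '<div class="box_update_userdetails_upate">50% discount<br />\n'
    'Send<br />\n'
    '9219592195</div>\n'
    '\n'
    '<div class="box_update_userdetails_upate">Big Offr<br />\n'
    'addfrn</div>\n'
)

# re.DOTALL makes "." match newlines; "*?" stops at the first closing tag.
pattern = re.compile(
    r'<div class="box_update_userdetails_upate">(.*?)</div>',
    re.DOTALL,
)

# One string per div, still containing the <br /> tags.
blocks = pattern.findall(html)


def strip_breaks(block):
    """Hypothetical helper: split a block on <br /> and drop empty pieces."""
    parts = re.split(r'<br\s*/?>', block)
    return [p.strip() for p in parts if p.strip()]


cleaned = [strip_breaks(b) for b in blocks]
```

This still assumes the target divs are not nested; with nested divs the non-greedy match would stop at the inner </div>.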

answered Sep 12 '13 at 19:40

Shashank

It doesn't work either. The reason is that it breaks whenever there is a <br /> tag in between, and it is only able to retrieve text where there is no break tag or any other tag. – Lady Sep 13 '13 at 10:04

score 0 · Answer 2 · answered Sep 13 '13 at 05:13

To refer to a famous answer on here: using regular expressions to parse HTML is just bad.

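import re

def extract(starttag, endtag, text):
    # Greedy match from the first starttag to the last endtag; DOTALL lets . span newlines.
    pattern = re.compile(r'{a}(.*){b}'.format(a=starttag, b=endtag), re.IGNORECASE | re.DOTALL)
    return pattern.search(text).group(1)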

This should handle multiple div tags. However, the output will include the next instance of the div tag, but a simple replace would take care of that.
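Since this answer opens by saying that regex is a poor fit for HTML, here is a minimal parser-based sketch using only the standard library's html.parser. The class name BoxExtractor and the shortened sample string are hypothetical, chosen for illustration, and Python 3 is assumed. It collects the text pieces of each target div and skips the <br /> tags.

```python
from html.parser import HTMLParser


class BoxExtractor(HTMLParser):
    """Hypothetical helper: collects the text of every div with a given class."""

    def __init__(self, target_class):
        super().__init__()
        self.target_class = target_class
        self.depth = 0        # > 0 while inside a target div; counts nested divs
        self.current = []     # text pieces of the div being read
        self.results = []     # one list of text pieces per target div

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return            # <br /> and other tags are ignored
        if self.depth:
            self.depth += 1   # nested div inside a target div
        elif dict(attrs).get("class") == self.target_class:
            self.depth = 1
            self.current = []

    def handle_endtag(self, tag):
        if tag == "div" and self.depth:
            self.depth -= 1
            if self.depth == 0:
                self.results.append(self.current)

    def handle_data(self, data):
        if self.depth and data.strip():
            self.current.append(data.strip())


# Sample input modelled on the question's markup (shortened for illustration).
html = (
    '<div class="box_update_userdetails_upate">50% discount<br />\n'
    'Send<br />\n'
    '9219592195</div>\n'
    '\n'
    '<div class="box_update_userdetails_upate">Big Offr<br />\n'
    'addfrn</div>\n'
)

parser = BoxExtractor("box_update_userdetails_upate")
parser.feed(html)
parser.close()
```

After feeding the sample, parser.results holds one list of text lines per matching div. Because the parser counts nested divs, it does not depend on greedy or non-greedy matching to find the right closing tag.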

answered Sep 13 '13 at 05:13

riyoken

The answer I am referring to is the top-voted one on here: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags?rq=1 – riyoken Sep 13 '13 at 05:20
