0

I need help extracting a multiline tag that contains other tags. For example:

<div class="box_update_userdetails_upate">50% discount 4 our members for the items that r put 4 sale.<br />
Send<br />
Join 4sale<br />
9219592195</div>

<div class="box_update_userdetails_upate">Big Offr 4 Our Grp MemBrs:<br />
Jst Add Ur 5 Frns and Gain a Recharge Of 20rs In ur Mob no.<br />
Details<br />
9496360235<br />
addfrn</div>

There may be many <br /> tags or newlines in the data. I need to extract everything between <div class="box_update_userdetails_upate"> and </div>, including all the <br /> tags; getting the text without the <br /> tags would work too.

I tried using "<div class="box_update_userdetails_upate">(.+?)</div>", but that doesn't work for every block. It only works when there is no newline or break tag in between.

Lady
  • 2
    Using regex for parsing html is evil, use [html parsers](http://stackoverflow.com/questions/328356/extracting-text-from-html-file-using-python) instead. – alecxe Sep 12 '13 at 19:18
  • 1
    if it needs to work in the general case, you need an html/xml parser, else: http://stackoverflow.com/a/1732454/2536029 – mnagel Sep 12 '13 at 19:21
  • ^Funny comment but true. regex's are annoying anyway so why use them when more advanced tools for the job you are trying to do exist? – Shashank Sep 12 '13 at 19:25
  • Okay, i understand. Though, i need kinda perfect answer. :) Thanks to @all – Lady Sep 12 '13 at 19:29

2 Answers

0

I think what you are looking for is this.

"<div class=\"box_update_userdetails_upate\">(.|\n)*</div>"

The group in the middle matches every character between the two tags, including newlines. Your main problem was that . does not match newlines by default in Python regex. Note that the * operator is greedy, so it captures as much text as possible: if you have nested divs, for example <div>...<div>...</div>...</div>, or several matching divs one after another, the match runs up to the last </div> it is able to find.
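As a rough sketch of a variant that sidesteps the greedy match, one could combine the lazy quantifier from the question with the re.DOTALL flag, so that . also matches newlines and each match stops at the nearest </div>. This assumes the target divs are never nested; the sample string is a shortened copy of the question's data, and the names html, pattern, blocks and cleaned are only illustrative.

```python
import re

# Shortened sample modelled on the two blocks in the question.
html = (
    '<div class="box_update_userdetails_upate">50% discount<br />\n'
    'Send<br />\n'
    '9219592195</div>\n'
    '\n'
    '<div class="box_update_userdetails_upate">Big Offr<br />\n'
    'addfrn</div>\n'
)

# re.DOTALL lets "." match newlines; "+?" is lazy, so each match
# ends at the first </div> instead of the last one.
pattern = re.compile(
    r'<div class="box_update_userdetails_upate">(.+?)</div>',
    re.DOTALL,
)

# One string per div, still containing the <br /> tags.
blocks = pattern.findall(html)

# Optional: drop the <br /> tags and keep the non-empty lines.
cleaned = [
    [line.strip()
     for line in re.sub(r'<br\s*/?>', '', block).splitlines()
     if line.strip()]
    for block in blocks
]

print(cleaned)
```

If the <br /> tags should be kept, blocks can be used directly and the cleanup step skipped.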

Shashank
  • It doesn't work either. It breaks whenever there is a <br /> tag in between, and it can only retrieve text that contains no break tag or any other tag.
    – Lady Sep 13 '13 at 10:04
0

To refer to a famous answer on here: using regular expressions to parse HTML is just bad. That said, here is a regex-based helper:

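import re

def extract(starttag, endtag, text):
    # DOTALL lets "." match newlines; the greedy (.*) runs up to the last endtag in text
    pattern = re.compile(r'{a}(.*){b}'.format(a=starttag, b=endtag), re.IGNORECASE | re.DOTALL)
    return pattern.search(text).group(1)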

This should handle multiple div tags. However, because the match is greedy, the output will also include the next instances of the div tag, but a simple replace would take care of that.
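For comparison, here is a minimal sketch of the parser-based route that the comments recommend, using only html.parser from the Python 3 standard library. The class name BoxExtractor and the sample string are made up for illustration and are not part of the original answer. It collects the text of each matching div and simply ignores the <br /> tags.

```python
from html.parser import HTMLParser


class BoxExtractor(HTMLParser):
    """Collects the text lines of every div with the target class."""

    TARGET_CLASS = "box_update_userdetails_upate"

    def __init__(self):
        super().__init__()
        self.blocks = []   # one list of text lines per matching div
        self._depth = 0    # div nesting depth inside a match (0 = outside)

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return  # <br /> and other tags are ignored
        if self._depth:
            self._depth += 1  # nested div inside a matching block
        elif self.TARGET_CLASS in (dict(attrs).get("class") or "").split():
            self._depth = 1
            self.blocks.append([])

    def handle_endtag(self, tag):
        if tag == "div" and self._depth:
            self._depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if self._depth and text:
            self.blocks[-1].append(text)


# Shortened sample modelled on the two blocks in the question.
html = (
    '<div class="box_update_userdetails_upate">50% discount<br />\n'
    'Send<br />\n'
    '9219592195</div>\n'
    '\n'
    '<div class="box_update_userdetails_upate">Big Offr<br />\n'
    'addfrn</div>\n'
)

parser = BoxExtractor()
parser.feed(html)
parser.close()
print(parser.blocks)
```

Unlike the regex, the depth counter keeps nested divs inside a matching block from ending the block early.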

riyoken
  • the answer i am referring to is the top voted one on here: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags?rq=1 – riyoken Sep 13 '13 at 05:20