parse html with regex, sometimes it doesn't work

Question

I have the following HTML, which I would like to parse with a regex. I want everything inside the td with class "postcell". I am using this code, but it returns no matches.

re.finditer('<td class="postcell">(.+?)</td>', doc)

I definitely don't want to use BeautifulSoup.

<td class="postcell">
<div>
<div class="post-text" itemprop="text">
<p>Let us consider</p>
<pre class="lang-py prettyprint"><code>x = ['1', '2', '3', '4', '5']
y = ['a', 'b', 'c', 'd', 'e']
</code></pre>
<p>How do I get the required output <code>z</code>?</p>
<pre class="lang-py prettyprint"><code>z = [('1', 'a') , ('b', '2') , ('c', '3') , ('d', '4') , ('e', '5')]
</code></pre>
</div>
<div class="post-taglist">
<a href="/questions/tagged/python" class="post-tag js-gps-track" title="show questions tagged 'python'" rel="tag">python</a> <a href="/questions/tagged/list" class="post-tag js-gps-track" title="show questions tagged 'list'" rel="tag">list</a>
</div>
<table class="fw">
<tbody><tr>
<td class="vt">
<div class="post-menu"><a href="/q/9853438" title="short permalink to this question" class="short-link" id="link-post-9853438">share</a><span class="lsep">|</span><a href="/posts/9853438/edit" class="suggest-edit-post" title="">improve this question</a></div>
</td>
<td align="right" class="post-signature">
<div class="user-info user-hover">
<div class="user-action-time">
<a href="/posts/9853438/revisions" title="show all edits to this post">edited <span title="2012-03-24 16:42:59Z" class="relativetime">Mar 24 '12 at 16:42</span></a>
</div>
<div class="user-gravatar32">
<a href="/users/35070/phihag"><div class="gravatar-wrapper-32"><img src="https://www.gravatar.com/avatar/6f92354195e8874dbee44d5c8714d506?s=32&amp;d=identicon&amp;r=PG" alt="" width="32" height="32" /></div></a>
</div>
<div class="user-details">
<a href="/users/35070/phihag">phihag</a>
<div class="-flair">
<span class="reputation-score" title="reputation score 132,147" dir="ltr">132k</span><span title="31 gold badges"><span class="badge1"></span><span class="badgecount">31</span></span><span title="252 silver badges"><span class="badge2"></span><span class="badgecount">252</span></span><span title="308 bronze badges"><span class="badge3"></span><span class="badgecount">308</span></span>
</div>
</div>
</div> </td>
<td class="post-signature owner">
<div class="user-info ">
<div class="user-action-time">
        asked <span title="2012-03-24 16:40:17Z" class="relativetime">Mar 24 '12 at 16:40</span>
</div>
<div class="user-gravatar32">
<a href="/users/1168528/karthik-reddi"><div class="gravatar-wrapper-32"><img src="https://www.gravatar.com/avatar/acce3b34402cd7646c175c273dee1616?s=32&amp;d=identicon&amp;r=PG" alt="" width="32" height="32" /></div></a>
</div>
<div class="user-details">
<a href="/users/1168528/karthik-reddi">Karthik Reddi</a>
<div class="-flair">
<span class="reputation-score" title="reputation score " dir="ltr">10</span><span title="2 bronze badges"><span class="badge3"></span><span class="badgecount">2</span></span>
</div>
</div>
</div>
</td>
</tr>
</tbody></table>
</div>
</td>

Please, [do not use regex to parse XML/HTML](http://stackoverflow.com/a/1732454/1934349). — paulotorrens, Jul 21 '16 at 02:45
It's been said here at least a million times: **Don't parse HTML or XML with regex**. Use an HTML DOM parser. I don't know why no one ever does any research here at all before asking another *Why can't I parse HTML/XML with my regex?* to find those millions of mentions of why this can't be done. It's my suspicion that those are the same people who ask why they can't fix their broken windows with a hammer. — Ken White, Jul 21 '16 at 02:45
Best comment on that answer: "I think it's time for me to quit the post of Assistant Don't Parse HTML With Regex Officer. No matter how many times we say it, they won't stop coming every day... every hour even. It is a lost cause, which someone else can fight for a bit. So go on, parse HTML with regex, if you must. It's only broken code, not life and death" From bobince. So let's quit together, guys =/ I've lost count of how many times I've said it: **Don't parse HTML or XML with regex** — Jorge Campos, Jul 21 '16 at 02:48

score 1 · Accepted Answer · answered Jul 21 '16 at 03:30

You forgot to escape the /.

re.finditer(r'<td class="postcell">(.+?)<\/td>', doc)

The other commenters are right that it's impossible to parse HTML with regex in general. For your case it might be good enough. Just know the limitations: regular expressions are blind to nesting, so if there's a </td> inside one of your post cells, your match will end early.
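
A note on the pattern above, offered as a sketch rather than a definitive diagnosis: in Python's `re` module `/` is not a special character, so escaping it should not change what matches. A more likely reason for getting nothing is that `.` does not match newlines unless `re.DOTALL` is passed, and the cell in the question spans many lines. The snippet below uses a shortened, made-up `doc` (not the real page) to show both that and the early cut-off at a nested `</td>`.

```python
import re

# Shortened, hypothetical stand-in for the page in the question.
doc = """<td class="postcell">
<div class="post-text"><p>Let us consider</p></div>
<table class="fw"><tbody><tr>
<td class="vt">share</td>
</tr></tbody></table>
</td>"""

pattern = r'<td class="postcell">(.+?)</td>'
escaped_pattern = r'<td class="postcell">(.+?)<\/td>'

# Without re.DOTALL, "." cannot cross the newline after the opening tag.
no_flag = [m.group(1) for m in re.finditer(pattern, doc)]

# With re.DOTALL, "." also matches newlines, so the cell is found.
with_flag = [m.group(1) for m in re.finditer(pattern, doc, re.DOTALL)]

# Escaping the slash gives the same result as leaving it unescaped.
escaped = [m.group(1) for m in re.finditer(escaped_pattern, doc, re.DOTALL)]

print(len(no_flag))                    # 0
print(len(with_flag))                  # 1
print(escaped == with_flag)            # True
print(with_flag[0].endswith("share"))  # True: capture stops at the inner </td>
```

The last line is the nesting limitation in action: the lazy `(.+?)` ends at the first `</td>` it sees, which belongs to the inner `<td class="vt">`, not to the post cell.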

Trevor Merrifield

4,541
2
21
24

Thanks a lot for your answer! I am just wondering why it may break? Is it because of nested tags? – Erin Jul 22 '16 at 19:58
Right. If you're scraping Stack Overflow you could run into trouble if the post has `<td>` tags because of markdown or some other formatting. I'm not sure if any markdown actually introduces that tag. – Trevor Merrifield Jul 22 '16 at 20:18
For example your regex would match `blah blah blah` as `` – Trevor Merrifield Jul 22 '16 at 20:21
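
If the nesting problem does bite and BeautifulSoup is off the table, one option is the standard library's `html.parser`, counting `<td>` depth so that inner cells don't end the capture. This is a minimal sketch, not a hardened scraper: the class name `PostCellExtractor` and the sample `doc` are made up for illustration, and it rebuilds closing tags itself rather than copying them verbatim.

```python
from html.parser import HTMLParser


class PostCellExtractor(HTMLParser):
    """Collect the inner HTML of each <td class="postcell">, nested <td>s included."""

    def __init__(self):
        # Keep entities such as &amp; as written instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.cells = []    # finished cells, as strings
        self._depth = 0    # <td> nesting depth inside the current post cell
        self._parts = []   # pieces of the cell being collected

    def handle_starttag(self, tag, attrs):
        if self._depth:
            self._parts.append(self.get_starttag_text())
            if tag == "td":
                self._depth += 1
        elif tag == "td" and "postcell" in (dict(attrs).get("class") or "").split():
            self._depth = 1  # entering a post cell; the opening tag itself is not kept

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <img ... /> never change the depth.
        if self._depth:
            self._parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if not self._depth:
            return
        if tag == "td":
            self._depth -= 1
            if self._depth == 0:
                # This </td> closes the post cell itself.
                self.cells.append("".join(self._parts))
                self._parts = []
                return
        self._parts.append(f"</{tag}>")

    def handle_data(self, data):
        if self._depth:
            self._parts.append(data)

    def handle_entityref(self, name):
        if self._depth:
            self._parts.append(f"&{name};")

    def handle_charref(self, name):
        if self._depth:
            self._parts.append(f"&#{name};")


# Hypothetical sample with a nested <td>, like the one in the question.
doc = """<td class="postcell">
<p>a &amp; b</p>
<table><tr><td class="vt">share</td></tr></table>
</td>"""

parser = PostCellExtractor()
parser.feed(doc)
parser.close()
print(parser.cells[0])
```

Unlike the lazy regex, the capture here only ends when the depth counter returns to zero, so the inner `<td class="vt">...</td>` stays inside the result.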
