Python: HTML regex not matching

Question

I have this code:

reg = re.search('<div class="col result_name">(.*)</div>', html)
print 'Value is', reg.group()

Where 'html' contains something like this:

        <div class="col result_name">
            <h4>Blah</h4>
            <p>
                blah
            </p>
        </div>

But it's not returning anything.

Value is
Traceback (most recent call last):
  File "run.py", line 37, in <module>
    print 'Value is', reg.group()

... and this is why you *should NOT* 'parse' HTML with regex. — user225312, Jan 10 '11 at 18:40
[Read this](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454) then use the appropriate tools for parsing html. — Jochen Ritzel, Jan 10 '11 at 18:40
@A A: No, that is why you should not 'parse' anything with regex without reading the `re` docs. — John Machin, Jan 10 '11 at 20:46

score 6 · Accepted Answer · edited May 23 '17 at 12:18

Don't use regex to parse HTML. Use an HTML parser:

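import lxml.html

# your_html is the HTML source as a string
doc = lxml.html.fromstring(your_html)
# Select every <div> whose class attribute is exactly "col result_name"
result = doc.xpath("//div[@class='col result_name']")
print(result)  # a list of matching elements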
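
To get at the content of the matched divs rather than the element objects, something like the following may help. It is only a minimal sketch: the `html` string is a made-up stand-in for the asker's page, and `text_content()` is just one of several ways to read an element.

```python
import lxml.html

# Hypothetical sample input, modelled on the snippet in the question.
html = """
<html><body>
        <div class="col result_name">
            <h4>Blah</h4>
            <p>
                blah
            </p>
        </div>
</body></html>
"""

doc = lxml.html.fromstring(html)
divs = doc.xpath("//div[@class='col result_name']")

# Collapse runs of whitespace so each div's text prints on one line.
texts = [" ".join(div.text_content().split()) for div in divs]
for text in texts:
    print(text)
```

Unlike the regex, this does not care how the markup is split across lines.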

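Obligatory link: [RegEx match open tags except XHTML self-contained tags](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags)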

answered Jan 10 '11 at 18:41

nosklo

I'm getting results like this: [, – Zeno Jan 10 '11 at 18:55
@Zeno: Yeah, those are all the divs lxml found in your html. The elements. You can print them, or do further parsing with them. For example, try this: `for onediv in result: print lxml.html.tostring(onediv, pretty_print=True)` – nosklo Jan 10 '11 at 19:27
Does xpath support regex? I want to do something like (col|row) in there. – Zeno Jan 10 '11 at 21:54
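
On the `(col|row)` question: plain XPath 1.0 has no regex functions, but lxml exposes the EXSLT regular-expression extensions under a namespace, which may be enough here. A rough sketch, where the `html` sample, the `RE_NS` name and the pattern are all invented for illustration:

```python
import lxml.html

# EXSLT regular-expressions namespace, which lxml supports in XPath.
RE_NS = "http://exslt.org/regular-expressions"

# Invented sample markup for illustration only.
html = """
<div>
  <div class="col result_name">first</div>
  <div class="row result_name">second</div>
  <div class="other">third</div>
</div>
"""

doc = lxml.html.fromstring(html)

# re:test(string, pattern) is true when the pattern matches the string.
matches = doc.xpath(
    "//div[re:test(@class, '^(col|row) result_name$')]",
    namespaces={"re": RE_NS},
)
classes = [div.get("class") for div in matches]
print(classes)
```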

score 3 · Answer 2 · answered Jan 10 '11 at 18:39

The dot does not necessarily match newlines in regular expressions; you need the DOTALL flag (`re.DOTALL`, or `(?s)` inside the pattern) for that.
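
Applied to the pattern from the question, that might look like the sketch below. The `html` string is a stand-in for the real page, and the non-greedy `(.*?)` is an added assumption so that the match stops at the first `</div>`.

```python
import re

# Stand-in for the asker's page.
html = """
        <div class="col result_name">
            <h4>Blah</h4>
            <p>
                blah
            </p>
        </div>
"""

pattern = '<div class="col result_name">(.*?)</div>'

# Without DOTALL, '.' cannot cross the newline after the opening tag,
# so the search finds nothing.
no_flag = re.search(pattern, html)

# With DOTALL, '.' also matches newlines and the whole div body is captured.
with_flag = re.search(pattern, html, re.DOTALL)

print(no_flag)
print(with_flag.group(1).strip())
```

This still breaks on nested divs, which is the usual argument for a real parser.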

Ulrich Schwarz

score 2 · Answer 3 · answered Jan 10 '11 at 18:40

From the `re` documentation (http://docs.python.org/library/re.html):

The special characters are:

'.' (Dot.) In the default mode, this matches any character except a newline. If the DOTALL flag has been specified, this matches any character including a newline.
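
A small illustration of the quoted behaviour, using a made-up two-line string. It also shows why it is worth checking the result before calling `.group()`: `re.search` returns `None` when nothing matches.

```python
import re

text = "a\nb"  # made-up input with a newline between the letters

default_match = re.search("a.b", text)            # '.' does not match '\n'
dotall_match = re.search("a.b", text, re.DOTALL)  # '.' matches '\n' too

# Guard against None before touching .group().
for label, match in [("default", default_match), ("DOTALL", dotall_match)]:
    if match is None:
        print(label + ": no match")
    else:
        print(label + ": matched " + repr(match.group()))
```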

Thom Wiggers
