Why does this regex not work: r'.*logo.*'

Question

I expect the following regular expression to match, but it does not. Why?

import re
html = '''
                <a href="#">
                    <img src="logo.png" alt="logo" width="100%">
                    </img>
                 </a>
  '''
m = re.match( r'.*logo.*', html, re.M|re.I)

if m: 
    print m.group(1)
if not m:
    print "not found"

See also http://stackoverflow.com/a/1732454/14122 – Charles Duffy Feb 10 '14 at 22:27

score 12 · Accepted Answer · answered Feb 10 '14 at 22:27

We don't use regex to parse HTML.

REPEAT AFTER ME: WE DON'T USE REGEX TO PARSE HTML.

That said, it doesn't work because re.match only tries to match at the beginning of the string. Use re.search or re.findall instead.
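
A minimal sketch of the difference, reusing the markup from the question. It is written for Python 3, which is an assumption (the question uses Python 2 print statements), and the variable names `matched` and `searched` are only for illustration.

```python
import re

html = '''
                <a href="#">
                    <img src="logo.png" alt="logo" width="100%">
                    </img>
                 </a>
  '''

# re.match only tries position 0 of the string. The string starts with a
# newline, and "." does not match a newline by default, so ".*" matches
# nothing there and "logo" cannot follow.
matched = re.match(r'.*logo.*', html, re.M | re.I)

# re.search tries every starting position, so it finds the <img> line.
searched = re.search(r'.*logo.*', html, re.M | re.I)

print(matched)            # None
print(searched.group(0))  # the whole line that contains "logo"
```

Note that the pattern has no capturing group, so once a match is found `m.group(1)` raises an IndexError; `m.group(0)` (or `m.group()`) returns the matched text.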

Adam Smith

*Recommends beautiful soup.* – Mr. Polywhirl Feb 10 '14 at 22:27
This answer is better than mine because it gets to the root of the problem. – SethMMorton Feb 10 '14 at 22:27
(quickest way to get reputation on SO? Telling someone not to use regex to parse HTML.) – Adam Smith Feb 10 '14 at 22:28
@Mr.Polywhirl, Beautiful Soup is just a wrapper around lxml.html these days; why not use the real, underlying library (which is arguably a bit better-designed) directly? – Charles Duffy Feb 10 '14 at 22:28
...but, err, shouldn't the use of a leading `.*` and `re.MULTILINE` allow `re.match` to work too, since the `.*` should match all the preceding content, even through linebreaks? – Charles Duffy Feb 10 '14 at 22:32
(though admittedly, arguing about how to do an evil thing better is kinda' silly, as opposed to holding a hard line on Don't Do That). – Charles Duffy Feb 10 '14 at 22:33
@CharlesDuffy from the docs: "Note that even in MULTILINE mode, re.match() will only match at the beginning of the string and not at the beginning of each line." – Adam Smith Feb 10 '14 at 22:35
Ahh, gotcha -- `re.DOTALL` would be necessary for the leading `.*` to match newlines. Now that makes sense. – Charles Duffy Feb 10 '14 at 22:37
@adsmith As much as I agree with not parsing HTML with regex, giving the OP an example using a parsing library would help correct him of his erred ways. – Uyghur Lives Matter Feb 10 '14 at 22:58

score 1 · Answer 2 · answered Feb 10 '14 at 22:27

Use re.search. re.match assumes the match is at the beginning of the string.

SethMMorton

...well, arguably, the `.*` should allow this to match anyhow, with `re.MULTILINE` in use. – Charles Duffy Feb 10 '14 at 22:31
Ok, so if that's not the problem, what is? – SethMMorton Feb 10 '14 at 22:32
That's a good question, and if I knew (or, well, had time/inclination to reproduce), I'd be posting an answer myself. :) – Charles Duffy Feb 10 '14 at 22:34
Answered -- would need `re.DOTALL` in addition to `re.MULTILINE` for the leading `.*` to match past a newline. – Charles Duffy Feb 10 '14 at 22:38

score 1 · Answer 3 · answered Feb 10 '14 at 23:04

You needed to include the re.DOTALL (== re.S) flag to allow the . to match newline (\n).

However, that returns the entire document if "logo" appears anywhere in it; not terribly useful.
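
A short sketch of that behaviour, assuming Python 3 and the cleaned-up markup used in this answer:

```python
import re

html = """
    <a href="#">
        <img src="logo.png" alt="logo" width="100%" />
    </a>
"""

# With re.DOTALL (re.S), "." also matches "\n", so the leading ".*" can run
# from position 0 across the line breaks until it reaches "logo".
m = re.match(r'.*logo.*', html, re.S | re.I)

# The trailing ".*" is greedy as well, so the match covers the whole string.
print(m is not None)       # True
print(m.group(0) == html)  # True
```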

Slightly better is

import re
html = """
    <a href="#">
        <img src="logo.png" alt="logo" width="100%" />
    </a>
"""

match_logo = re.compile(r'<[^<]*logo[^>]*>', flags = re.I | re.S)

for found in match_logo.findall(html):
    print(found)

which returns

<img src="logo.png" alt="logo" width="100%" />

Better yet would be

from bs4 import BeautifulSoup

# Name the parser explicitly; bs4 warns when it has to guess one.
pg = BeautifulSoup(html, "html.parser")
print(pg.find("img", {"alt": "logo"}))
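
If installing a third-party package is not an option, roughly the same lookup can be sketched with the standard library's html.parser module. This is an illustrative alternative rather than part of the original answer, and the class name `LogoFinder` is hypothetical.

```python
from html.parser import HTMLParser

html = """
    <a href="#">
        <img src="logo.png" alt="logo" width="100%" />
    </a>
"""


class LogoFinder(HTMLParser):
    """Collect the attributes of every <img> whose alt text is "logo"."""

    def __init__(self):
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs; tag is lowercased.
        attributes = dict(attrs)
        if tag == "img" and attributes.get("alt") == "logo":
            self.found.append(attributes)


finder = LogoFinder()
finder.feed(html)
print(finder.found)
```

Self-closing tags such as `<img ... />` are routed through `handle_startendtag`, whose default implementation calls `handle_starttag`, so the sketch handles both spellings of the tag.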
