0

I expect the following regular expression to match, but it does not. Why?

import re
html = '''
                <a href="#">
                    <img src="logo.png" alt="logo" width="100%">
                    </img>
                 </a>
  '''
m = re.match( r'.*logo.*', html, re.M|re.I)

if m:
    print(m.group(0))
else:
    print("not found")
Chris Martin
user1187968

3 Answers

12

We don't use regex to parse HTML.

REPEAT AFTER ME: WE DON'T USE REGEX TO PARSE HTML.

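That said, it doesn't work because re.match only matches at the beginning of the string (not at the beginning of each line, even with re.M). Your string starts with a newline, and without re.DOTALL the leading .* cannot cross it, so the match fails immediately. Use re.search or re.findall instead.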
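A minimal sketch of the difference, using a shortened stand-in for the question's html string (the variable names are illustrative, not from the question):

```python
import re

# Shortened stand-in for the question's string; note that it starts with a newline.
html = '''
    <a href="#">
        <img src="logo.png" alt="logo" width="100%">
    </a>
'''

pattern = r'.*logo.*'

# re.match anchors at position 0. Without re.DOTALL, '.' cannot consume the
# leading '\n', so '.*' matches nothing and 'logo' then fails against '\n'.
match_result = re.match(pattern, html, re.M | re.I)

# re.search scans forward and succeeds on the first line containing "logo".
search_result = re.search(pattern, html, re.I)

# re.findall returns every non-overlapping match as a list of strings.
all_lines = re.findall(pattern, html, re.I)

print(match_result)                      # None
print(search_result.group(0).strip())    # the <img ...> line
print(len(all_lines))                    # 1
```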

Adam Smith
  • *Recommends beautiful soup.* – Mr. Polywhirl Feb 10 '14 at 22:27
  • This answer is better than mine because it gets to the root of the problem. – SethMMorton Feb 10 '14 at 22:27
  • (quickest way to get reputation on SO? Telling someone not to use regex to parse HTML.) – Adam Smith Feb 10 '14 at 22:28
  • @Mr.Polywhirl, Beautiful Soup is just a wrapper around lxml.html these days; why not use the real, underlying library (which is arguably a bit better-designed) directly? – Charles Duffy Feb 10 '14 at 22:28
  • ...but, err, shouldn't the use of a leading `.*` and `re.MULTILINE` allow `re.match` to work too, since the `.*` should match all the preceding content, even through linebreaks? – Charles Duffy Feb 10 '14 at 22:32
  • (though admittedly, arguing about how to do an evil thing better is kinda' silly, as opposed to holding a hard line on Don't Do That). – Charles Duffy Feb 10 '14 at 22:33
  • @CharlesDuffy from the docs: "Note that even in MULTILINE mode, re.match() will only match at the beginning of the string and not at the beginning of each line." – Adam Smith Feb 10 '14 at 22:35
  • Ahh, gotcha -- `re.DOTALL` would be necessary for the leading `.*` to match newlines. Now that makes sense. – Charles Duffy Feb 10 '14 at 22:37
  • @adsmith As much as I agree with not parsing HTML with regex, giving the OP an example using a parsing library would help correct him of his erred ways. – Uyghur Lives Matter Feb 10 '14 at 22:58
1

Use re.search. re.match assumes the match is at the beginning of the string.
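One thing to watch once re.search succeeds: a pattern like .*logo.* has no capturing groups, so only group(0) (the whole match) exists and group(1) raises IndexError. A small sketch, where the second pattern with a capturing group is my own illustration rather than anything from the question:

```python
import re

line = '<img src="logo.png" alt="logo" width="100%">'

m = re.search(r'.*logo.*', line, re.I)
whole = m.group(0)  # the entire matched text

# The pattern has no parentheses, so there is no group 1 to return.
try:
    m.group(1)
    has_group_one = True
except IndexError:
    has_group_one = False

# Adding a capturing group (illustrative pattern) makes group(1) available.
m2 = re.search(r'src="([^"]*logo[^"]*)"', line, re.I)
src = m2.group(1)

print(whole)
print(has_group_one)  # False
print(src)            # logo.png
```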

SethMMorton
1

You need to include the re.DOTALL (== re.S) flag to allow . to match a newline (\n).

However, that returns the entire document if "logo" appears anywhere in it; not terribly useful.
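To see why, here is a small check on a shortened stand-in for the question's string (the variable names are illustrative): with re.S the greedy .* on both sides of "logo" swallows everything, newlines included.

```python
import re

html = '''
    <a href="#">
        <img src="logo.png" alt="logo" width="100%">
    </a>
'''

# With re.S, '.' also matches '\n', so re.match can get past the leading newline.
m = re.match(r'.*logo.*', html, re.I | re.S)

# The match spans the whole input, not just the tag that mentions "logo".
matched_everything = (m.group(0) == html)
print(matched_everything)  # True
```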

Slightly better is

import re
html = """
    <a href="#">
        <img src="logo.png" alt="logo" width="100%" />
    </a>
"""

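# Match a single tag whose text contains "logo", case-insensitively.
# Negated character classes already match newlines, so re.S is not needed.
match_logo = re.compile(r'<[^<]*logo[^>]*>', flags=re.I)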

for found in match_logo.findall(html):
    print(found)

which returns

<img src="logo.png" alt="logo" width="100%" />

Better yet would be

from bs4 import BeautifulSoup

# Reuses the html string defined above; naming the parser avoids a bs4 warning.
pg = BeautifulSoup(html, "html.parser")
print(pg.find("img", {"alt": "logo"}))
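If installing bs4 is not an option, roughly the same lookup can be sketched with the standard library's html.parser. This is a swapped-in technique, not what the answer proposes, and ImgAltFinder is a hypothetical helper name of my own:

```python
from html.parser import HTMLParser


class ImgAltFinder(HTMLParser):
    """Hypothetical helper: collects attributes of <img> tags with a given alt."""

    def __init__(self, alt):
        super().__init__()
        self.alt = alt
        self.found = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag names arrive lowercased.
        attr_map = dict(attrs)
        if tag == "img" and attr_map.get("alt") == self.alt:
            self.found.append(attr_map)

    # Self-closing tags such as <img ... /> also reach handle_starttag,
    # because the default handle_startendtag delegates to it.


html = """
    <a href="#">
        <img src="logo.png" alt="logo" width="100%" />
    </a>
"""

finder = ImgAltFinder("logo")
finder.feed(html)
print(finder.found)
```

Unlike the regex version, this compares the alt attribute exactly, so it will not fire on a tag that merely mentions "logo" somewhere else.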
Hugh Bothwell