re.findall and regex

Question

I need to extract the names from content like this:

<p>
<a name="blu" title="blu"></a>orense
</p>
<p>
<a name="bla" title="bla"></a>toledo
</p>
<p>
<a name="blo" title="blo"></a>sevilla
</p>

but with this code:

names = []
matches = re.findall(r'''<a\stitle="(?P<title>[^">]+)"\sname="(?P<name>[^">]+)"></a>''',content, re.VERBOSE)
for (title, name) in matches:
    if title == name:
        names.append(title)
return names

...I get names = []. What is wrong? Thanks.

About your *blasphemous* need to use regex for parsing HTML, read the first answer to this question: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags — Stephane Rolland, May 09 '12 at 08:05

Tim Pietzcker · Answer 1 · 2012-05-09T08:00:05.770

Uh, well obviously, in your sample text, name comes before title, and in your regex, title is expected before name. This is precisely the reason (or one of them) why you should be using an HTML parser instead. Try BeautifulSoup for example.
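To illustrate the parser route without any third-party dependency, here is a minimal sketch that swaps BeautifulSoup for the standard library's html.parser module. The class name AnchorNameCollector is made up for this example, and the sample content is the question's HTML with one anchor's attributes deliberately reversed to show that order no longer matters.

```python
from html.parser import HTMLParser


class AnchorNameCollector(HTMLParser):
    """Collect the name of every <a> tag whose name and title match.

    Hypothetical helper for illustration; not part of the original code.
    """

    def __init__(self):
        super().__init__()
        self.names = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        # attrs is a list of (attribute, value) pairs; as a dict,
        # the order in which they appear in the tag is irrelevant.
        attributes = dict(attrs)
        name = attributes.get("name")
        if name is not None and name == attributes.get("title"):
            self.names.append(name)


# Sample HTML from the question, with the second anchor's
# attributes swapped on purpose (an assumption for this demo).
content = '''
<p>
<a name="blu" title="blu"></a>orense
</p>
<p>
<a title="bla" name="bla"></a>toledo
</p>
<p>
<a name="blo" title="blo"></a>sevilla
</p>
'''

collector = AnchorNameCollector()
collector.feed(content)
collector.close()
names = collector.names
print(names)  # ['blu', 'bla', 'blo']
```

With BeautifulSoup the same idea would likely be shorter, but this version runs on a plain Python 3 install.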

If you insist on regex, just turn the parameters around (and make sure that you'll never get those attributes in a different order, and never any other attributes than those):

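import re

# content holds the HTML shown in the question
names = []
# name now comes before title, matching the attribute order in the sample HTML
matches = re.findall(r'''<a\sname="(?P<name>[^">]+)"\stitle="(?P<title>[^">]+)"></a>''', content, re.VERBOSE)
for (name, title) in matches:
    # keep only the anchors whose name and title agree
    if title == name:
        names.append(title)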

Result:

>>> names
['blu', 'bla', 'blo']
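If the attribute order cannot be guaranteed and a parser is not an option, one possible workaround is to match the tag first and pull the attributes out in a second pass. This is only a sketch: the names TAG_RE, ATTR_RE and matching_names are invented here, and it assumes double-quoted attribute values and no ">" inside them, so it is still far more brittle than a real HTML parser.

```python
import re

# Step 1: capture everything between "<a " and the closing ">".
TAG_RE = re.compile(r'<a\s([^>]*)>')
# Step 2: pull out attribute="value" pairs (double quotes only).
ATTR_RE = re.compile(r'(\w+)="([^"]*)"')


def matching_names(content):
    """Return the name of each <a> tag whose name equals its title."""
    names = []
    for attr_text in TAG_RE.findall(content):
        # A dict makes the attribute order and any extra attributes irrelevant.
        attrs = dict(ATTR_RE.findall(attr_text))
        name = attrs.get("name")
        if name is not None and name == attrs.get("title"):
            names.append(name)
    return names


# Made-up sample: reversed order, an extra attribute, and a mismatch.
sample = (
    '<a title="blu" name="blu"></a>'
    '<a name="bla" title="bla" class="x"></a>'
    '<a name="a" title="b"></a>'
)
print(matching_names(sample))  # ['blu', 'bla']
```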

Yes, BeautifulSoup is far more reliable, but this is inherited code and I couldn't work out why the regex was wrong. I couldn't see the forest for the trees. Thanks a lot. — Antonio, May 09 '12 at 08:45
