Getting all instances of a regular expression in python with

Question

I'm trying to get the innerHTML of every link using the following:

import re

s = '<div><a href="page1.html" title="page1">Go to 1</a>, <a href="page2.html" title="page2">Go to page 2</a><a href="page3.html" title="page3">Go to page 3</a>, <a href="page4.html" title="page4">Go to page 4</a></div>'
match = re.findall(r'<a.*>(.*)</a>', s)

for string in match:
    print(string)

But I'm only getting the last occurrence, "Go to page 4". I think it's seeing one big string with several regex matches inside it, which are treated as overlapping and ignored. So, how do I get a collection that matches

['Go to 1', 'Go to page 2', 'Go to page 3', 'Go to page 4']

Jon Clements · Accepted Answer · 2013-07-26T23:00:29.093

2

Your immediate problem is that regex quantifiers are greedy by default, that is, they attempt to consume the longest string possible. So you're correct that it matches all the way up to the last </a> it can find. Change them to be non-greedy (.*?):

match = re.findall(r'<a.*?>(.*?)</a>', s)
                             ^
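To see the difference side by side, here is a small sketch that runs both patterns against the string from the question (the variable names `greedy` and `lazy` are just for this illustration):

```python
import re

s = '<div><a href="page1.html" title="page1">Go to 1</a>, <a href="page2.html" title="page2">Go to page 2</a><a href="page3.html" title="page3">Go to page 3</a>, <a href="page4.html" title="page4">Go to page 4</a></div>'

# Greedy: the first .* runs to the end of the string and backtracks
# only as far as it must, so a single match swallows all four links.
greedy = re.findall(r'<a.*>(.*)</a>', s)
print(greedy)  # ['Go to page 4']

# Non-greedy: each .*? stops at the earliest point that lets the rest
# of the pattern match, so every link is matched on its own.
lazy = re.findall(r'<a.*?>(.*?)</a>', s)
print(lazy)  # ['Go to 1', 'Go to page 2', 'Go to page 3', 'Go to page 4']
```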

However, this is a horrible way of parsing HTML: it is not robust and will break on the smallest of changes.

Here's a far better way of doing it:

from bs4 import BeautifulSoup

s = '<div><a href="page1.html" title="page1">Go to 1</a>, <a href="page2.html" title="page2">Go to page 2</a><a href="page3.html" title="page3">Go to page 3</a>, <a href="page4.html" title="page4">Go to page 4</a></div>'
soup = BeautifulSoup(s, 'html.parser')  # name the parser explicitly to avoid a warning
print([el.string for el in soup('a')])
# ['Go to 1', 'Go to page 2', 'Go to page 3', 'Go to page 4']

You can then use the same approach to get the href as well as the text, e.g.:

print([[el.string, el['href']] for el in soup('a', href=True)])
# [['Go to 1', 'page1.html'], ['Go to page 2', 'page2.html'], ['Go to page 3', 'page3.html'], ['Go to page 4', 'page4.html']]

edited Jul 26 '13 at 23:00

answered Jul 26 '13 at 22:38

Thanks! I didn't really quite understand ? in regexs, this was a great learning experience. Here's what I got to work match = re.findall(r'(.*?)', s) – SteveC Jul 26 '13 at 22:41
@user1450120 I didn't see the other .* :) Anyway - expect this to break later on or potentially return wrong results... Look at using `beautifulsoup` to parse HTML - it's easy to learn and flexible – Jon Clements Jul 26 '13 at 22:42
What kind of input might cause this to break? – SteveC Jul 26 '13 at 22:44
@user1450120 Try `< a> <` for eg: – Jon Clements Jul 26 '13 at 22:48
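As an illustration of the kind of input that trips up even the non-greedy pattern, here is a short sketch. The three sample strings are made up for this example and are not taken from the thread:

```python
import re

pattern = r'<a.*?>(.*?)</a>'

# 1. Link text wrapped onto two lines: "." does not match a newline
#    by default, so the link is silently missed.
multiline = re.findall(pattern, '<a href="p1.html">Go to\npage 1</a>')
print(multiline)  # []

# 2. Any tag whose name starts with "a" (here <abbr>) also satisfies
#    "<a", so the capture starts in the wrong place.
lookalike = re.findall(pattern, '<abbr title="x">HTML</abbr> <a href="p1.html">Go to 1</a>')
print(lookalike)  # ['HTML</abbr> <a href="p1.html">Go to 1']

# 3. Markup nested inside the link is returned raw rather than as text.
nested = re.findall(pattern, '<a href="p1.html"><b>Go</b> to 1</a>')
print(nested)  # ['<b>Go</b> to 1']
```

Each case can be patched individually (re.DOTALL for the first, for instance), but every patch makes the pattern harder to read, which is the argument for using a real parser.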

score 2 · Answer 2 · edited May 23 '17 at 11:57

I would avoid parsing HTML using regex at ALL costs. Check out this article and this SO post for why. But to sum it up...

Every time you attempt to parse HTML with regular expressions, the unholy child weeps the blood of virgins, and Russian hackers pwn your webapp

Instead I would take a look at a python HTML parsing package like BeautifulSoup or pyquery. They provide nice interfaces to traverse, retrieve, and edit HTML.
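If adding a third-party dependency is not an option, Python's standard-library html.parser module can handle a case this simple too. This is a different tool from the packages named above, shown only as a minimal sketch assuming Python 3; the LinkTextCollector class and its attributes are names invented for this example, not part of any library:

```python
from html.parser import HTMLParser


class LinkTextCollector(HTMLParser):
    """Hypothetical helper: collects (href, text) for every <a> element."""

    def __init__(self):
        super().__init__()
        self.links = []      # finished (href, text) pairs
        self._href = None    # href of the <a> currently open
        self._parts = None   # text chunks of the open <a>; None when outside one

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            self._href = dict(attrs).get('href')
            self._parts = []

    def handle_data(self, data):
        # Only keep text that appears inside an <a> element.
        if self._parts is not None:
            self._parts.append(data)

    def handle_endtag(self, tag):
        if tag == 'a' and self._parts is not None:
            self.links.append((self._href, ''.join(self._parts)))
            self._parts = None


s = '<div><a href="page1.html" title="page1">Go to 1</a>, <a href="page2.html" title="page2">Go to page 2</a><a href="page3.html" title="page3">Go to page 3</a>, <a href="page4.html" title="page4">Go to page 4</a></div>'

collector = LinkTextCollector()
collector.feed(s)
collector.close()
print([text for href, text in collector.links])
# ['Go to 1', 'Go to page 2', 'Go to page 3', 'Go to page 4']
```

For anything beyond flat links like these, BeautifulSoup or pyquery will be far less work than growing this class.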

score 1 · Answer 3 · answered Jul 26 '13 at 22:45

I suggest using lxml:

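from lxml import etree

# etree.fromstring needs well-formed markup; for messy real-world HTML use lxml.html.fromstring
s = '<div><a href="page1.html">Go to 1</a></div>'
tree = etree.fromstring(s)
for ele in tree.iter('a'):
    print(ele.text, ele.get('href'))  # do something with each element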

It provides an iterparse function for processing large files, and its parsers also accept file-like objects such as the response returned by urllib2.urlopen (urllib.request.urlopen in Python 3). I have been using it for a long time to parse HTML and XML.

See: http://lxml.de/tutorial.html#the-element-class
