Using regular expressions in Python

Question

I used

"<a href='/property-(.+?)-wa[a-zA-Z0-9-\s\" ]{5,50}><img src="

to get the property type from a web page that I want to analyze.

The regex should match markup like this:

<a href="/property-house-wa-joondalup-405127028" ><img src=

The pattern reads: <a href='/property- followed by house (the part I want), then -wa, then 5 to 50 letters, digits, hyphens, double quotes, or whitespace characters, then ><img src=

I tested it in a regex visualization tool and it seemed to be OK.

But the output is empty when I run the code.

code:

from urllib.request import urlopen
import re

url='https://www.realestate.com.au/rent/in-perth+-+greater+region,+wa/list-1'
page = urlopen(url).read().decode('utf-8')
##print(page)
propertyReg=re.compile(r"<a href='/property-(.+?)-wa[a-zA-Z0-9-\s\" ]{5,50}><img src=")
propertytext=re.findall(propertyReg,page)
print(propertytext)

[Have you tried using an HTML parser instead?](https://stackoverflow.com/a/1732454/3001761) — jonrsharpe, Sep 02 '18 at 09:45

score 1 · Answer 1 · answered Sep 02 '18 at 09:48

Avoid parsing HTML with regex. Use something built specifically for this, like Beautiful Soup:

>>> import re
>>> import requests
>>> from bs4 import BeautifulSoup
>>>
>>> url='https://www.realestate.com.au/rent/in-perth+-+greater+region,+wa/list-1'
>>> r = requests.get(url)
>>> soup = BeautifulSoup(r.text, 'html.parser')
>>> # Select anchors whose href starts with /property, e.g. /property-house-wa-...
>>> for a in soup.find_all('a', {'href': re.compile(r'^/property')}):
...     property_type = a['href'].split('-', 2)[1]  # second dash-separated field
...     print(property_type)

score 1 · Accepted Answer · answered Sep 02 '18 at 10:34

There is a bug in your regexp:

Instead of

"<a href='/property-(.+?)-wa[a-zA-Z0-9-\s\" ]{5,50}><img src="

, it should be:

"<a href=['\"]/property-(.+?)-wa[a-zA-Z0-9-\s\" ]{5,50}><img src="

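(so that both ' and " match after href=; the page quotes the attribute with ", as in the sample line in the question, so a pattern that only accepts ' never matches)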
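As a quick check, the two patterns can be compared on the sample line from the question. This is only a sketch: it assumes that line is representative of the real page's markup, which may have changed since.

```python
import re

# Sample markup copied from the question; the live page may differ.
sample = '<a href="/property-house-wa-joondalup-405127028" ><img src='

# Original pattern: only accepts a single quote after href=
original = re.compile(r"<a href='/property-(.+?)-wa[a-zA-Z0-9-\s\" ]{5,50}><img src=")
# Corrected pattern: accepts either quote character after href=
fixed = re.compile(r"<a href=['\"]/property-(.+?)-wa[a-zA-Z0-9-\s\" ]{5,50}><img src=")

original_matches = original.findall(sample)
fixed_matches = fixed.findall(sample)
print(original_matches)  # []
print(fixed_matches)     # ['house']
```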

Regular expressions can be daunting to work with if you need many of them in complicated scenarios. It may be better to use an HTML parser and match against its results instead. This avoids mistakes like the one you made, because the parser handles attribute value extraction for you.
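To illustrate the point about attribute extraction, here is a minimal sketch using only the standard library's html.parser. PropertyLinkParser is a hypothetical helper name, and the second link and the image file names in the sample are made up for illustration; only the first href comes from the question.

```python
from html.parser import HTMLParser


class PropertyLinkParser(HTMLParser):
    """Collects the property type from <a href="/property-TYPE-..."> links."""

    def __init__(self):
        super().__init__()
        self.property_types = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        # attrs is a list of (name, value) pairs; the quotes are already stripped
        href = dict(attrs).get("href") or ""
        if href.startswith("/property-"):
            # "/property-house-wa-..." -> ["/property", "house", "wa-..."]
            self.property_types.append(href.split("-", 2)[1])


# Hypothetical sample markup showing both quoting styles.
html = (
    '<a href="/property-house-wa-joondalup-405127028" ><img src="a.jpg"></a>'
    "<a href='/property-unit-wa-perth-123456789' ><img src='b.jpg'></a>"
)

parser = PropertyLinkParser()
parser.feed(html)
print(parser.property_types)  # ['house', 'unit']
```

Because the parser hands over the attribute value without its quotes, the quoting style used by the page no longer matters.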
