0

I used

"<a href='/property-(.+?)-wa[a-zA-Z0-9-\s\" ]{5,50}><img src="

to extract the property type from a web page that I want to analyze.

The regex is meant to match markup like this:

<a href="/property-house-wa-joondalup-405127028" ><img src=

The pattern breaks down as: <a href='/property- + house (the part I want) + -wa + 5 to 50 letters, digits, hyphens, double quotes, or whitespace + ><img src=

I tested it in a visualization tool and it seems to work.

But the output is empty when I run the code.

Code:

from urllib.request import urlopen
import re

url='https://www.realestate.com.au/rent/in-perth+-+greater+region,+wa/list-1'
page = urlopen(url).read().decode('utf-8')
##print(page)
propertyReg=re.compile(r"<a href='/property-(.+?)-wa[a-zA-Z0-9-\s\" ]{5,50}><img src=")
propertytext=re.findall(propertyReg,page)
print(propertytext)
Yiling Liu

2 Answers

1

Avoid parsing HTML with regular expressions. Use a library built for the job, such as Beautiful Soup:

>>> import re
>>> import requests
>>> from bs4 import BeautifulSoup
>>>
>>> url = 'https://www.realestate.com.au/rent/in-perth+-+greater+region,+wa/list-1'
>>> r = requests.get(url)
>>> soup = BeautifulSoup(r.text, 'html.parser')
>>> # Find every <a> whose href starts with /property, then take the second hyphen-separated field
>>> for a in soup.find_all('a', {'href': re.compile(r'^/property')}):
...     property_type = a['href'].split('-', 2)[1]
...     print(property_type)
...
Sunitha
1

There is a bug in your regexp: it only accepts a single quote after href=, but the markup you are matching uses double quotes.

Instead of

"<a href='/property-(.+?)-wa[a-zA-Z0-9-\s\" ]{5,50}><img src="

it should be:

"<a href=['\"]/property-(.+?)-wa[a-zA-Z0-9-\s\" ]{5,50}><img src="

(so that both ' and " match after href=)
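
As a quick check, here is a minimal sketch that runs both patterns against the sample anchor from the question. It only tests that one line of markup, so it assumes the live page still looks like the sample. The variable names are just for illustration.

```python
import re

# Sample anchor copied from the question; the live page markup may differ.
sample = '<a href="/property-house-wa-joondalup-405127028" ><img src='

# Original pattern: only a single quote is accepted after href=.
original = re.compile(r"<a href='/property-(.+?)-wa[a-zA-Z0-9-\s\" ]{5,50}><img src=")
# Corrected pattern: either quote character is accepted after href=.
corrected = re.compile(r"<a href=['\"]/property-(.+?)-wa[a-zA-Z0-9-\s\" ]{5,50}><img src=")

original_matches = original.findall(sample)
corrected_matches = corrected.findall(sample)

print(original_matches)   # [] because the sample uses a double quote
print(corrected_matches)  # ['house']
```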

Regular expressions can be daunting when you need many of them in complicated scenarios. It may be better to use an HTML parser and match against its results instead. This avoids mistakes like the one you made, because the parser handles attribute value extraction for you.
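
To illustrate, here is a rough sketch using only the standard library's html.parser. The class name PropertyTypeParser and the second (single-quoted) sample link are made up for the example, and it assumes the property type is always the second hyphen-separated field of the href. The parser hands you the href value whichever quote style the page uses.

```python
from html.parser import HTMLParser


class PropertyTypeParser(HTMLParser):
    """Collects the property type from every <a href="/property-..."> link."""

    def __init__(self):
        super().__init__()
        self.property_types = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        # attrs is a list of (name, value) pairs; value is None for bare attributes.
        href = dict(attrs).get("href") or ""
        if href.startswith("/property-"):
            # "/property-house-wa-..." -> ["/property", "house", "wa-..."]
            self.property_types.append(href.split("-", 2)[1])


# First link is the sample from the question; the second is invented
# to show that single-quoted attributes are handled the same way.
sample_html = (
    '<a href="/property-house-wa-joondalup-405127028" ><img src="a.jpg"></a>'
    "<a href='/property-apartment-wa-perth-123456789'><img src='b.jpg'></a>"
)

parser = PropertyTypeParser()
parser.feed(sample_html)
property_types = parser.property_types
print(property_types)  # ['house', 'apartment']
```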

Marcin