I would like to extract the following from each "article" tag:
- the link
- the latitude and longitude
- the number of pictures for each house
However, my attempt below doesn't work.
Here is the Python code:

    import urllib
    import urllib2
    import re
    import socket

    def getPage(infoUrl):
        # Download the page and return its decoded body, or '' on failure.
        url = infoUrl
        try:
            request = urllib2.Request(url)
            request.add_header("User-Agent", "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:21.0) Gecko/20100101 Firefox/21.0")
            response = urllib2.urlopen(request)
        except urllib2.URLError as e:
            print "Bad Url or timeout"
            print type(e)
            print e
            return ''
        except socket.timeout as e:
            print "socket timeout"
            print type(e)
            print e
            return ''
        else:
            return response.read().decode('utf8')

    print "Done"

    # Groups: latitude, longitude, link, figcaption text.
    pattern = re.compile(r'<article.*?latitude="(.*?)".*?longtitude="(.*?)"><a href="(.*?)".*?<figcaption.*?>(.*?)</figcaption>.*?</a>', re.S)
    infoUrl = 'http://www.zillow.com/homes/MA-02139_rb/'
    page = getPage(infoUrl)
    items = re.findall(pattern, page)
    print items
    for item in items:
        print item

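For comparison, the same extraction can be sketched with the standard library's HTML parser instead of a single regular expression. This is only a minimal sketch written for Python 3 (the code above is Python 2). The sample markup, the attribute names `latitude` and `longitude` (note that the pattern above searches for `longtitude`), and the names `SAMPLE_HTML`, `ArticleParser` and `extract_articles` are all assumptions made for illustration; the real page may be structured differently.

```python
from html.parser import HTMLParser

# Hypothetical sample markup: attribute names and nesting are assumptions,
# not the real structure of the page in the question.
SAMPLE_HTML = """
<article id="a1" latitude="42365000" longitude="-71104000">
  <a href="/homedetails/1/"><figure><img src="x.jpg">
    <figcaption class="photo-count">12 photos</figcaption></figure></a>
</article>
<article id="a2" latitude="42370000" longitude="-71110000">
  <a href="/homedetails/2/"><figure><img src="y.jpg">
    <figcaption class="photo-count">5 photos</figcaption></figure></a>
</article>
"""


class ArticleParser(HTMLParser):
    """Collect coordinates, first link and figcaption text per <article>."""

    def __init__(self):
        super().__init__()
        self.items = []          # one dict per finished <article>
        self.current = None      # the <article> being parsed, if any
        self.in_caption = False  # True while inside <figcaption>

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "article":
            self.current = {
                "latitude": attrs.get("latitude"),
                "longitude": attrs.get("longitude"),
                "href": None,
                "caption": "",
            }
        elif self.current is not None:
            if tag == "a" and self.current["href"] is None:
                # Keep only the first link inside the article.
                self.current["href"] = attrs.get("href")
            elif tag == "figcaption":
                self.in_caption = True

    def handle_data(self, data):
        if self.current is not None and self.in_caption:
            self.current["caption"] += data

    def handle_endtag(self, tag):
        if tag == "figcaption":
            self.in_caption = False
        elif tag == "article" and self.current is not None:
            self.current["caption"] = self.current["caption"].strip()
            self.items.append(self.current)
            self.current = None


def extract_articles(html):
    """Return a list of dicts, one per <article> element in html."""
    parser = ArticleParser()
    parser.feed(html)
    parser.close()
    return parser.items


items = extract_articles(SAMPLE_HTML)
for item in items:
    print(item)
```

A parser of this kind does not depend on the order of the attributes or on the exact characters between tags, which a single long pattern does.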
By the way, this Python script runs quite slowly.
Are there any suggestions for optimizing it?
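Whether the download or the pattern matching dominates the run time is not known here, so the following is only one possible angle. A pattern built from several `.*?` pieces combined with `re.S` may scan far ahead and backtrack heavily on a large page, especially when it ultimately fails to match. A minimal sketch of a more constrained pattern, using negated character classes so that each piece stays inside one tag or one attribute value, might look like this. The sample markup, the `longitude` spelling and the names `SAMPLE` and `TIGHT` are assumptions for illustration, not the real page.

```python
import re

# Hypothetical markup; attribute names and their order are assumptions.
SAMPLE = (
    '<article id="a1" latitude="42365000" longitude="-71104000">'
    '<a href="/homedetails/1/"><figcaption class="c">12 photos</figcaption></a>'
    '</article>'
)

# [^>]* cannot run past the end of the opening tag and [^"]* cannot run past
# the end of an attribute value, so a failed attempt is given up early.
TIGHT = re.compile(
    r'<article\b[^>]*\blatitude="([^"]*)"[^>]*\blongitude="([^"]*)"[^>]*>\s*'
    r'<a\b[^>]*\bhref="([^"]*)"[^>]*>.*?'
    r'<figcaption\b[^>]*>([^<]*)</figcaption>',
    re.S,
)

# Each match is a tuple: (latitude, longitude, link, figcaption text).
matches = TIGHT.findall(SAMPLE)
print(matches)
```

This sketch still assumes that `latitude` appears before `longitude` in the tag and that the link directly follows the opening `article` tag.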