I would like to extract the following from each "article" tag:

  1. the links
  2. the latitude and longitude
  3. the number of pictures for each house

But my regex returns no matches.

Here is the Python code:

import urllib
import urllib2
import re
import socket

def getPage(infoUrl):
    url = infoUrl
    try:
        request =  urllib2.Request(url)
        request.add_header("User-Agent","Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:21.0) Gecko/20100101 Firefox/21.0")
        response = urllib2.urlopen(request)
    except urllib2.URLError, e:
        print "Bad Url or timeout"
        print type(e)
        print e
        return ''
    except socket.timeout,e:
        print "socket timeout"
        print type(e)
        print e
        return ''
    else:
        return response.read().decode('utf8')
        print "Done"

pattern = re.compile(r'<article.*?latitude="(.*?)".*?longtitude="(.*?)"><a href="(.*?)".*?<figcaption.*?>(.*?)</figcaption>.*?</a>',re.S)

infoUrl = 'http://www.zillow.com/homes/MA-02139_rb/'
page = getPage(infoUrl)

items = re.findall(pattern,page)
print items
for item in items:
    print item

By the way, this Python script runs pretty slowly.

Any suggestions for optimizing it?

Artur Peniche
Bright Liu
    You've misspelled "longitude"; if it's like that in your actual code, I'd say that's your problem. The regex is so loose (with all those `.*?`'s) that it takes forever to fail. – Alan Moore Sep 03 '15 at 11:40
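
To illustrate the comment above, here is a minimal sketch of a tighter pattern with the spelling corrected to `longitude`. It is written for Python 3, and the sample markup is hypothetical: it is modeled on the regex in the question, not copied from the real page, so the attribute order and nesting are assumptions. Negated character classes such as `[^>]*` and `[^"]*` keep each piece of the match inside one tag or one attribute value, so a failed attempt gives up quickly instead of scanning the rest of the page.

```python
import re

# Hypothetical markup modeled on the question's regex; the real page may differ.
SAMPLE = (
    '<article id="a1" latitude="42365000" longitude="-71104000">'
    '<a href="/homedetails/1-Example-St/"><figure>'
    '<figcaption class="c">12 photos</figcaption></figure></a></article>'
)

PATTERN = re.compile(
    r'<article\b[^>]*?\blatitude="([^"]*)"'        # stay inside the opening tag
    r'[^>]*?\blongitude="([^"]*)"[^>]*>'           # "longitude", spelled correctly
    r'\s*<a\b[^>]*?\bhref="([^"]*)"'               # first link in the article
    r'.*?<figcaption\b[^>]*>([^<]*)</figcaption>',  # caption text only
    re.S,
)

items = PATTERN.findall(SAMPLE)
for latitude, longitude, href, caption in items:
    print(latitude, longitude, href, caption)
```

The one remaining `.*?` can still run past the end of an article that has no `<figcaption>`, which is one reason an HTML parser is the more robust choice.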

1 Answer

I strongly advise you to use a library like Beautiful Soup to parse HTML. This is a clear use case for it, and it will be more reliable and easier to maintain than your regex.

For example:

from bs4 import BeautifulSoup

soup = BeautifulSoup(your_html_text, "html.parser")
article = soup.article  # first <article> element, or None if there is none

This gives you the first <article> tag in the document.
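
To get all three pieces of data from every article, something along these lines might work. This is only a sketch for Python 3 with the `bs4` package installed. The sample markup is hypothetical and modeled on the regex in the question, so the `latitude` and `longitude` attributes, the nested `<a href>`, and the `<figcaption>` are assumptions about the page, and `extract_listings` is an illustrative name.

```python
from bs4 import BeautifulSoup

# Hypothetical markup modeled on the question's regex; the real page may differ.
SAMPLE_HTML = """
<ul>
  <li><article id="a1" latitude="42365000" longitude="-71104000">
    <a href="/homedetails/1-Example-St/"><figure><img src="x.jpg">
    <figcaption class="photo-count">12 photos</figcaption></figure></a>
  </article></li>
  <li><article id="a2" latitude="42370000" longitude="-71110000">
    <a href="/homedetails/2-Example-St/"><figure><img src="y.jpg">
    <figcaption class="photo-count">5 photos</figcaption></figure></a>
  </article></li>
</ul>
"""

def extract_listings(html):
    """Return one dict per <article>, with None for anything missing."""
    soup = BeautifulSoup(html, "html.parser")
    listings = []
    for article in soup.find_all("article"):
        link = article.find("a", href=True)      # first link that has an href
        caption = article.find("figcaption")     # holds the photo count text
        listings.append({
            "latitude": article.get("latitude"),
            "longitude": article.get("longitude"),
            "href": link["href"] if link else None,
            "photos": caption.get_text(strip=True) if caption else None,
        })
    return listings

listings = extract_listings(SAMPLE_HTML)
for listing in listings:
    print(listing)
```

Because each lookup is scoped to a single article, a listing with a missing link or caption yields `None` for that field instead of breaking the match for the whole page.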

EDIT: As the question was just changed, please look at the BeautifulSoup documentation in the link above. This will answer your basic question.

Lawrence Benson
    Also, [you shouldn't parse html with regex](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454) – 301_Moved_Permanently Sep 03 '15 at 11:07