
I want to scrape some tables of average house rents in Wellington, New Zealand. There are separate tables for each suburb of Wellington, and each is on its own page. The problem I have is finding the address for each of these pages so I can scrape the tables.

Here is the link to the website I am working on: http://www.dbh.govt.nz/market-rent?TLA=Wellington&RegionId=9. To find the links to the suburb pages I used the "view page source" option in Google Chrome. However, despite being able to click each suburb to see its table of rents, the HTML doesn't seem to contain any links; there are no href attributes.

Could anybody explain how these work as links without an href? Also, does anybody know a way to find the URL for each suburb's table? Ultimately I want to iterate through a list of suburb URLs and use Python's BeautifulSoup module to extract the tables of rents.

Kind regards, Alex


1 Answer


You are right, they are not "links", and in that sense there is no href attribute on them. Each "link" is actually a form <input> element of type submit. Quite an interesting (and non-standard) way of doing things!

Here are some places to learn more about html forms:

You will be able to build the complete HTTP request for each suburb table by referencing the parent <form> element, which will give you the URL (its action attribute) and the submission method (either POST or GET), and by determining the request parameters for each "link" from the corresponding <input> element.
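For illustration, here is a rough sketch of that approach in Python, using urllib2 and BeautifulSoup. I have not inspected the actual markup, so the assumptions that the suburb buttons all live in a single <form>, that each one is an <input type="submit">, and that the rent figures come back in the first <table> of each suburb page are guesses you will probably need to adjust.

```python
# Sketch only: parse the form, enumerate its submit buttons, and replay one
# request per button, then read the resulting rent table.
import urllib
import urllib2
import urlparse

from bs4 import BeautifulSoup

BASE_URL = 'http://www.dbh.govt.nz/market-rent?TLA=Wellington&RegionId=9'

html = urllib2.urlopen(BASE_URL).read()
soup = BeautifulSoup(html, 'html.parser')

# The suburb "buttons" are assumed to sit inside one <form>; its action and
# method tell us where and how to send each request.
form = soup.find('form')
action = urlparse.urljoin(BASE_URL, form.get('action', ''))
method = form.get('method', 'get').lower()

# Fields every submission needs (hidden/text inputs), plus one entry per
# submit button -- each button is assumed to correspond to a suburb.
base_params = {}
suburb_buttons = []
for inp in form.find_all('input'):
    name = inp.get('name')
    if not name:
        continue
    if inp.get('type', '').lower() == 'submit':
        suburb_buttons.append((name, inp.get('value', '')))
    else:
        base_params[name] = inp.get('value', '')

for name, value in suburb_buttons:
    params = dict(base_params)
    params[name] = value
    data = urllib.urlencode(params)
    if method == 'post':
        page = urllib2.urlopen(action, data).read()
    else:
        sep = '&' if '?' in action else '?'
        page = urllib2.urlopen(action + sep + data).read()

    # Pull the rent figures out of the first table on the suburb page.
    table = BeautifulSoup(page, 'html.parser').find('table')
    if table is None:
        continue
    for row in table.find_all('tr'):
        cells = [c.get_text(strip=True) for c in row.find_all(['th', 'td'])]
        print value, cells
```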

David
  • Thank you. I looked at the links, which provided some good examples of how to write HTML that produces forms. You mentioned POST and GET. I found examples where I could write an input field and use $_GET to reproduce the input. As the user, can I access $_GET? I want to know what to read up on to access these fields as the user, rather than as the designer of the web page. Would you mind giving me a pointer? – Alex Jan 26 '12 at 06:47
  • Looks like you've got some reading to do :). I can only give pointers. Don't worry about $_GET, which is server-side PHP, not client-side python. You will want to simulate what the browser does by parsing the form data and building an http request that you then send to the server. You'll need to learn about http and specifically html forms, and python as well, unless you know it already. You might find these useful: http://livecode.byu.edu/internet/aboutForms.php http://stackoverflow.com/questions/2081586/web-scraping-with-python http://docs.python.org/howto/urllib2.html – David Jan 26 '12 at 17:51
  • Thanks. Yes, I have a lot of reading to do. I wasn't familiar with the terminology to search, "client-side python." Thanks also for the pointer to urllib2, looks to be exactly what I needed to read. – Alex Jan 26 '12 at 23:30
  • I thought I'd post some resources for newbs like myself. Python client side - http://wwwsearch.sourceforge.net/mechanize/. Examples of web scraping in python - http://polstat.org/blog/2012/1/scraping-web-python/. – Alex Jan 27 '12 at 10:38