
I am attempting to retrieve information from a health-inspection website, parse the data into variables, and possibly save the records to a file. I suppose I could use a dictionary to store the information for each business.

The website in question is: http://www.swordsolutions.com/Inspections.

Clicking [Search] on the website will start displaying information.

I need to be able to pass some search data to the website, and then parse the information that is returned into variables and then to files.

I am fetching the website to a file using:

```python
import urllib

u = urllib.urlopen('http://www.swordsolutions.com/Inspections')
data = u.read()
u.close()
f = open('data.html', 'wb')
f.write(data)
f.close()
```

This is the data retrieved by urllib: http://bpaste.net/show/126433/ — it currently does not contain anything useful.

Any ideas?

Matthew Kelley
  • What data do you need to get from the web-site? – alecxe Aug 26 '13 at 18:58
  • For right now I am just looking to get [Name, Address, Date, Notes] but at the very least I need to be able to "select a city" and press the search button with python, if that makes any sense. – Matthew Kelley Aug 26 '13 at 19:02
  • You may want to use a library like `mechanize` here rather than trying to figure out how to automate the search site from scratch. You _definitely_ want to use an HTML parser (`BeautifulSoup`, one of the stdlib ones, one of the `lxml` ones, etc.) to find your values. And you'll really need to try something and show where you got stuck before you can get a more specific and useful answer on SO, because this site isn't good for interactively walking you through a design and development process. – abarnert Aug 26 '13 at 19:16
  • Also, you may want to look around the site and/or contact the owners to see if there is an API you can use (or, if you're partnering with them or something, if they can _add_ one), because it's a whole lot easier to use well-specified REST addresses and parse JSON responses than to try to automate a search form and scrape human-readable HTML. – abarnert Aug 26 '13 at 19:18
  • I am still somewhat new to python and I am diving in and trying to figure it out again and this is exactly where I got stuck. I just needed some direction of where to go next and not so much an answer to get my end result. [abarnert] and [alecxe], you both have been very helpful. I realize it would be nice if the site had even XML or some other API/responses, but I am not affiliated with the website, and it has been like this for years. I have doubts on them changing up the site just for me. I will look into BeautifulSoup and mechanize. – Matthew Kelley Aug 26 '13 at 19:34

1 Answer


I'll just point you in the right direction.

You want to submit a form with several pre-defined field values and then parse the data that comes back. The next steps depend on how easy it is to automate that form POST request.

You have plenty of options here:

  • Using your browser's developer tools, analyze what happens when you click "Submit". If it turns out to be a simple POST request, simulate it with urllib2, requests, mechanize, or whatever you like.
  • Give Scrapy and its FormRequest class a try.
  • Drive a real automated browser with selenium: fill in the fields, click submit, then retrieve and parse the data with the same tool.
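For the first option, here is a minimal sketch of building the POST request with the stdlib urllib (Python 3 naming). The field names (`City`, `Submit`) are hypothetical — inspect the real form in the developer tools to find the ones the site actually expects:

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Hypothetical field names -- check the form's <input> elements
# in your browser's developer tools for the real ones.
form_data = {
    'City': 'Springfield',
    'Submit': 'Search',
}

# A POST body is URL-encoded key=value pairs joined by '&'.
body = urlencode(form_data).encode('utf-8')

req = Request(
    'http://www.swordsolutions.com/Inspections',
    data=body,  # supplying data makes urllib issue a POST
    headers={'Content-Type': 'application/x-www-form-urlencoded'},
)

print(req.get_method())      # POST, because data was supplied
print(body.decode('utf-8'))  # City=Springfield&Submit=Search

# To actually send it:
# with urlopen(req) as resp:
#     html = resp.read()
```

The same dictionary drops straight into `requests.post(url, data=form_data)` if you prefer that library.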

Basically, if a lot of JavaScript logic is involved in the form-submission process, you'll have to go with an automated browsing tool like selenium.

Also note that there are several tools for parsing HTML: BeautifulSoup and lxml.
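BeautifulSoup or lxml are the comfortable choices for the parsing step; as a dependency-free illustration, even the stdlib `html.parser` can pull text out of table cells. The HTML snippet below is made up for the example — it is not the site's real markup:

```python
from html.parser import HTMLParser

class CellExtractor(HTMLParser):
    """Collect the text content of every <td> fed to the parser."""

    def __init__(self):
        super().__init__()
        self._in_cell = False
        self.cells = []

    def handle_starttag(self, tag, attrs):
        if tag == 'td':
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag == 'td':
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell and data.strip():
            self.cells.append(data.strip())

# Made-up sample row; the real page's markup will differ.
sample = '<tr><td>Acme Diner</td><td>12 Main St</td><td>2013-08-01</td></tr>'
parser = CellExtractor()
parser.feed(sample)
print(parser.cells)  # ['Acme Diner', '12 Main St', '2013-08-01']
```

With BeautifulSoup the same extraction collapses to `[td.get_text(strip=True) for td in soup.find_all('td')]`, which is why the answer recommends it for anything non-trivial.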

Hope that helps.

alecxe