1

I am trying to parse an HTML page, but I need to filter the results before I parse it.

For instance, 'http://www.ksl.com/index.php?nid=443' is a classified listing of cars in Utah. Instead of parsing ALL the cars, I'd like to filter the listing first (e.g. find all BMWs) and then parse only those pages. Is it possible to fill in a JavaScript form with Python?

Here's what I have so far:

import urllib

# Download the full, unfiltered listing page and save it locally (Python 2)
content = urllib.urlopen('http://www.ksl.com/index.php?nid=443').read()
with open('/var/www/bmw.html', 'w') as f:
    f.write(content)
user_78361084

2 Answers

2

Here is one way to do it. First download the listing page and scrape it to find the cars you are looking for; the matching ads give you the links to the detail pages to scrape next. There is no need for JavaScript here. This example and the BeautifulSoup documentation will get you going.

from BeautifulSoup import BeautifulSoup
import urllib2

base_url = 'http://www.ksl.com'
url = base_url + '/index.php?nid=443'
model = "Honda" # the make to look for in each ad title (e.g. "BMW")

# Load the page and process with BeautifulSoup
handle = urllib2.urlopen(url)
html = handle.read()
soup = BeautifulSoup(html)

# Collect all the ad detail boxes from the page
divs = soup.findAll(attrs={"class" : "detailBox"})

# For each ad, get the title;
# if it contains the search term, build the full link to the ad
for div in divs:
    title = div.find(attrs={"class" : "adTitle"}).text
    if model in title:
        link = div.find(attrs={"class" : "listlink"})["href"]
        link = base_url + link
        # Now you have a link that you can download and scrape
        print title, link
    else:
        print "No match: ", title

At the time of answering, this snippet looks for Honda listings and prints the following:

1995-  Honda Prelude http://www.ksl.com/index.php?sid=0&nid=443&tab=list/view&ad=8817797
No match:  1994-  Ford Escort
No match:  2006-  Land Rover Range Rover Sport
No match:  2006-  Nissan Maxima
No match:  1957-  Volvo 544
No match:  1996-  Subaru Legacy
No match:  2005-  Mazda Mazda6
No match:  1995-  Chevrolet Monte Carlo
2002-  Honda Accord http://www.ksl.com/index.php?sid=0&nid=443&tab=list/view&ad=8817784
No match:  2004-  Chevrolet Suburban (Chevrolet)
1998-  Honda Civic http://www.ksl.com/index.php?sid=0&nid=443&tab=list/view&ad=8817779
No match:  2004-  Nissan Titan
2001-  Honda Accord http://www.ksl.com/index.php?sid=0&nid=443&tab=list/view&ad=8817770
No match:  1999-  GMC Yukon
No match:  2007-  Toyota Tacoma
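
For readers on Python 3 without BeautifulSoup installed, the same filtering idea can be sketched with only the standard library. This is a rough sketch, not a drop-in replacement: it assumes the class names used in the snippet above (`detailBox`, `adTitle`, `listlink`), and `AdParser`, `find_matching_ads`, and `SAMPLE_HTML` are hypothetical names and made-up markup for illustration, not the site's real HTML.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

BASE_URL = "http://www.ksl.com"


class AdParser(HTMLParser):
    """Collect the title text and link of each ad box.

    Assumes the markup described in the answer above: a "detailBox"
    element per ad, containing an "adTitle" element and a "listlink" link.
    """

    def __init__(self):
        super().__init__()
        self.ads = []            # one dict per ad box found
        self._current = None     # the ad box currently being filled in
        self._title_tag = None   # tag name of the open adTitle element

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if "detailBox" in classes:
            self._current = {"title": "", "href": None}
            self.ads.append(self._current)
        elif self._current is not None:
            if "adTitle" in classes:
                self._title_tag = tag
            if "listlink" in classes and attrs.get("href"):
                self._current["href"] = attrs["href"]

    def handle_endtag(self, tag):
        # Simplification: the title ends at the first closing tag
        # with the same name as the adTitle element.
        if tag == self._title_tag:
            self._title_tag = None

    def handle_data(self, data):
        if self._current is not None and self._title_tag is not None:
            self._current["title"] += data


def find_matching_ads(html, make, base_url=BASE_URL):
    """Return (title, absolute_link) for ads whose title mentions `make`."""
    parser = AdParser()
    parser.feed(html)
    parser.close()
    matches = []
    for ad in parser.ads:
        title = " ".join(ad["title"].split())  # collapse whitespace
        if ad["href"] and make.lower() in title.lower():
            matches.append((title, urljoin(base_url, ad["href"])))
    return matches


# Made-up sample markup, for illustration only
SAMPLE_HTML = """
<div class="detailBox">
  <div class="adTitle"><a class="listlink" href="/index.php?nid=443&amp;ad=1">1995- Honda Prelude</a></div>
</div>
<div class="detailBox">
  <div class="adTitle"><a class="listlink" href="/index.php?nid=443&amp;ad=2">1994- Ford Escort</a></div>
</div>
"""

for title, link in find_matching_ads(SAMPLE_HTML, "Honda"):
    print(title, link)
```

Unlike the snippet above, the match here is case-insensitive, and ads without a link are skipped rather than raising an error.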
daedalus
  • That's an option, but in addition to this specific example, I wanted to know how to do a JavaScript query with Python, for my general understanding – user_78361084 May 03 '12 at 20:57
  • I guess it depends what you use it for. A number of [Python to JavaScript libraries are included here](http://stackoverflow.com/questions/683462/best-way-to-integrate-python-and-javascript) and may be good leads. On another track completely, you might be interested in [Selenium](http://seleniumhq.org/), which has a Python library to automate browsing/web testing, or the [mechanize module](http://wwwsearch.sourceforge.net/mechanize/) to fill forms...? – daedalus May 03 '12 at 21:10
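
On the form-filling point: a search form normally ends up sending its fields as ordinary HTTP request parameters, so it is often enough to see which parameters the browser sends when the form is submitted (for example in the browser's network inspector) and reproduce them in the URL. The sketch below only shows the mechanics. The parameter name `make` is a guess for illustration, not the site's real field name, and `build_search_url` is a hypothetical helper.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def build_search_url(base_url, extra_params):
    """Return base_url with extra_params merged into its query string."""
    parts = urlsplit(base_url)
    query = dict(parse_qsl(parts.query))   # existing parameters, e.g. nid=443
    query.update(extra_params)             # add or override the filter fields
    return urlunsplit(parts._replace(query=urlencode(query)))


# "make" is an assumed field name; read the real one from the page's form
url = build_search_url("http://www.ksl.com/index.php?nid=443", {"make": "BMW"})
print(url)
```

The resulting URL can then be passed to `urllib.request.urlopen` in place of the unfiltered listing URL. If the form submits with POST instead, the same encoded parameters go in the request body rather than the query string.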
-1

If you're using Python, Beautiful Soup is what you're looking for.

aldux
  • Indeed. It's not about JavaScript, but Python, since you're fetching data using urllib in Python. – aldux May 03 '12 at 19:57
  • but I need to filter the results before I can parse it w/ urllib in Python – user_78361084 May 03 '12 at 20:02
  • But Beautiful Soup will parse and filter the HTML page, so you don't have to use JavaScript to do that. Why do you want to use JavaScript? – aldux May 03 '12 at 20:05
  • That's my question... how do I filter the results? I didn't see anything in the Beautiful Soup docs on how to do that – user_78361084 May 03 '12 at 20:06
  • For lots of examples and documentation, check this page: http://www.crummy.com/software/BeautifulSoup/bs3/download/2.x/documentation.html – aldux May 03 '12 at 20:09
  • Again, I don't see anything there that will let me go to 'http://www.ksl.com/index.php?nid=443', enter my selected MAKE (in this case BMW) and then bring back only BMWs. – user_78361084 May 03 '12 at 20:13
  • I can figure out the parsing...I just need to figure out how to load the page that only has BMWs – user_78361084 May 03 '12 at 20:16