
I'm working in Python 3.2 (newbie) on a Windows machine (though I have Ubuntu 10.04 in VirtualBox if needed, I prefer to work on the Windows machine).

Basically, I'm able to work with the http and urllib modules to scrape web pages, but only those that don't have JavaScript like document.write("<div....") and the like adding data that isn't there when I fetch the actual page (meaning pages without real AJAX scripts).
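For example, something like this works for me on static pages (just a small sketch with a placeholder URL):

    import urllib.request

    # Fetch a static page; any JavaScript in it is never executed here,
    # so content added via document.write() is simply missing.
    url = 'http://example.com/some-static-page'
    with urllib.request.urlopen(url) as response:
        html = response.read().decode('utf-8')

    print(html)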

To process those kinds of sites as well, I'm pretty sure I need a browser JavaScript engine to execute the scripts on the page and give me the final result, hopefully as a dict or as text.

I tried to compile python-spidermonkey, but I understand it doesn't support Windows and doesn't work with Python 3.x :-?

Any suggestions? If anyone has done something like this before, I'd appreciate the help!

codeScriber

3 Answers


I recommend Python's bindings to the WebKit library - here is an example. WebKit is cross-platform and is used to render web pages in Chrome and Safari. An excellent library.
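A minimal sketch of the idea, assuming PyQt4 with its QtWebKit module (the linked example may use a different binding): load the URL in an off-screen WebKit page, let its JavaScript run, then read back the rendered HTML.

    import sys
    from PyQt4.QtGui import QApplication
    from PyQt4.QtCore import QUrl
    from PyQt4.QtWebKit import QWebPage

    class Render(QWebPage):
        """Load a URL in an off-screen WebKit page and keep the rendered HTML."""
        def __init__(self, url):
            self.app = QApplication(sys.argv)
            QWebPage.__init__(self)
            self.loadFinished.connect(self._finished)
            self.mainFrame().load(QUrl(url))
            self.app.exec_()

        def _finished(self, result):
            # By now the page's JavaScript has run, so document.write()
            # content is part of the DOM.
            self.html = self.mainFrame().toHtml()
            self.app.quit()

    page = Render('http://example.com')   # placeholder URL
    print(page.html)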

hoju

Use Firebug to see exactly what is being called to get the data to display (a POST or GET URL?). I suspect there's an AJAX call that retrieves the data from the server as either XML or JSON. Just make the same AJAX call yourself and parse the data.
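As a rough sketch with Python 3's urllib - the endpoint and the JSON layout here are made up, use whatever URL Firebug shows the page actually requesting:

    import json
    import urllib.request

    # Hypothetical endpoint taken from Firebug's Net panel; replace it with
    # the real URL the page requests.
    url = 'http://example.com/api/photos?page=1'
    with urllib.request.urlopen(url) as response:
        data = json.loads(response.read().decode('utf-8'))

    # The structure depends entirely on what the server returns.
    for item in data.get('items', []):
        print(item)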

Optionally, you can download Selenium for Firefox, start a Selenium server, download the page via Selenium, and get the DOM contents. MozRepl works as well, but doesn't have as much documentation since it's not widely used.
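With the Python selenium bindings that looks roughly like this (assuming Firefox is installed; the URL is a placeholder):

    from selenium import webdriver

    driver = webdriver.Firefox()      # drives a real Firefox instance
    driver.get('http://example.com')
    html = driver.page_source         # the DOM after JavaScript has executed
    driver.quit()

    print(html)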

Henley
  • I suspect you are right here; I already checked it out with Firebug since I could not find the image links myself in the web page. For this case it might work, but if I ever need something bigger it will be an ant's work, and I need something more substantial. Selenium was recommended by many, maybe I should give it a shot. – codeScriber Mar 19 '11 at 12:42

document.write is usually used because you are generating the content on the fly, often by fetching data from a server. What you get are web apps that are more about JavaScript than HTML. "Scraping" is really a question of downloading HTML and processing it, but here there isn't any HTML to download. You are essentially trying to scrape a GUI program.

Most of these applications have some sort of API, often returning XML or JSON data, that you can use instead. If there isn't one, you should probably try to remote-control a real web browser instead.

Lennart Regebro
  • You mean taking an approach like writing my own Firefox extension for reaping whatever information I want from the web page, I guess. You are partially right, my terms might be misleading. I wanted to harvest some photos from the National Geographic site (personal use); since I can do that in a browser, I guessed I could automate it. It might have been easier had I known Firefox extensions... but I do want to do it with a programming language, Java or Python; Python, being a scripting language, is preferred... – codeScriber Mar 17 '11 at 15:04
  • @codeScriber: You can also control Firefox from Python. I haven't done that, so I'm not entirely sure how it's best done, though. – Lennart Regebro Mar 18 '11 at 11:38
  • Writing your own browser extension would be a silly idea when there are perfectly good ones out there already. https://addons.mozilla.org/en-US/firefox/addon/mozrepl/ (I'd assume Python has a preexisting module that can talk to it as well as Perl's WWW::Mechanize::Firefox, but you can write your own easily enough) – Quentin Mar 18 '11 at 11:42