
I am trying to scrape a web site. Traditional HTML parsing, via "urllib2.urlopen" in Python or "htmlTreeParse" in R, fails to get the data from the page. The server does this intentionally, so View Source does not show the displayed data, but when I use the Inspect Element feature in Google Chrome (by right-clicking the page), I can see the data (a list of items and their info). My question is how to programmatically launch the desired pages and save what Inspect Element shows for each page. Alternatively, I could use a program that launches these links and somehow uses Ctrl-S to save an HTML copy of each link to the local disk.
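
For reference, this is roughly what I am doing now on the Python side (a minimal Python 2 sketch; the URL is one of the Yelp links I mention in the comments below). The fetch only returns the HTML the server sends, so the displayed items never appear in it:

import urllib2

# fetch the raw server-side HTML; anything the page builds with JavaScript
# after load is missing from this string
url = "http://www.yelp.ca/search?cflt=coffee&find_loc=Toronto%2C+ON&start=40"
html = urllib2.urlopen(url).read()

# saving this gives the same thing as View Source, without the listings,
# which is why I need what Inspect Element shows instead
open("view_source_copy.html", "w").write(html)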

user1848018
  • The page you're trying to parse is probably malformed, so urllib2 can't handle it but Chrome can. You might be able to parse it with other packages; see http://stackoverflow.com/questions/904644/how-to-parse-malformed-html-in-python and http://stackoverflow.com/questions/2676872/how-to-parse-malformed-html-in-python-using-standard-libraries. – Anubhav C May 01 '13 at 15:47
  • It is not malformed; they did it intentionally so that when you view the source, it does not contain the data. – user1848018 May 01 '13 at 16:02

2 Answers


You can use Greasemonkey or Tampermonkey to do this quite easily. You simply define the URL(s) in your userscript and then navigate to the page to invoke it. You can use a top page containing an iframe that navigates to each page on a schedule; when a page shows in the iframe, the userscript runs and your data is saved.

The scripting is basic JavaScript, nothing fancy; let me know if you need a starter. The biggest catch is downloading the file, a fairly new capability for JS, but it is simple to do with a download library, like mine (shameless plug).

So, basically, you can have a textarea with a list of URLs, one per line, grab a line, and set the iframe's .src to that URL, invoking the userscript. You can drill down into the page with CSS query selectors, or save the whole page; just grab the .outerHTML of the tag whose code you need. I'll be happy to illustrate if need be, but once you get this working, you'll never go back to server-to-server scraping again.

EDIT:

A framing dispatcher page to simply load each needed page into an iframe, thus triggering the userScript:

<html>
<iframe id="frame1"></iframe>
<script>
var base = "http://www.yelp.ca/search?cflt=coffee&find_loc=Toronto,%20ON&start="; // the part of the url that stays the same
var pages = [20, 40, 60, 80]; // all the differing url parts to be concat'd at the end
var delay = 1000 * 30;        // 30 sec delay between pages, adjust if needed
var slot = 0;                 // index in pages of the currently shown page

function doNext(){
  var page = pages[slot];
  slot = (slot + 1) % pages.length; // wrap around after the last page
  document.getElementById("frame1").src = base + page;
}

setInterval(doNext, delay);
</script>
</html>

EDIT2: userScript code:

// ==UserScript==
// @name       yelp scraper
// @namespace  http://anon.org
// @version    0.1
// @description  grab listing from yelp
// @match     http://www.yelp.ca/search?cflt=coffee&find_loc=Toronto,%20ON&start=*
// @copyright  2013, dandavis
// ==/UserScript==


// Q(selector[, root]): shorthand for querySelectorAll that returns a plain Array of matches;
// the optional second argument can be an element or a selector string for the search root
function Q(a,b){var t="querySelectorAll";b=b||document.documentElement;if(!b[t]){return}if(b.split){b=Q(b)[0]}return [].slice.call(b[t](a))||[]}

// download(data, fileName, mimeType): saves a string to disk by clicking a generated data: URI link;
// falls back to a hidden iframe for browsers without support for the anchor "download" attribute
function download(strData,strFileName,strMimeType){var D=document,A=arguments,a=D.createElement("a"),d=A[0],n=A[1],t=A[2]||"text/plain";a.href="data:"+strMimeType+","+escape(strData);if('download'in a){a.setAttribute("download",n);a.innerHTML="downloading...";D.body.appendChild(a);setTimeout(function(){var e=D.createEvent("MouseEvents");e.initMouseEvent("click",true,false,window,0,0,0,0,0,false,false,false,false,0,null);a.dispatchEvent(e);D.body.removeChild(a);},66);return true;};var f=D.createElement("iframe");D.body.appendChild(f);f.src="data:"+(A[2]?A[2]:"application/octet-stream")+(window.btoa?";base64":"")+","+(window.btoa?window.btoa:escape)(strData);setTimeout(function(){D.body.removeChild(f);},333);return true;}

// once the Yelp page has loaded inside the iframe, grab the listings block and save it,
// naming the file after the "start=" offset in the URL
window.addEventListener("load", function(){
  var code=Q("#businessresults")[0].outerHTML;
  download(code, "yelp_page_"+location.href.split("start=")[1].split("&")[0]+".txt", "x-application/nothing");
});

Note that it saves the HTML as .txt to avoid a Chrome warning about potentially harmful files. You can rename the files in bulk afterwards, or try making up a new extension and associating it with a browser.
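
If you want to script the bulk rename, here is a minimal Python sketch (the download folder path and file pattern are assumptions; point it at wherever Chrome saves your files):

import glob, os

# rename the saved yelp_page_*.txt downloads back to .html in one pass
# (the folder path is a guess; adjust it to your actual download directory)
for path in glob.glob(os.path.expanduser("~/Downloads/yelp_page_*.txt")):
    os.rename(path, path[:-4] + ".html")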

EDIT: Forgot to mention: for unattended use, turn off the file-saving confirmation in Chrome: Settings\Show advanced settings...\Ask where to save each file before downloading (uncheck it).

dandavis
  • I just added Tampermonkey. This is great, but I would appreciate some help, as I am kind of a newbie. I'm not sure how to do this: I have a list of links like "http://www.yelp.ca/search?cflt=coffee&find_loc=Toronto%2C+ON&start=40", where start will be 40, 60, 80, ...; how do I use Tampermonkey to launch and download these links? – user1848018 May 01 '13 at 17:31
  • Either by saving these links to the disk or saving the inspect-element output for each one; my email is ghofham@gmail in case it is long. Thanks so much. – user1848018 May 01 '13 at 17:40
  • Remember, using View Source on these links will not show the displayed data; that is why I have to either save the rendered page to disk or use Inspect Element. – user1848018 May 01 '13 at 17:49
  • I tested the code shown, and it saves the listing as real HTML. – dandavis May 01 '13 at 20:12
  • Glad to help. Forgot to mention: turn off the file-saving confirmation in Chrome for unattended use: Settings\Show advanced settings...\Ask where to save each file before downloading (uncheck it). – dandavis May 01 '13 at 21:04

I would check out Selenium to automate browser functions. You can automate a search by id/name and then check whether the element exists, or parse through the HTML however you would like, all in an automated fashion.
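
For example, here is a rough sketch in Python (assuming the selenium package and a browser driver are installed; the Yelp URLs and the 30-second wait are taken from the discussion above and are only illustrative). Selenium drives a real browser, so driver.page_source contains the DOM after JavaScript has run, essentially what Inspect Element shows, and you can write it straight to disk:

import time
from selenium import webdriver

# open a real browser; use webdriver.Chrome() instead if chromedriver is installed
driver = webdriver.Firefox()

base = "http://www.yelp.ca/search?cflt=coffee&find_loc=Toronto%2C+ON&start="
for start in [20, 40, 60, 80]:
    driver.get(base + str(start))
    time.sleep(30)  # crude wait for the page's JavaScript to finish rendering
    html = driver.page_source  # the rendered DOM, not just the server-side source
    with open("yelp_page_%d.html" % start, "w") as f:
        f.write(html.encode("utf-8"))  # page_source is unicode in Python 2

driver.quit()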

user856358
  • Thanks for the comment. This may be good for launching the links, but I need it to either save these pages to disk (Ctrl-S) or somehow use the Inspect Element feature from Chrome to access the data; I don't think it does that. – user1848018 May 01 '13 at 16:05