
Goal

I would like to use R to download the HTML of a Google Search webpage as shown in a web browser.

Problem

When I download the Google Search webpage HTML in R, using the exact same URL as in the web browser, I have noticed that the HTML R downloads is different from the web browser HTML. For example, for an advanced Google Search URL the date parameter is ignored in the HTML read in by R, whereas in the web browser it is applied.

Example

I do a Google Search in my web browser for "West End Theatre" and specify a date range of 1st January to 31st January 2012. I then copy the generated URL and paste it into R.

# Google Search URL from Firefox web browser
url <- "http://www.google.co.uk/search?q=west+end+theatre&ie=utf-8&oe=utf-8&aq=t&rls=org.mozilla:en-GB:official&client=firefox-a#q=west+end+theatre&hl=en&client=firefox-a&hs=z7I&rls=org.mozilla:en-GB%3Aofficial&prmd=imvns&sa=X&ei=rJE7T8fwM82WhQe_6eD2CQ&ved=0CGoQpwUoBw&source=lnt&tbs=cdr:1%2Ccd_min%3A1%2F1%2F2012%2Ccd_max%3A31%2F1%2F2012&tbm=&bav=on.2,or.r_gc.r_pw.r_qf.,cf.osb&fp=6f92152f78004c6d&biw=1600&bih=810"
u <- URLdecode(url)

# Webpage as seen in browser
browseURL(u)

# Webpage as seen from R
HTML <- paste(readLines(u), collapse = "\n")
cat(HTML, file = "output01.html")
shell.exec("output01.html")

# Webpage as seen from R through RCurl
library(RCurl)
cookie = 'cookiefile.txt'
curl = getCurlHandle(cookiefile = cookie,
                     useragent = "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.6) Gecko/20070725 Firefox/2.0.0.6",
                     header = FALSE,
                     verbose = TRUE,
                     netrc = TRUE,
                     maxredirs = as.integer(20),
                     followlocation = TRUE,
                     ssl.verifypeer = TRUE,
                     cainfo = system.file("CurlSSL", "cacert.pem", package = "RCurl"))
HTML2 <- getURL(u, curl = curl)
cat(HTML2, file = "output02.html")
shell.exec("output02.html")

By running the self-contained code above, I can see that the first webpage which opens is what I want (with the date parameter enforced), but the second and third webpages which open (as downloaded through R) have the date parameter ignored.

Question

How can I download the HTML for the first webpage which opens instead of the second/third webpages?

System Information

> sessionInfo()
R version 2.14.0 (2011-10-31)
Platform: x86_64-pc-mingw32/x64 (64-bit)

locale:
[1] LC_COLLATE=English_United Kingdom.1252  LC_CTYPE=English_United Kingdom.1252    LC_MONETARY=English_United Kingdom.1252 LC_NUMERIC=C                           
[5] LC_TIME=English_United Kingdom.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] RCurl_1.6-10.1 bitops_1.0-4.1

loaded via a namespace (and not attached):
[1] tools_2.14.0
Tony Breyal
  • Is the URLdecode prior to making the request necessary? – Matt Bridges Feb 15 '12 at 11:50
  • @MattBridges Unfortunately yes. E.g., the following produce different webpages, based on the code above, when entered into R: browseURL(url); browseURL(u) – Tony Breyal Feb 15 '12 at 11:56
  • I believe Google uses AJAX to reload results with those filtering options enabled. `readLines` and `getURL` obtain the page prior to the results of any AJAX calls. – jbaums Feb 15 '12 at 13:04
  • @jbaums Very interesting information I was not aware of. Is there a way for R to also obtain the page post-AJAX calls too? – Tony Breyal Feb 15 '12 at 13:06
  • @TonyBreyal: There may be, but little has turned up from my searches so far... [this](http://stackoverflow.com/a/8357420) and [this](http://stackoverflow.com/a/260614) may be useful. It seems one needs to be pretty familiar with js and AJAX in order to work out where the page is pulling data from, and replicate that process from R. – jbaums Feb 15 '12 at 13:41
  • Note that there are many differences between the browser and the manual pull -- most notably the user-agent string, cookies and JavaScript awareness. Google can act on any one of them. Unfortunately we can't test your code, because the URL doesn't work here. As for AJAX, it is no different from a regular request, so that is not the problem (you could use the tracing facilities of your browser to see where you get the content from). – Simon Urbanek Feb 15 '12 at 17:11
  • @jbaums those were interesting posts, thanks for pointing them out. I've tried setting the curl options but to no avail. The information about using developer tools in Firefox looks interesting and I've installed them, but it seems a bit beyond me at present. – Tony Breyal Feb 16 '12 at 09:31
  • @SimonUrbanek I'm currently trying to understand what the tracing abilities of Firefox/Chrome are, but I must admit to finding it somewhat difficult to understand. I'm not sure I understand your comment about the URL not working, as it and my self-contained code work fine from here and on another PC on a different network, both using Windows. Thank you for taking the time to reply however as I appreciate any help I can get :) – Tony Breyal Feb 16 '12 at 09:33
  • Ah, Google is checking the client: it responds with 403 Access Denied if you don't use the correct user-agent string. The follow-up result is JSON encoded - for example this URL 'http://www.google.co.uk/search?q=west+end+theatre&hl=en&client=firefox-a&hs=z7I&rls=org.mozilla:en-GB%3Aofficial&prmd=imvns&sa=X&ei=rJE7T8fwM82WhQe_6eD2CQ&ved=0CGoQpwUoBw&source=lnt&tbs=cdr:1%2Ccd_min%3A1%2F1%2F2012%2Ccd_max%3A31%2F1%2F2012&tbm=&fp=1&biw=1600&bih=810&bav=on.2,or.r_gc.r_pw.r_qf.,cf.osb&cad=b&tch=1&ech=1&psi=RF89T_O0Cce62gXvremSCA.1329422148627.3' – Simon Urbanek Feb 16 '12 at 20:02
  • @SimonUrbanek I have to be honest with you mate - I don't understand why you ended up with that as it's not something that happens on my end. I'm going to put this down to a lack of knowledge on my part however :) Looks like this question is way harder to solve than I had initially anticipated :( – Tony Breyal Feb 16 '12 at 22:55

2 Answers


Instead of trying to decode the results of Google's search pages, you can just use the Custom Search API. After getting an API key, you will be able to specify your search criteria through the URL, and receive a JSON file instead of having to decode the HTML. The rjson package will help you to read the JSON file into an R object, and extract the relevant data.

You will be limited to 1,000 queries a day, but it might be much easier to work with.
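
As a rough sketch (assuming you have obtained an API key and a custom search engine ID, both placeholders below; the sort parameter used for the date-range restriction is my reading of the API docs, so verify it against them), the query and JSON parsing might look like this:

# Sketch: query the Custom Search API and parse the JSON reply.
# api.key and cx are placeholders - substitute your own credentials.
library(RCurl)
library(rjson)

api.key <- "YOUR_API_KEY"
cx      <- "YOUR_SEARCH_ENGINE_ID"
q       <- URLencode("west end theatre", reserved = TRUE)

u2 <- paste("https://www.googleapis.com/customsearch/v1?key=", api.key,
            "&cx=", cx, "&q=", q,
            "&sort=date:r:20120101:20120131",  # date-range restriction (assumed syntax)
            sep = "")

json <- getURL(u2, ssl.verifypeer = TRUE,
               cainfo = system.file("CurlSSL", "cacert.pem", package = "RCurl"))
res <- fromJSON(json)

# res$items is a list with one element per search result
titles <- sapply(res$items, function(x) x$title)
links  <- sapply(res$items, function(x) x$link)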

EDIT: Notably, the Custom Search API has been deprecated.

nograpes
  • This is in theory a good suggestion and something which I have looked into before. However, the main problem I have with the Google Custom Search API is that it is not consistent with the results returned from, say, google.com, which is somewhat annoying. To quote Google: "your results are unlikely to match those returned by Google Web Search" - reference: http://www.google.com/support/customsearch/bin/answer.py?hl=en&answer=141877 – Tony Breyal Feb 16 '12 at 22:47
  • It sounds like that link is talking specifically about searching specific pages, try the deprecated [Web Search API](https://developers.google.com/web-search/) to search the whole web. – blahdiblah Feb 24 '12 at 00:40
  • It's an interesting idea to work with that API, but given its deprecated status I would be hesitant to invest time in it. A more robust long-term solution is required here (though I've completely hit a brick wall on this one if I'm being honest, as I'm finding it difficult to understand the R SpiderMonkey and RFirefox packages on omegahat.com which looked promising). – Tony Breyal Feb 28 '12 at 11:17
  • @martinbel: Not an appropriate edit, rolled back. It would be appropriate as a comment, or you could post a new answer that uses a better method. – nobody Apr 17 '14 at 15:27
  • @AndrewMedico I just made the edit myself. Although I appreciate the vigilance on inappropriate edits. – nograpes Apr 17 '14 at 21:20
  • @nograpes My edit was exactly the same as the current one. I don't see why it should be rolled back and then copy-pasted. – marbel Apr 18 '14 at 02:05
  • @MartinBel I didn't roll it back, someone else did. I examined your comment, decided it was a good idea to put in, and then added it myself. – nograpes Apr 18 '14 at 07:56

Part of your problem is that Google has profiled you and is returning matches based on what it knows from your previous searches, Gmail discussions, Google Maps use, IP address, location data, ads viewed, social contacts and other services. Some of this happens even if you don't have a Google account.

Signed-in personalization: When you’re signed in to a Google Account with Web History, Google personalizes your search experience based on what you’ve searched for and which sites you’ve visited in the past.

Signed-out personalization: When you’re not signed in, Google customizes your search experience based on past search information linked to your browser, using a cookie. Google stores up to 180 days of signed-out search activity linked to your browser’s cookie, including queries and results you click.

The only way to make your automated results match your manual ones is to try to match your profile. At the very least you should try sending the same User-Agent string as your browser and the same cookies (see the sketch below). You can find out what these are by sniffing your HTTP requests on the network or using a browser add-on like Live HTTP Headers.
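
For instance, a minimal sketch (the header values below are placeholders; substitute whatever your own browser actually sends, and note that u is the decoded URL from the question):

library(RCurl)

# Placeholder headers - copy the real values from Live HTTP Headers or a
# network sniffer; the Cookie string in particular must come from your
# own browser session.
headers <- c("User-Agent"      = "Mozilla/5.0 (Windows NT 6.1; rv:10.0) Gecko/20100101 Firefox/10.0",
             "Accept-Language" = "en-gb,en;q=0.5",
             "Cookie"          = "PREF=...; NID=...")

HTML3 <- getURL(u, httpheader = headers, followlocation = TRUE,
                ssl.verifypeer = TRUE,
                cainfo = system.file("CurlSSL", "cacert.pem", package = "RCurl"))
cat(HTML3, file = "output03.html")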

As for why the date filter is being ignored, I think jbaums' comment covers that. There's some stuff going on client-side that handles filtering and results-while-you-type. There may be a way around this, though, if you can trigger Google's old interface from before the AJAX stuff was added. See what you get from Google in your browser if you disable JavaScript.

SpliFF