
Searching Google for the keywords "health hospital" returns about 1,150,000,000 results. How can this count be obtained programmatically in R?

I have seen this link where they try to solve it using Java. How can this be done in R? An example code snippet would be appreciated.

Thanks.

user6633625673888
  • One starting point would be to try web scraping, and for that the `rvest` package is probably the best bet: `rvest::html('https://www.google.com/search?q=QUERY_HERE')`. BUT I think that's disallowed by Google, since it gives you a 403 Forbidden error. No doubt you saw the relevant commentary about that in the linked SO thread. The accepted answer uses a spoofed user-agent, but that's almost certainly not officially allowed. – arvi1000 May 12 '15 at 21:15
  • Take a look at the code in `http://www.l_m_g_t_f_y.com/?q=%22health+hospital%22` ... without the underscores, because there is SO nanny-code that blocks that particular acronym. – IRTFM May 12 '15 at 22:00
  • Valid question in my opinion, it features an example and an expected output, and it's useful to R users. – moodymudskipper Apr 12 '19 at 14:41

1 Answer


This modifies just one line of the code from the theBioBucket blog post "Get No. of Google Search Hits with R and XML":

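GoogleHits <- function(input)
{
    require(XML)
    require(RCurl)
    # Build the search URL; input must already be URL-encoded (e.g. "%20" for spaces)
    url <- paste("https://www.google.com/search?q=",
                 input, sep = "") # modified line
    # CA bundle shipped with RCurl, used to verify the HTTPS certificate
    CAINFO <- paste(system.file(package = "RCurl"), "/CurlSSL/ca-bundle.crt", sep = "")
    script <- getURL(url, followlocation = TRUE, cainfo = CAINFO)
    doc <- htmlParse(script)
    # Text of the element holding the result count
    res <- xpathSApply(doc, '//*/div[@id="resultStats"]', xmlValue)
    cat(paste("\nYour Search URL:\n", url, "\n", sep = ""))
    cat("\nNo. of Hits:\n") # get rid of cat text if not wanted
    # Strip every non-digit; as.integer() yields NA above .Machine$integer.max (2147483647)
    return(as.integer(gsub("[^0-9]", "", res)))
}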

# Example:
no.hits <- GoogleHits("health%20hospital")
#Your Search URL:
#https://www.google.com/search?q=health%20hospital
#
#No. of Hits:
no.hits
#[1] 1170000000
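
The last line of `GoogleHits` strips every non-digit from the scraped text and converts the result with `as.integer()`. That works for the count shown here, but it would return `NA` for counts above 2,147,483,647, and it would glue in any other digits that happen to appear in the same element. A minimal sketch of a more defensive parsing step follows. `parse_hits` is a hypothetical helper, not part of the original blog code, and the sample strings are assumed text, not captured Google output:

```r
# Hypothetical helper (not in the original answer): take the first
# run of digits and separators from the text and return it as a double,
# so counts larger than .Machine$integer.max do not become NA.
parse_hits <- function(stats_text) {
  if (length(stats_text) == 0L) return(NA_real_)   # element not found
  first <- stats_text[1]
  # First token that starts with a digit and contains only digits, "," or "."
  m <- regmatches(first, regexpr("[0-9][0-9,.]*", first))
  if (length(m) == 0L) return(NA_real_)            # no number in the text
  as.numeric(gsub("[^0-9]", "", m))                # drop thousands separators
}

# Assumed example strings, for illustration only
hits_a <- parse_hits("About 1,170,000,000 results (0.45 seconds)")
hits_b <- parse_hits("About 25,270,000,000 results")
hits_c <- parse_hits("no numbers here")
```

Inside `GoogleHits`, the final line could then be `return(parse_hits(res))`, assuming the `resultStats` element is still present in the page Google returns.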

I changed the url assignment from

url <- paste("https://www.google.com/search?q=\"", input, "\"", sep = "")

to

url <- paste("https://www.google.com/search?q=", input, sep = "")
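
The only difference between the two versions is whether the query is wrapped in double quotes, that is, whether Google is asked for the exact phrase or for the individual keywords. A small sketch of a URL builder that makes that choice explicit and also handles the URL encoding, so the caller does not have to type `%20` by hand. `build_search_url` and its `exact` argument are hypothetical names introduced here for illustration; `utils::URLencode` is base R:

```r
# Hypothetical helper (not in the original answer): build the search URL
# from raw, unencoded query text.
build_search_url <- function(query, exact = FALSE) {
  # Wrap in double quotes to request an exact-phrase search
  if (exact) query <- paste0('"', query, '"')
  # reserved = TRUE also encodes characters such as '"', '&' and '+'
  paste0("https://www.google.com/search?q=",
         utils::URLencode(query, reserved = TRUE))
}

url_keywords <- build_search_url("health hospital")
url_phrase   <- build_search_url("health hospital", exact = TRUE)
```

Note that `URLencode` leaves a string untouched if it already contains a `%XX` escape (its `repeated` argument defaults to `FALSE`), so this sketch expects plain text such as `"health hospital"` rather than `"health%20hospital"`.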
Jota