
There are good answers on SO about how to use readHTMLTable from the XML package, and I did that with regular http pages; however, I am not able to solve my problem with https pages.

I am trying to read the table on this website (url string in the code below):

library(RTidyHTML)
library(XML)
url <- "https://ned.nih.gov/search/ViewDetails.aspx?NIHID=0010121048"
h = htmlParse(url)
tables <- readHTMLTable(url)

But I get this error: File https://ned.nih.gov/search/Vi...does not exist.

I tried to get past the https problem with the first two lines below, found via Google (for example here: http://tonybreyal.wordpress.com/2012/01/13/r-a-quick-scrape-of-top-grossing-films-from-boxofficemojo-com/).

This trick helps me see more of the page, but any attempts to extract the table are not working. Any advice appreciated. I need the table fields like Organization, Organizational Title, Manager.

 # Attempt to get past the https problem
 library(RCurl)
 raw <- getURL(url, followlocation = TRUE, cainfo = system.file("CurlSSL", "cacert.pem", package = "RCurl"))
 head(raw)
[1] "\r\n<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Transitional//EN\" \"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd\">\n<html xmlns=\"http://www.w3.org/1999/xhtml\" xml:lang=\"en\" lang=\"en\">\n<head>\n<meta http-equiv=\"Content-Type\" content=\"text/html; 
...
 h = htmlParse(raw)
Error in htmlParse(raw) : File ...
tables <- readHTMLTable(raw)
Error in htmlParse(doc) : File ...
Charles
userJT

3 Answers


The new package httr provides a wrapper around RCurl to make it easier to scrape all kinds of pages.

Still, this page gave me a fair amount of trouble. The following works, but no doubt there are easier ways of doing it.

library("httr")
library("XML")

# Define certificate file
cafile <- system.file("CurlSSL", "cacert.pem", package = "RCurl")

# Read page
page <- GET(
  "https://ned.nih.gov/", 
  path="search/ViewDetails.aspx", 
  query="NIHID=0010121048",
  config(cainfo = cafile)
)

# Use regex to extract the desired table
x <- text_content(page)
tab <- sub('.*(<table class="grid".*?>.*</table>).*', '\\1', x)

# Parse the table
readHTMLTable(tab)

The results:

$ctl00_ContentPlaceHolder_dvPerson
                V1                                      V2
1      Legal Name:                    Dr Francis S Collins
2  Preferred Name:                      Dr Francis Collins
3          E-mail:                 francis.collins@nih.gov
4        Location: BG 1 RM 1261 CENTER DRBETHESDA MD 20814
5       Mail Stop:                                       Â
6           Phone:                            301-496-2433
7             Fax:                                       Â
8              IC:             OD (Office of the Director)
9    Organization:            Office of the Director (HNA)
10 Classification:                                Employee
11            TTY:                                       Â

Get httr here: http://cran.r-project.org/web/packages/httr/index.html


EDIT: Useful page with FAQ about the RCurl package: http://www.omegahat.org/RCurl/FAQ.html

Andrie
  • I get the following error with your code `Error in function (type, msg, asError = TRUE) : SSL certificate problem, verify that the CA cert is OK. Details: error:14090086:SSL routines:SSL3_GET_SERVER_CERTIFICATE:certificate verify failed` – Tyler Rinker May 22 '12 at 07:49
  • @TylerRinker Thank you. I tried my code in a clean session and it didn't work. Modification posted. This works in a clean session on my machine. (The edit is to include `ssl.verifypeer = FALSE` in the config.) – Andrie May 22 '12 at 11:17
  • Very nice solution and great use of the new httr. I was very curious how this one would be solved. Many government files are on https websites so educational researcher thank you :) +1 – Tyler Rinker May 22 '12 at 14:53
  • Is there a way to get the link information hidden in the HREF? For example, instead of "Colleen Barros" as manager, I could also know the ID 0010080638 (Manager: Colleen Barros). – userJT May 22 '12 at 17:16
  • Can you please edit the post to remove `ssl.verifypeer = FALSE` - it's really bad security practice, and is not necessary. (You also don't need to set the certificate path, httr does that for you) – hadley Jul 30 '14 at 17:43
  • @hadley Done. As you say, not necessary. Good catch, thank you. – Andrie Jul 30 '14 at 18:28
  • you may be able to just run `readHTMLTable(x)` without doing any regexp. `GET` is all you need to bypass the `https` barrier. – Brian D Apr 18 '18 at 18:13
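
A minimal sketch of Brian D's suggestion (an assumption on my part, using a current httr in which `content(page, "text")` replaces the old `text_content()`):

library(httr)
library(XML)

# GET handles the TLS handshake, so no certificate juggling is needed
page <- GET("https://ned.nih.gov/search/ViewDetails.aspx?NIHID=0010121048")

# Parse the response body as HTML text, then pull out every table
doc <- htmlParse(content(page, "text"), asText = TRUE)
readHTMLTable(doc)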

Using Andrie's great way to get past the https problem, here is a way to get at the data without readHTMLTable.

A table in HTML may have an ID. In this case the table has a nice one, and the XPath expression in getNodeSet picks it out directly.

library(httr)
library(XML)

# Define certificate file
cafile <- system.file("CurlSSL", "cacert.pem", package = "RCurl")
# Read page (ssl.verifypeer = FALSE skips certificate checking; see the comments above)
page <- GET(
  "https://ned.nih.gov/", 
  path="search/ViewDetails.aspx", 
  query="NIHID=0010121048",
  config(cainfo = cafile, ssl.verifypeer = FALSE)
)

h <- htmlParse(text_content(page))
ns <- getNodeSet(h, "//table[@id = 'ctl00_ContentPlaceHolder_dvPerson']")
ns

I still need to extract the IDs behind the hyperlinks. For example, instead of "Colleen Barros" as manager, I need to get to the ID 0010080638 behind the link (Manager: Colleen Barros).
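
One way to get at those IDs (a sketch, not part of the original answer; it assumes each person link's href carries an NIHID query parameter, e.g. ViewDetails.aspx?NIHID=0010080638):

# Collect the hrefs inside the person table, then strip out the NIHIDs
hrefs <- xpathSApply(h, "//table[@id = 'ctl00_ContentPlaceHolder_dvPerson']//a/@href")
ids <- sub(".*NIHID=([0-9]+).*", "\\1", grep("NIHID=", hrefs, value = TRUE))
ids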

userJT

This is the function I have to deal with this problem. It detects whether the URL contains https and uses httr if it does.

readHTMLTable2 = function(url, which = NULL, ...) {
  require(httr)
  require(XML)
  require(stringr)  # str_detect() below comes from stringr
  if (str_detect(url, "https")) {
    page <- GET(url, user_agent("httr-soccer-ranking"))
    doc <- htmlParse(text_content(page))
    if (is.null(which)) {
      tmp <- readHTMLTable(doc, ...)
    } else {
      # Pick out the requested table node before parsing it
      tableNodes <- getNodeSet(doc, "//table")
      tab <- tableNodes[[which]]
      tmp <- readHTMLTable(tab, ...)
    }
  } else {
    tmp <- readHTMLTable(url, which = which, ...)
  }
  return(tmp)
}
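
For example (a hypothetical call against the page from the question):

tab <- readHTMLTable2(
  "https://ned.nih.gov/search/ViewDetails.aspx?NIHID=0010121048",
  which = 1  # the first <table> node on the page
)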
Eli Holmes