I'm trying to scrape some table data from a password-protected website (I have a valid username/password) using R and have yet to succeed.

As an example, here's the login page for my dental insurer: http://www.deltadentalins.com/uc/index.html

I have tried the following:

library(httr)

# first attempt: POST the login form, then GET the protected page
download <- "https://www.deltadentalins.com/indService/faces/Home.jspx?_afrLoop=73359272573000&_afrWindowMode=0&_adf.ctrl-state=12pikd0f19_4"
terms <- "http://www.deltadentalins.com/uc/index.html"
values <- list(username = "username", password = "password",
               TARGET = "", SMAUTHREASON = "", POSTPRESERVATIONDATA = "",
               bundle = "all", dups = "yes")

POST(terms, body = values)    # submit the credentials
GET(download, query = values) # then request the protected page
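
I suspect the problem is that the POST and the GET don't share the session cookie. Here's a sketch of what I mean by reusing one handle (untested; the paths and field names are just pulled from the URLs above):

library(httr)

# untested sketch: one handle, so the cookie set by the login POST
# is sent along with the follow-up GET
h <- handle("https://www.deltadentalins.com")
login <- POST(handle = h, path = "/uc/index.html",
              body = values, encode = "form")
stop_for_status(login)  # fail loudly if the login request errored
page <- GET(handle = h, path = "/indService/faces/Home.jspx")
content(page, "text")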

I have also tried:

your.username <- 'username'
your.password <- 'password'

require(SAScii)
require(RCurl)
require(XML)

agent <- "Firefox/23.0"
options(RCurlOptions = list(cainfo = system.file("CurlSSL", "cacert.pem", package = "RCurl")))

# one curl handle that keeps cookies across requests
curl <- getCurlHandle()
curlSetOpt(
  cookiejar = 'cookies.txt',
  useragent = agent,
  followlocation = TRUE,
  autoreferer = TRUE,
  curl = curl
)

# parameters to pass to the website (pulled from the source html)
params <- list(
  'lt' = "",
  '_eventID' = "",
  'TARGET' = "",
  'SMAUTHREASON' = "",
  'POSTPRESERVATIONDATA' = "",
  'SMAGENTNAME' = agent,
  'username' = your.username,
  'password' = your.password
)

# log in via the SiteMinder form
html <- postForm('https://www.deltadentalins.com/siteminderagent/forms/login.fcc',
                 .params = params, curl = curl)

# inspect the response
html
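
If that login succeeds, my understanding is that the same handle should carry the session cookie on to the protected pages, along these lines (also untested):

# untested: reuse the logged-in handle for a protected page
protected <- getURL("https://www.deltadentalins.com/indService/faces/Home.jspx",
                    curl = curl)
tables <- readHTMLTable(protected, header = TRUE)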

I can't get either to work. Are there any experts out there who can help?


1 Answer

Updated 3/5/16 to work with package RSelenium

#### FRONT MATTER ####

library(devtools)
library(RSelenium)
library(XML)
library(plyr)

######################

## This block will open the Firefox browser, which is linked to R
RSelenium::checkForServer()  # downloads the Selenium standalone server if needed
startServer()
remDr <- remoteDriver()
remDr$open()
url <- "yoururl"             # your login URL
remDr$navigate(url)

This first section loads the required packages, sets the login URL, and then opens it in a Firefox instance. I type in my username and password manually, and once I'm logged in I can start scraping.
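
If you'd rather have R type the credentials for you, RSelenium can fill the form as well. A minimal sketch, assuming the login inputs are named username and password (check the page source for the real names):

# hypothetical field names -- inspect the login page's HTML to confirm
userBox <- remDr$findElement(using = "name", "username")
userBox$sendKeysToElement(list("your.username"))
passBox <- remDr$findElement(using = "name", "password")
passBox$sendKeysToElement(list("your.password", key = "enter"))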

infoTable <- readHTMLTable(remDr$getPageSource()[[1]], header = TRUE)
infoTable
Table1 <- infoTable[[1]]
Apps <- Table1[, 1] # Application Numbers

For this example, the first page contained two tables. The first is the one I'm interested in: a table of application numbers and names. I pull out the first column (the application numbers).
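
If you're not sure which table is which, it's worth peeking at everything readHTMLTable found before committing to an index:

length(infoTable)       # how many tables were parsed
lapply(infoTable, head) # first rows of each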

Links2 <- paste("https://yourURL?ApplicantID=", Apps, sep = "")

The data I want are stored in individual applications, so this bit creates the links that I want to loop through, one per application number.
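
For example, if Apps held the made-up IDs c("101", "102"), paste would produce:

# "https://yourURL?ApplicantID=101" "https://yourURL?ApplicantID=102"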

### Grab the contact info table from each page

LL <- lapply(seq_along(Links2), function(i) {
  remDr$navigate(Links2[i])
  infoTable <- readHTMLTable(remDr$getPageSource()[[1]], header = TRUE)

  # the contact table is either the 2nd or the 3rd table on the page
  if ("First Name" %in% colnames(infoTable[[2]])) {
    infoTable2 <- cbind(infoTable[[1]][1, ], infoTable[[2]][1, ])
  } else {
    infoTable2 <- cbind(infoTable[[1]][1, ], infoTable[[3]][1, ])
  }

  infoTable2
})

results <- do.call(rbind.fill, LL) # plyr::rbind.fill pads mismatched columns with NA
results
write.csv(results, "C:/pathway/results2.csv")

This final section loops through the link for each application, then grabs the table with their contact information (which is either table 2 or table 3, so R has to check first). Thanks again to Chinmay Patil for the tip on RSelenium!
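
One last bit of housekeeping: when the scrape is finished, close the browser session so the Selenium server isn't left holding an orphaned Firefox window:

# close the browser session when done
remDr$close()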
