I'm trying to scrape data from a password-protected website in R. Reading around, it seems that the httr and RCurl packages are the best options for scraping with password authentication (I've also looked into the XML package).

The website I'm trying to scrape is below (you need a free account in order to access the full page): http://subscribers.footballguys.com/myfbg/myviewprojections.php?projector=2

Here are my two attempts (replacing "username" with my username and "password" with my password):

#This returns "Status: 200" without the data from the page:
library(httr)
GET("http://subscribers.footballguys.com/myfbg/myviewprojections.php?projector=2", authenticate("username", "password"))

#This returns the non-password protected preview (i.e., not the full page):
library(XML)
library(RCurl)
readHTMLTable(getURL("http://subscribers.footballguys.com/myfbg/myviewprojections.php?projector=2", userpwd = "username:password"))

I have looked at other relevant posts (links below), but can't figure out how to apply their answers to my case.

How to use R to download a zipped file from a SSL page that requires cookies

How to webscrape secured pages in R (https links) (using readHTMLTable from XML package)?

Reading information from a password protected site

R - RCurl scrape data from a password-protected site

http://www.inside-r.org/questions/how-scrape-data-password-protected-https-website-using-r-hold

itpetersen

2 Answers

19

You can use RSelenium. I have used the dev version as you can run phantomjs without a Selenium Server.

# Install RSelenium if required. You will need phantomjs in your path or follow instructions
# in package vignettes
# devtools::install_github("ropensci/RSelenium")
# login first
appURL <- 'http://subscribers.footballguys.com/amember/login.php'
library(RSelenium)
library(XML) # provides readHTMLTable(), used below
pJS <- phantom() # start phantomjs
remDr <- remoteDriver(browserName = "phantomjs")
remDr$open()
remDr$navigate(appURL)
remDr$findElement("id", "login")$sendKeysToElement(list("myusername"))
remDr$findElement("id", "pass")$sendKeysToElement(list("mypass"))
remDr$findElement("css", ".am-login-form input[type='submit']")$clickElement()

appURL <- 'http://subscribers.footballguys.com/myfbg/myviewprojections.php?projector=2'
remDr$navigate(appURL)
tableElem <- remDr$findElement("css", "table.datamedium")
res <- readHTMLTable(tableElem$getElementAttribute("outerHTML")[[1]], header = TRUE)
> res[[1]][1:5, ]
  Rank             Name Tm/Bye Age Exp Cmp Att  Cm%  PYd Y/Att PTD Int Rsh  Yd TD FantPt
1    1   Peyton Manning  DEN/4  38  17 415 620 66.9 4929  7.95  43  12  24   7  0 407.15
2    2       Drew Brees   NO/6  35  14 404 615 65.7 4859  7.90  37  16  22  44  1 385.35
3    3    Aaron Rodgers   GB/9  31  10 364 560 65.0 4446  7.94  33  13  52 224  3 381.70
4    4      Andrew Luck IND/10  25   3 366 610 60.0 4423  7.25  27  13  62 338  2 361.95
5    5 Matthew Stafford  DET/9  26   6 377 643 58.6 4668  7.26  32  19  34 102  1 358.60

Finally, when you are finished, close phantomjs:

pJS$stop()

If you would rather use a traditional browser such as Firefox (for example, to stick to the RSelenium version on CRAN), you would use:

RSelenium::startServer()
remDr <- remoteDriver()
........
........
remDr$closeServer()

in place of the related phantomjs calls.
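
The comments on this answer mention that newer RSelenium releases provide rsDriver(), which starts the server and the browser in one call. A minimal sketch of the same login flow on that interface might look like the following. It is untested against the site; scrape_projections is a hypothetical helper name, and the URLs and selectors are copied from the code above on the assumption that the page has not changed.

```r
# Sketch only: scrape_projections is a hypothetical helper, and the selectors
# and URLs are taken from the answer above (the site may have changed since).
scrape_projections <- function(username, password, port = 5555L) {
  # Start a Selenium server and a Firefox session in one call
  rD <- RSelenium::rsDriver(port = port, browser = "firefox")
  remDr <- rD[["client"]]

  # Close the browser and stop the server even if a step below fails
  on.exit({
    remDr$close()
    rD[["server"]]$stop()
  }, add = TRUE)

  # Log in through the web form
  remDr$navigate("http://subscribers.footballguys.com/amember/login.php")
  remDr$findElement("id", "login")$sendKeysToElement(list(username))
  remDr$findElement("id", "pass")$sendKeysToElement(list(password))
  remDr$findElement("css selector",
                    ".am-login-form input[type='submit']")$clickElement()

  # Load the protected page and parse the projections table
  remDr$navigate(
    "http://subscribers.footballguys.com/myfbg/myviewprojections.php?projector=2"
  )
  tableElem <- remDr$findElement("css selector", "table.datamedium")
  html <- tableElem$getElementAttribute("outerHTML")[[1]]
  XML::readHTMLTable(html, header = TRUE)[[1]]
}
```

Calling scrape_projections("myusername", "mypass") would then return the table as a data frame, provided the login succeeds.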

jdharrison
  • Thanks, this is a very versatile approach to solve this. – Steve G. Jones Aug 11 '16 at 09:42
  • While overall this is a very useful answer, note that the package has since advanced, allowing more convenient browsing through Chrome, Firefox or IE without the need for phantomjs, for example using `rD <- RSelenium::rsDriver(port = 5555L, 'firefox'); remDr <- rD[["client"]]` and then following the original answer. – runr Feb 03 '17 at 14:47
  • @Nutle Good points, and the phantom function is deprecated in favour of wdman::phantomjs, so maybe this answer needs updating. – jdharrison Feb 03 '17 at 15:18
  • Not sure why Google still has these really old R posts as the top result for questions like this. At any rate, after a LONG while of trying many different approaches including Rvest and httr, I can say with confidence that in 2023 RSelenium is the only way to go. I didn't want to mess with it at first due to needing to install Java. It's not the most elegant solution, but in many cases it's the only solution that's not an absolute nightmare. With so many websites using autogenerated code that is indecipherable to a human, those other solutions just aren't feasible. RSelenium is the way to go. – imfm Apr 16 '23 at 01:06
  • Nowadays, RSelenium has Chrome drivers that work perfectly. A bonus is that as far as the website admins are concerned, there's nothing strange about how you are accessing their website. As long as you aren't scraping the entire site, you should be fine in terms of TOS etc. The only caveat is that there appears to be a bug that requires you to go into the binman install directory (on Windows, "C:\Users\YOURUSERNAME\AppData\Local\binman\binman_chromedriver\win32\VERSIONNUMBER\") and delete the license file. It seems to cause some kind of conflict that prevents RSelenium from running properly. – imfm Apr 16 '23 at 01:12
16

I don't have an account to test with, but maybe this will work:

library(httr)
library(XML)

handle <- handle("http://subscribers.footballguys.com") 
path   <- "amember/login.php"

# fields found in the login form.
login <- list(
  amember_login = "username",
  amember_pass  = "password",
  amember_redirect_url =
    "http://subscribers.footballguys.com/myfbg/myviewprojections.php?projector=2"
)

response <- POST(handle = handle, path = path, body = login)

Now the response object might hold what you need. Alternatively, you may be able to query the page of interest directly after the login request: I am not sure the redirect will work, but it is a field in the web form. The handle can be re-used for subsequent requests, so the session cookies set at login are sent along. I can't test it here, but this approach works for me in many situations.
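
If the redirect field is not honoured, a follow-up request on the same handle could look roughly like this. It is an untested sketch: fetch_projections is a hypothetical helper name, and the form field names and paths are the ones used above, assumed to still match the site.

```r
# Sketch only: fetch_projections is a hypothetical helper; field names and
# paths are copied from the answer above and are not verified here.
fetch_projections <- function(username, password) {
  # One handle for both requests, so cookies from the login are re-sent
  h <- httr::handle("http://subscribers.footballguys.com")

  login <- list(
    amember_login = username,
    amember_pass  = password
  )

  # Step 1: submit the login form
  resp <- httr::POST(handle = h, path = "amember/login.php", body = login)
  httr::stop_for_status(resp)

  # Step 2: request the protected page on the same handle
  page <- httr::GET(
    handle = h,
    path   = "myfbg/myviewprojections.php",
    query  = list(projector = 2)
  )
  httr::stop_for_status(page)

  # Return the raw HTML as a single string for later parsing
  httr::content(page, as = "text")
}
```

Note that a "Status: 200" alone does not prove the login worked, as the question shows, so it is worth checking that the returned HTML actually contains the full table.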

You can extract the table using the XML package:

> readHTMLTable(content(response))[[1]][1:5,]
  Rank             Name Tm/Bye Age Exp Cmp Att  Cm%  PYd Y/Att PTD Int Rsh  Yd TD FantPt
1    1   Peyton Manning  DEN/4  38  17 415 620 66.9 4929  7.95  43  12  24   7  0 407.15
2    2       Drew Brees   NO/6  35  14 404 615 65.7 4859  7.90  37  16  22  44  1 385.35
3    3    Aaron Rodgers   GB/9  31  10 364 560 65.0 4446  7.94  33  13  52 224  3 381.70
4    4      Andrew Luck IND/10  25   3 366 610 60.0 4423  7.25  27  13  62 338  2 361.95
5    5 Matthew Stafford  DET/9  26   6 377 643 58.6 4668  7.26  32  19  34 102  1 358.60
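
In newer httr versions, content() on an HTML response may return an xml2 document, which readHTMLTable() does not accept. One way around this is to take the body as text and parse it with the XML package. The sketch below shows that parsing step offline; the html string is a hypothetical stand-in for content(response, as = "text"), built from two rows and three columns of the output above.

```r
library(XML)

# Hypothetical stand-in for content(response, as = "text"): a small page
# holding one table with a few columns from the output above.
html <- '<html><body><table class="datamedium">
<tr><th>Rank</th><th>Name</th><th>FantPt</th></tr>
<tr><td>1</td><td>Peyton Manning</td><td>407.15</td></tr>
<tr><td>2</td><td>Drew Brees</td><td>385.35</td></tr>
</table></body></html>'

# Parse the string as HTML, then pull every <table> into a data frame
doc <- htmlParse(html, asText = TRUE)
tables <- readHTMLTable(doc, header = TRUE, stringsAsFactors = FALSE)

# Keep the first table and convert the points column from text to numbers
projections <- tables[[1]]
projections$FantPt <- as.numeric(projections$FantPt)
print(projections)
```

Setting stringsAsFactors = FALSE keeps the cells as character strings, so as.numeric() converts the values themselves rather than factor codes.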
jdharrison
Stefan