I am trying to crawl and scrape a website's tables. I have an account on the website, and I found that Rcrawler could help me get parts of the table based on specific keywords, etc. The problem is that the GitHub page makes no mention of how to crawl a site with account/password protection.

An example of signing in would be:

login <- list(username = "username", password = "password")

Do you have any idea whether Rcrawler has this functionality? For example, something like:

Rcrawler(Website = "http://www.glofile.com",
         login = list(username = "username", password = "password"),
         no_cores = 4, no_conn = 4,
         ExtractCSSPat = c(".entry-title", ".entry-content"),
         PatternsNames = c("Title", "Content"))

I'm confident my code above is wrong, but I hope it gives you an idea of what I want to do.

1 Answer

To crawl or scrape password-protected websites in R, more precisely ones behind HTML-based authentication, you need to use a web driver to simulate a login session. Fortunately, this is possible since Rcrawler v0.1.9, which implements the PhantomJS web driver (a headless browser, i.e. a browser without a graphical interface).

In the following example we will try to log in to a blog website.

 library(Rcrawler)

Download and install the web driver

install_browser()

Run the browser session

br <- run_browser()

If you get an error, disable your antivirus temporarily or allow the program in your system settings

Run an automated login action and return a logged-in session if successful

br <- LoginSession(Browser = br, LoginURL = 'http://glofile.com/wp-login.php',
                   LoginCredentials = c('demo', 'rc@pass@r'),
                   cssLoginFields = c('#user_login', '#user_pass'),
                   cssLoginButton = '#wp-submit')

Finally, if you already know the private pages you want to scrape/download, use

DATA <- ContentScraper(..., browser = br)
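
For instance, a sketch of a concrete call, assuming the CssPatterns/PatternsName arguments as I recall them from the Rcrawler documentation (the URL and selectors below are placeholders for illustration, not taken from the answer):

# Hypothetical: scrape the title and content of one known private post
# through the logged-in browser session. URL and selectors are assumptions.
DATA <- ContentScraper(Url = "http://glofile.com/?p=123",
                       CssPatterns = c(".entry-title", ".entry-content"),
                       PatternsName = c("Title", "Content"),
                       browser = br)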

Or, simply crawl/scrape/download all pages

Rcrawler(Website = "http://glofile.com/", no_cores = 1, no_conn = 1, LoggedSession = br, ...)

Don't use many parallel connections (no_cores/no_conn), as many websites reject multiple sessions from one user. Stay legit and honor robots.txt by setting Obeyrobots = TRUE
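
Putting the pieces together with the extraction patterns from the question, a full crawl could look roughly like the sketch below (single connection, robots.txt honored; treat the exact argument combination as an assumption rather than a verified recipe):

# Sketch: crawl the site through the logged-in session with one connection,
# honoring robots.txt, and extract title/content via the question's CSS selectors.
Rcrawler(Website = "http://glofile.com/",
         no_cores = 1, no_conn = 1,
         LoggedSession = br,
         Obeyrobots = TRUE,
         ExtractCSSPat = c(".entry-title", ".entry-content"),
         PatternsNames = c("Title", "Content"))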

You can also access the browser session's functions, for example:

 br$session$getUrl()
 br$session$getTitle()
 br$session$takeScreenshot(file = "image.png") 
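
When you are finished, it is good practice to shut down the headless browser process. If I recall correctly, Rcrawler ships a stop_browser() helper for this; treat the exact call as an assumption:

# Assumption: stop_browser() terminates the PhantomJS process started by run_browser().
stop_browser(br)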
SalimK
  • Thank you very much for your answer. Using PhantomJS looks like an interesting idea. I am no longer working on the project that needed a creative Rcrawler solution, but I suspect the challenge with PhantomJS would have been their network security. If it is allowed to run, it might be an amazing solution. – Tasos Dalis Jan 07 '19 at 08:34