I'm using Rcrawler to crawl a vector of URLs. For most of them it works well, but every now and then one of them doesn't get crawled. At first I only noticed this on https:// sites, which was addressed here. But I'm using version 0.1.7, which is supposed to have https:// support.
I also found another user who is having the same problem, but with http:// links as well. I checked on my instance, and the sites from that post didn't crawl properly for me either.
Here's what I get when I try to crawl one of these sites:
>library(Rcrawler)
>Rcrawler("https://manager.submittable.com/beta/discover/?page=1&sort=")
>In process : 1..
Progress: 100.00 % : 1 parssed from 1 | Collected pages: 1 |
Level: 1
+ Check INDEX dataframe variable to see crawling details
+ Collected web pages are stored in Project folder
+ Project folder name : manager.submittable.com-191922
+ Project folder path : /home/anna/Documents/Rstudio/Submittable/manager.submittable.com-191922
Any thoughts? I'm still waiting for a reply from the package's creator.
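One diagnostic I'm considering (a sketch, assuming Rcrawler 0.1.7, where LinkExtractor is documented to return a list with Info, InternalLinks, and ExternalLinks components) is to fetch the problem page once and check whether the crawler can see any internal links at all:

```r
# Diagnostic sketch: fetch the page a single time with Rcrawler's own
# LinkExtractor and inspect what the crawler actually extracts from it.
library(Rcrawler)

page <- LinkExtractor(url = "https://manager.submittable.com/beta/discover/?page=1&sort=")

# If InternalLinks comes back empty, Rcrawler has nothing to follow past
# level 1, which would match the behavior above: one page collected, done.
length(page$InternalLinks)
```

If that count is zero, the issue may be that the page builds its links with JavaScript rather than serving them in the HTML, so the crawler stops after the first page; but that's a guess on my part.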