I am trying to scrape some data that is rendered with JavaScript. I wanted to go the PhantomJS route, but I am running into issues invoking phantomjs from within R.

I downloaded PhantomJS, placed the file in my working directory, and tried to run the following code found here:

library(rvest)

url <- "http://64px.com/instagram/"

# write out a script phantomjs can process

writeLines(sprintf("var page = require('webpage').create();
page.open('%s', function () {
    console.log(page.content); //page source
    phantom.exit();
});", url), con="scrape.js")

# process it with phantomjs

system("phantomjs scrape.js > scrape.html")

The last command generates this error:

sh: phantomjs: command not found

I did some searching and it may have to do with my PATH. I followed the advice here and created the symlink below, but it still throws the same error.

sudo ln -s /phantomjs-2.0.0-macosx/bin/phantomjs /usr/local/bin/

Any idea why it's not finding the phantomjs executable?

Thanks.

Session Info:

R version 3.2.2 (2015-08-14)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.11.2 (El Capitan)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] ggvis_0.4.2     knitr_1.11      dplyr_0.4.3     plyr_1.8.3      stringr_1.0.0   rvest_0.2.0    
 [7] magrittr_1.5    RSelenium_1.3.5 XML_3.98-1.3    RJSONIO_1.3-0   RCurl_1.95-4.7  bitops_1.0-6   
[13] pacman_0.3.0   

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.1     xtable_1.7-4    R6_2.1.1        httr_1.0.0      highr_0.5       caTools_1.17.1 
 [7] tools_3.2.2     parallel_3.2.2  DBI_0.3.1       htmltools_0.2.6 assertthat_0.1  digest_0.6.8   
[13] shiny_0.12.2    formatR_1.2     mime_0.3        evaluate_0.7.2  stringi_0.5-5   httpuv_1.3.3
BillPetti
  • Did you try using the full path to your executable? – mrub Jan 02 '16 at 13:20
  • @mrub I did try system("/phantomjs-2.0.0-macosx/bin/phantomjs scrape.js > scrape.html") and got the same message – BillPetti Jan 02 '16 at 13:40
  • @BillPetti I have no idea about Mac, but `/phantomjs-2.0.0-macosx` as a directory under root seems wrong to me. Are you sure this is the full absolute path? – Artjom B. Jan 02 '16 at 13:58
  • Have you tried running phantomjs from the terminal first? – Jakub Kania Jan 02 '16 at 14:22
  • @ArtjomB. Sorry, yes, this is the full file path I tried `system("/Users/williampetti/phantomjs-2.0.0-macosx/bin/phantomjs scrape.js > scrape.html")`. Still doesn't seem to be working, here is the response: `sh: line 1: 1773 Killed: 9 /Users/williampetti/phantomjs-2.0.0-macosx/bin/phantomjs scrape.js > scrape.html`. The `scrape.html` file has not been populated via the script. – BillPetti Jan 02 '16 at 15:24
  • @JakubKania I don't really work from the command line, but when I try this (`/Users/williampetti/phantomjs-2.0.0-macosx/bin/phantomjs hello.js` where `hello.js` contains the following `console.log('Hello, world!'); phantom.exit();`) the terminal prints `Killed: 9`. – BillPetti Jan 02 '16 at 15:26
  • @BillPetti well, that's a different problem. Try http://stackoverflow.com/questions/28267809/phantomjs-getting-killed-9-for-anything-im-trying and then work on the path. – Jakub Kania Jan 02 '16 at 15:39
  • @JakubKania bingo! Used the download from [here](https://github.com/eugene1g/phantomjs/releases), moved phantomjs to my working directory and all worked as expected. Many thanks. – BillPetti Jan 02 '16 at 18:25
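
For anyone who hits the same `command not found` message, here is a minimal sketch of how one might check from within R whether the binary is visible before calling it. It assumes the `scrape.js` file from the question already exists; the fallback location (a copy of `phantomjs` in the working directory) is only an assumption based on the comment above, so adjust it to wherever the executable actually lives.

```r
# Look the executable up on the PATH that R's shell sees;
# Sys.which() returns "" when nothing is found
phantom_path <- unname(Sys.which("phantomjs"))

if (!nzchar(phantom_path)) {
  # Assumed fallback: a copy of the binary sitting in the working directory
  phantom_path <- file.path(getwd(), "phantomjs")
}

if (file.exists(phantom_path)) {
  # system2() takes the command and its arguments separately and can
  # send stdout to a file without relying on shell redirection
  status <- system2(phantom_path, args = "scrape.js", stdout = "scrape.html")
  message("phantomjs exit status: ", status)
} else {
  message("phantomjs not found at: ", phantom_path)
}
```

Note that a binary that is found but immediately dies with `Killed: 9`, as described in the comments, is a separate problem from the PATH lookup and needs a different build of the executable.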

1 Answer

Unfortunately, the 64px site doesn't use an XHR request for the data; it populates that "top" list from an inline script on the main page. You can avoid a system call and stay in R if you do the following:

library(rvest)
library(V8)

url <- "http://64px.com/instagram/"
pg <- read_html(url)

script_data <- html_nodes(pg, "script")[[3]]
dat <- gsub("\\$\\(function.*$", "", html_text(script_data))

ctx <- v8()
ctx$eval(dat)
head(ctx$get("accounts"))
##        username followers followers_now
## 1     instagram  64131228      45251017
## 2  justinbieber  23817614      20279386
## 3 kimkardashian  23519002      22218039
## 4       beyonce  22207790      21375819
## 5  arianagrande  21748827      20219621
## 6   selenagomez  19572601      18456569

i.e., target the inline <script> section that creates the data, trim off the trailing $(function ...) part (the JavaScript will error otherwise, since the V8 context has no jQuery or DOM), then grab the resultant data. It's a little extra detective work, but much less heavyweight than using phantomjs and especially Selenium.
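
To make the trimming step concrete, here is a small sketch in base R on a toy string. The script text is a made-up stand-in, not the real page source, and `render` is a hypothetical function name used only for illustration.

```r
# Toy stand-in for the inline <script> text (hypothetical content)
script_text <- paste(
  "var accounts = [{\"username\": \"instagram\", \"followers\": 1}];",
  "$(function() { render(accounts); });",
  sep = "\n"
)

# Drop everything from the first "$(function" to the end of the string,
# using the same pattern as in the answer
trimmed <- gsub("\\$\\(function.*$", "", script_text)

# Only the plain data assignment is left, which a bare JS engine can evaluate
cat(trimmed)
```

The part that remains is ordinary JavaScript with no reference to `$`, which is why it can be passed to `ctx$eval()` and read back with `ctx$get()`.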

hrbrmstr