
I am querying Freebase to get the genre information for some 10000 movies.

After reading "How to optimise scraping with getURL() in R", I tried to execute the requests in parallel. However, I failed (see below). Besides parallelization, I also read that httr might be a better alternative to RCurl.

My questions are: Is it possible to speed up the API calls by (a) running the loop below in parallel (on a Windows machine), or (b) using alternatives to getURL such as GET from the httr package? (A rough sketch of what I mean by (b) follows the code below.)

library(RCurl)
library(jsonlite)
library(foreach)
library(doSNOW)

df <- data.frame(film=c("Terminator", "Die Hard", "Philadelphia", "A Perfect World", "The Parade", "ParaNorman", "Passengers", "Pink Cadillac", "Pleasantville", "Police Academy", "The Polar Express", "Platoon"), genre=NA)

# Query the Freebase search API for one film title and return its genre(s)
# collapsed into a single "A | B" string
f_query_freebase <- function(film.title){

  request <- paste0("https://www.googleapis.com/freebase/v1/search?",
                    "filter=", paste0("(all alias{full}:", "\"", film.title, "\"", " type:\"/film/film\")"),
                    "&indent=TRUE",
                    "&limit=1",
                    "&output=(/film/film/genre)")

  temp <- getURL(URLencode(request), ssl.verifypeer = FALSE)
  data <- fromJSON(temp, simplifyVector=FALSE)
  genre <- paste(sapply(data$result[[1]]$output$`/film/film/genre`[[1]], function(x){as.character(x$name)}), collapse=" | ")
  return(genre)
}


# Non-parallel version
# ----------------------------------

for (i in df$film){
  df$genre[which(df$film==i)] <- f_query_freebase(i)      
}


# Parallel version - Does not work
# ----------------------------------

# Set up parallel computing
cl <- makeCluster(2)
registerDoSNOW(cl)

foreach(i=df$film) %dopar% {
  df$genre[which(df$film==i)] <- f_query_freebase(i)     
}

stopCluster(cl)

# --> I get the following error: "Error in { : task 1 failed", which goes on to say that it could not find the function "getURL".
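For reference, here is a rough, untested sketch of what I mean by (b), i.e. the same request rewritten with httr's GET (the helper name f_query_freebase_httr is just for illustration):

library(httr)
library(jsonlite)

f_query_freebase_httr <- function(film.title){
  # Same Freebase search request as above, but built with httr::GET and a query list
  resp <- GET("https://www.googleapis.com/freebase/v1/search",
              query = list(
                filter = paste0("(all alias{full}:\"", film.title, "\" type:\"/film/film\")"),
                indent = "TRUE",
                limit  = 1,
                output = "(/film/film/genre)"))
  data <- fromJSON(content(resp, as = "text"), simplifyVector = FALSE)
  paste(sapply(data$result[[1]]$output$`/film/film/genre`[[1]],
               function(x) as.character(x$name)), collapse = " | ")
}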
    Multi-core is unlikely to speed up web-requests. Read http://stackoverflow.com/questions/22940150/fast-url-query-with-r/22942357#22942357 to use connection pipelining. But be aware that you're hammering someone else's server, so be polite. – hadley Apr 10 '14 at 18:00
  • To get the foreach version to work it looks like you need to add the `.packages=c("RCurl", "jsonlite")` option to foreach so those packages are loaded by the workers. – Steve Weston Apr 10 '14 at 23:49
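Following Steve Weston's comment, a minimal sketch of the foreach call with the .packages option added; it is also assumed here that the result is returned from the loop and assigned outside it, since %dopar% workers operate on copies of df rather than the original:

# Workers need RCurl and jsonlite loaded explicitly via .packages;
# results are combined with c() and assigned back to df in the main session
df$genre <- foreach(i = df$film, .combine = "c",
                    .packages = c("RCurl", "jsonlite")) %dopar% {
  f_query_freebase(i)
}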

1 Answer


This doesn't achieve parallel requests within a single R session; however, it's something I've used to make more than one request at a time (i.e. in parallel) across multiple R sessions, so it may be useful.

At a high level

You'll want to break the process into a few parts:

  1. Get a list of the URLs/API calls you need to make and store it as a csv/text file (a sketch of this step follows the list)
  2. Use the code below as a template for starting multiple R processes and dividing the work among them
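For step 1, a minimal sketch of how the list of API calls could be written out, reusing the request-building logic from the question; the file name api_calls.csv and its film/url columns are assumptions that match the R script further down:

# Build one Freebase search URL per film and save them for the worker processes to read
requests <- sapply(df$film, function(film.title){
  URLencode(paste0("https://www.googleapis.com/freebase/v1/search?",
                   "filter=(all alias{full}:\"", film.title, "\" type:\"/film/film\")",
                   "&indent=TRUE",
                   "&limit=1",
                   "&output=(/film/film/genre)"))
})
write.csv(data.frame(film = df$film, url = requests),
          "api_calls.csv", row.names = FALSE)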

Note: this happened to run on Windows, so I used PowerShell. On a Mac this could be written in bash.

PowerShell/bash script

Use a single PowerShell script to start multiple R processes (here we divide the work between 3 processes):

e.g. save it as a plain text file with a .ps1 file extension; you can double-click it to run it, or schedule it with Task Scheduler/cron:

start powershell { cd C:\Users\Administrator\Desktop; Rscript extract.R 1; TIMEOUT 20000 }
start powershell { cd C:\Users\Administrator\Desktop; Rscript extract.R 2; TIMEOUT 20000 }
start powershell { cd C:\Users\Administrator\Desktop; Rscript extract.R 3; TIMEOUT 20000 }

What's it doing? It will:

  • Go to the Desktop, start a script it finds called extract.R, and provide an argument to the R script (1, 2, and 3 respectively).

The R processes

Each R process (extract.R) can look like this:

# Get the command line argument (the process number: 1, 2, or 3)
arguments <- commandArgs(trailingOnly = TRUE)
process_number <- as.numeric(arguments[1])

api_calls <- read.csv("api_calls.csv")

# Work out which API calls this process should make: every 3rd row,
# offset by the process number (e.g. process 1 handles rows 1, 4, 7, ...)
indices <- seq(process_number, nrow(api_calls), 3)

api_calls_for_this_process_only <- api_calls[indices, ] # this subsets 1/3 of the API calls
# (the other two processes will take care of the remaining calls)

# Now, make API calls as usual using rvest/jsonlite or whatever you use for that
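For completeness, a minimal sketch of that last step, reusing the RCurl/jsonlite parsing from the question; the url column and the per-process output file names are assumptions carried over from the api_calls.csv sketch above:

library(RCurl)
library(jsonlite)

# Loop over this process's share of the calls and collect the genres
results <- data.frame(film = api_calls_for_this_process_only$film, genre = NA)

for (j in seq_len(nrow(results))){
  temp <- getURL(as.character(api_calls_for_this_process_only$url[j]), ssl.verifypeer = FALSE)
  data <- fromJSON(temp, simplifyVector = FALSE)
  results$genre[j] <- paste(sapply(data$result[[1]]$output$`/film/film/genre`[[1]],
                                   function(x) as.character(x$name)), collapse = " | ")
}

# Write one output file per process (genres_1.csv, genres_2.csv, genres_3.csv),
# to be combined after all three processes have finished
write.csv(results, paste0("genres_", process_number, ".csv"), row.names = FALSE)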