I am trying to scrape the gross and budget values from IMDB.com using the rvest package, but I can't get it to work. My code is:

library(rvest)
movie <- html("http://www.imdb.com/title/tt1490017/")
movie %>%
  html_node("#budget .itemprop") %>%
  html_text() %>%
  as.numeric()

and I get

numeric(0)
Sam Firke
thchar
  • @KevinDTimm What exactly is the issue here? The OP made a decent attempt to solve the problem by searching for a suitable package and writing code. That is way better than 99.999999% of the recent questions on SO. I fail to understand why this is getting downvoted. – David Arenburg Oct 05 '15 at 20:02
  • What version of rvest are you using? (Use `sessionInfo()` in R to find out.) It would be helpful to see the output of the intermediate steps and not just the final output. When I run the same code using the latest version of `rvest` from CRAN, I get a different error: `Error: no matches` – Stefan Avey Oct 05 '15 at 20:12
  • Thanks for your response. You may be right. My version of rvest is 0.2.0. – thchar Oct 07 '15 at 17:11

2 Answers

You can get the budget value like this:

library(tidyr) # for extract_numeric
library(rvest)
movie <- read_html("http://www.imdb.com/title/tt1490017/")
movie %>%
  html_nodes("#titleDetails :nth-child(11)") %>%
  html_text() %>%
  extract_numeric()

[1] 6e+07

Your example looks similar to an example in the rvest package vignette. That vignette suggests you use SelectorGadget, which I used to find the CSS selector that would return only the Budget element. To see that element, run all but the last piped line of this series, and you'll see why I chose to parse it with extract_numeric from tidyr.

You'll need the latest version of rvest to run this as I'm using the read_html() function, which has replaced html() used in your example.
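
If you're not sure which of the two readers your installed rvest provides, a check along these lines may help. This is only a sketch: `read_page` is a hypothetical helper name, and the inline HTML string is a made-up stand-in for the real page so the snippet runs without a network connection.

```r
library(rvest)

# Report the installed version (the asker reported 0.2.0).
print(packageVersion("rvest"))

# Use read_html() when this rvest exports it; otherwise fall back to the
# older html(). `read_page` is a hypothetical name, not part of rvest.
has_read_html <- "read_html" %in% getNamespaceExports("rvest")
read_page <- if (has_read_html) read_html else html

# Made-up HTML standing in for the IMDb page.
doc <- read_page("<div id='demo'>Budget: $60,000,000</div>")
demo_text <- html_text(html_node(doc, "#demo"))
print(demo_text)
```

Whichever reader gets picked, the rest of the pipeline (`html_node()`, `html_text()`) stays the same.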

Sam Firke

Sam Firke provided a very neat solution. I'm posting mine just to show an alternative way to extract the numeric value. Like Sam Firke, I used SelectorGadget. The html function seems to work fine. Instead of tidyr (I didn't know it had that handy function), I used gsub:

library(rvest)    
movie <- html("http://www.imdb.com/title/tt1490017/") 
movie %>% 
  html_node(".txt-block:nth-child(11)") %>%
  html_text() %>% 
  gsub("\\D", "", .) %>% 
  as.numeric()

Output:

[1] 6e+07
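
To see what the gsub step does on its own, here is a small base R sketch. The strings are hypothetical, made up to resemble what html_text() might return, not copied from the page.

```r
# Hypothetical text standing in for the output of html_text().
raw <- "\n    Budget: $60,000,000\n    (estimated)\n  "

# "\\D" matches every non-digit character, so only the digits survive.
digits <- gsub("\\D", "", raw)   # "60000000"
budget <- as.numeric(digits)
print(budget)

# Caveat: ALL digits in the string are kept and glued together, so any
# other number in the same element (a year, say) corrupts the result.
mixed <- gsub("\\D", "", "Gross: $1,000 (2015)")   # "10002015"
print(mixed)
```

So this approach is only safe when the selected element contains a single number.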
mpalanco
  • Thanks for your response, it's very helpful. Here is the code I use to scrape the gross: `movie %>% html_node("#titleDetails .txt-block:nth-child(13)") %>% html_text() %>% gsub("\\D", "", .) %>% as.numeric()` – thchar Oct 07 '15 at 17:14
  • @thchar Thank you very much. It works nicely. Are you building a database? You probably already know this, but there is another R package, `XML`, that you might find useful for scraping data. – mpalanco Oct 07 '15 at 17:42
  • I'm working on a dataset in order to predict the quality of movies using classification methods. Unfortunately, I'm not getting the desired result because IMDB doesn't have the same structure on every page. Maybe I should try the XML library. Do you have any ideas? – thchar Oct 07 '15 at 18:54
  • @thchar Some ideas: [IMDb plain text data files](http://www.imdb.com/interfaces). You can also check some questions on Stack Overflow that may point you in the right direction, for instance [here](http://stackoverflow.com/questions/1966503/does-imdb-provide-an-api). Additionally, there are some blogs that deal with similar problems, like [this](https://statofmind.wordpress.com/2014/05/27/using-sentiment-analysis-to-predict-ratings-of-popular-tv-series/). I know that Python has a specific package to retrieve information from IMDb: [imdbpy](http://imdbpy.sourceforge.net/). – mpalanco Oct 07 '15 at 20:21
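
Regarding pages that don't share the same structure: one possible workaround is to pick the block by its label text instead of its position (`nth-child`). The sketch below assumes each `.txt-block` inside `#titleDetails` contains its own label such as "Budget:". The inline HTML is invented for illustration (the gross figure in particular is a dummy value), and `get_amount` is a hypothetical helper, so treat it as a starting point rather than a tested scraper.

```r
library(rvest)

# Invented markup standing in for the details section of a movie page.
page <- read_html('
<div id="titleDetails">
  <div class="txt-block"><h4>Country:</h4> USA</div>
  <div class="txt-block"><h4>Budget:</h4> $60,000,000 (estimated)</div>
  <div class="txt-block"><h4>Gross:</h4> $123,456,789 (USA)</div>
</div>')

# Text of every details block, in page order.
blocks <- page %>%
  html_nodes("#titleDetails .txt-block") %>%
  html_text()

# Hypothetical helper: find the first block containing `label` and strip
# the non-digits from it; returns NA when the page has no such block.
get_amount <- function(blocks, label) {
  hit <- blocks[grepl(label, blocks, fixed = TRUE)]
  if (length(hit) == 0) return(NA_real_)
  as.numeric(gsub("\\D", "", hit[1]))
}

budget  <- get_amount(blocks, "Budget:")
gross   <- get_amount(blocks, "Gross:")
missing <- get_amount(blocks, "Opening Weekend:")
```

Because the lookup no longer depends on the block being the 11th or 13th child, a page with extra or missing blocks gives either the right value or `NA` instead of a number from the wrong field.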