
I have extracted the reviews of a movie from IMDb, but the individual reviews are separated by a lot of blank lines. The result is unstructured and very difficult to read. I need to apply certain functions to each review separately, and then store all of them together as one text for some text mining with other functions.

How can I clean them up so that I can access them one at a time, and how can I combine them and store them together?

Here is my code for scraping the reviews:

library(rvest)

ID <- 1490017
URL <- paste0("http://www.imdb.com/title/", ID, "/reviews?filter=prolific")
MOVIE_URL <- read_html(URL)
ex_review <- MOVIE_URL %>%
  html_nodes("p") %>%
  html_text()
Gregor Thomas
humble_me

1 Answer

I would suggest being more specific when you navigate the DOM. For instance, this code returns only the reviews, and none of the other page content that you presumably do not want to scrape:

library(rvest)

ID <- 1490017
URL <- paste0("http://www.imdb.com/title/tt", ID, "/reviews?filter=prolific")
MOVIE_URL <- read_html(URL)
ex_review <- MOVIE_URL %>%
  html_nodes("#pagecontent") %>%
  html_nodes("div+ p") %>%
  html_text()
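
To see what the narrower selector does without hitting the site, here is a minimal sketch on a made-up HTML fragment. The markup is an assumption for illustration only and the real page may be structured differently; the point is that `div+ p` matches only a paragraph that immediately follows a `div`.

```r
library(rvest)

# Made-up HTML standing in for the review page (assumption: the real markup may differ)
page <- read_html('<div id="pagecontent">
  <div>Reviewer header</div>
  <p>First review text.</p>
  <p>Unrelated paragraph.</p>
  <div>Reviewer header</div>
  <p>Second review text.</p>
</div>')

# Only <p> elements that directly follow a <div> inside #pagecontent are kept
reviews <- page %>%
  html_nodes("#pagecontent") %>%
  html_nodes("div+ p") %>%
  html_text()

print(reviews)
```

The "Unrelated paragraph." element is skipped because it follows another `p` rather than a `div`.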

Here is a way to remove line breaks, apply a function to each review, and merge all reviews into one paragraph (also see this post on concatenating vector elements and this post on replacing line breaks):

ex_review <- gsub("[\r\n]+", " ", ex_review) # replace each run of line breaks with a single space
sapply(ex_review, function(x){}) # apply a function to each review (fill in the function body)
ex_review <- paste(ex_review, collapse = " ") # concatenate reviews into one paragraph, separated by spaces
write(ex_review, "test.txt") # save the combined text to a file
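
If stray whitespace still remains after this, one possible cause is that leftover spaces, tabs, or empty entries survive the replacement. The following is a minimal base R sketch of a stricter cleanup, run on a made-up sample vector that stands in for the scraped `ex_review`. The names `clean_review`, `word_counts`, and `all_reviews` are hypothetical, and the word count is only a placeholder for whatever per-review function you need.

```r
# Made-up sample standing in for the scraped `ex_review` vector
ex_review <- c("  Great film.\r\n\r\n\r\nLoved the cast.  ",
               "\n\n",
               "Too long,\n\tbut worth it.\n")

# Collapse every run of whitespace (spaces, tabs, line breaks) into one space
clean_review <- gsub("\\s+", " ", ex_review)
# Strip leading and trailing whitespace
clean_review <- trimws(clean_review)
# Drop entries that are empty after cleaning
clean_review <- clean_review[nzchar(clean_review)]

# Apply a function to each review separately (placeholder: count words)
word_counts <- vapply(
  clean_review,
  function(x) length(strsplit(x, " ", fixed = TRUE)[[1]]),
  integer(1),
  USE.NAMES = FALSE
)

# Combine all reviews into a single string for the overall analysis
all_reviews <- paste(clean_review, collapse = " ")

print(clean_review)
print(word_counts)
print(all_reviews)
```

Individual reviews stay accessible as `clean_review[1]`, `clean_review[2]`, and so on, while `all_reviews` holds the combined text that can be passed to `write()`.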

I think you were also missing a "tt" in the URL.

motorrrr
  • This greatly improved the extraction. Thanks a lot for the answer. However, my main problem is processing the reviews I have extracted, which I am unable to do: that is, removing the multiple blank lines between reviews and combining the text into one big paragraph of all the reviews, since I need to do an overall analysis as well. – humble_me Jul 08 '16 at 19:09
  • The line breaks are not getting removed using this method. Other things work fine :) – humble_me Jul 13 '16 at 20:32
  • Is there any other method you could suggest for this? – humble_me Jul 13 '16 at 20:32