
I'm trying to scrape IMDb with rvest, and I often run into a problem with the language of the output, probably because I am located in Japan.

For example, when trying to scrape the movie titles from this page:

https://www.imdb.com/chart/top/?ref_=nv_mv_250

with the following code:

    library(rvest)
    library(tidyverse)
    url <- "https://www.imdb.com/chart/top/?ref_=nv_mv_250"

    read_html(url) %>%
        html_nodes(".titleColumn a") %>%
        html_text() %>%
        tibble(title = .) %>%
        head()

The result is a mixture of English titles and romanized Japanese titles:

  title                 
  <chr>                 
1 Shôshanku no sora ni  
2 Goddofâzâ             
3 The Godfather: Part II
4 Dâku naito            
5 12 Angry Men          
6 Schindler's List 

This happens even though the titles are all in English, both on my screen and when I inspect the elements with Chrome's developer tools.

I guess the issue is similar to the one posted on SO here with reference to scraping using PHP.

Is there a way to request English output, preferably in a tidyverse friendly pipe chain?

awaji98

2 Answers


Try:

    library(rvest)
    library(tidyverse) 
    library(httr) 

    GET(url = 'https://www.imdb.com/chart/top/?ref_=nv_mv_250'
                  , add_headers(.headers = c('User-Agent' = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.45 Safari/537.36'
                                        , 'Accept-Language' = 'en-US,en;q=0.9'))) %>% 
          read_html() %>% 
          html_nodes(".titleColumn a") %>% 
          html_text() %>% 
          tibble(title = .) %>% 
          head()
    # A tibble: 6 x 1
      title                   
      <chr>                   
    1 The Shawshank Redemption
    2 The Godfather           
    3 The Godfather: Part II  
    4 The Dark Knight         
    5 12 Angry Men            
    6 Schindler's List
Nad Pat
  • Thanks for the answer! Unfortunately it doesn't work for me, although I can see why it should. I still end up with the same result - and the only solution I can find is using a VPN. – awaji98 Dec 07 '21 at 03:57
  • As WHardy pointed out, this works if you switch 'accept_language' to 'Accept-Language'. – awaji98 Jan 27 '22 at 14:22
  • Modified the answer as per WHardy – Nad Pat Jan 27 '22 at 18:05

Just had a similar issue to solve (but with Polish language). I took a look at the suggestions by Nad Pat and at the link you've provided about doing this in PHP.

It seems Nad Pat almost got it right, but it should be "Accept-Language" instead of "accept_language". This change did the trick for me!
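
For a pipe-friendly version, here is a minimal sketch that wraps the corrected header in a small helper. `read_html_en` is a hypothetical name of my own, not a function from httr or rvest, and the default `lang` value is just the one used in the answer above. The last two lines only inspect the header name httr will send; they make no network request.

```r
library(httr)
library(rvest)

# Hypothetical helper (not part of httr or rvest): request a page with an
# explicit Accept-Language header and parse the response as HTML.
read_html_en <- function(url, lang = "en-US,en;q=0.9") {
  resp <- GET(url, add_headers(`Accept-Language` = lang))
  stop_for_status(resp)  # stop on HTTP errors instead of parsing an error page
  read_html(resp)
}

# Offline check of the header name that will be sent (hyphen, not underscore).
cfg <- add_headers(`Accept-Language` = "en-US,en;q=0.9")
names(cfg$headers)
```

Assuming the page still uses the same selectors, `read_html_en(url)` should then be able to replace `read_html(url)` at the start of the chain in the question, followed by the same `html_nodes(".titleColumn a") %>% html_text()` steps.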

WHardy