I'm trying to scrape IMDb with rvest, and I often run into a problem with the language of the output, probably because I'm located in Japan.
For example, when trying to scrape the movie titles from this page:
https://www.imdb.com/chart/top/?ref_=nv_mv_250
with the following code:
library(rvest)
library(tidyverse)
url <- "https://www.imdb.com/chart/top/?ref_=nv_mv_250"
read_html(url) %>%
  html_nodes(".titleColumn a") %>%
  html_text() %>%
  tibble(title = .) %>%
  head()
The result is a mixture of English titles and romanized Japanese titles:
title
<chr>
1 Shôshanku no sora ni
2 Goddofâzâ
3 The Godfather: Part II
4 Dâku naito
5 12 Angry Men
6 Schindler's List
This happens even though all the titles appear in English, both on the rendered page and when I inspect the elements with Chrome's developer tools.
I suspect the issue is similar to the one posted on SO here, which concerns scraping with PHP.
Is there a way to request English output, preferably in a tidyverse-friendly pipe chain?
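Something along these lines is what I have in mind. It is only a rough sketch, and it assumes IMDb picks the title language from the request's Accept-Language header (as in the PHP case), which I have not confirmed. The idea is to send that header explicitly with httr and hand the response to read_html(). The names english_only and scrape_titles are my own placeholders, and the selector is the same one as above.

```r
library(rvest)   # read_html(), html_nodes(), html_text(), and the %>% pipe
library(httr)    # GET(), add_headers()
library(tibble)  # tibble()

url <- "https://www.imdb.com/chart/top/?ref_=nv_mv_250"

# Assumption: the server chooses the title language from this header.
english_only <- add_headers("Accept-Language" = "en-US,en;q=0.9")

# Hypothetical helper: fetch the page with the header set, then extract titles.
scrape_titles <- function(url, config = english_only) {
  GET(url, config) %>%                 # request with the explicit language header
    read_html() %>%                    # parse the httr response body as HTML
    html_nodes(".titleColumn a") %>%   # same selector as in the question
    html_text() %>%
    tibble(title = .)
}

# Usage (requires network access):
# scrape_titles(url) %>% head()
```

If the header is not what decides the language (for example, if it is based on IP geolocation), this would not change anything, so I would treat it as a first thing to try rather than a confirmed fix.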