I am searching Google for "Smart Factory" and scraping a large number of result pages. Google's URL uses a `start` offset (0-90) rather than a page number (1-10). However, my code does not read the contents of each page: it returns the first page every time, so the output is duplicated.

My code:

library(rvest)
library(KoNLP)

title <- lapply(paste0('https://www.google.co.kr/search?q=smart+factory&ei=MNEnWZfgJoPw0AS7-aYY&sa=N&biw=1011&bih=677&bav=on.2,or.r_cp.#safe=active&q=smartfactory&start=', 0:90),
          function(url){

            url %>% read_html() %>% 
              html_nodes(".r") %>% 
              html_text()

          })

title

Also, when the output contains Korean text, the characters come out garbled.

Why is this happening?
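
A possible explanation for the duplicates, offered as a sketch rather than a confirmed diagnosis: in the URL above, `&start=` comes after the `#`, so it is part of the URL fragment, and fragments are not sent to the server in an HTTP request. If that is what is happening, every request asks for the same first page. Separately, `0:90` produces 91 offsets (0, 1, 2, ...) instead of one per page. The snippet below builds the URLs with `start` in the query string and a step of 10. The shortened URL is an assumption: it drops the `ei`, `sa`, `biw`, `bih`, and `bav` parameters on the guess that they are not required, and it assumes 10 results per page.

```r
# Offsets 0, 10, ..., 90: one per result page (assumes 10 results per page).
starts <- seq(0, 90, by = 10)

# Keep every parameter in the query string, before any '#'.
# Anything after '#' is a URL fragment and is not sent in the HTTP request.
# The shortened URL below is an assumption, not the original one.
base_url <- "https://www.google.co.kr/search?q=smart+factory&safe=active&start="
urls <- paste0(base_url, starts)

length(urls)  # 10 URLs, versus the 91 produced by 0:90
urls[2]       # the URL for the second page of results
```

Each element of `urls` could then go through the same `read_html() %>% html_nodes(".r") %>% html_text()` pipeline as in the code above.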

Jaap
orreo
  • related: https://stackoverflow.com/a/22703153/4132844 – scoa May 26 '17 at 08:15
  • As a side note, if you are going to be scraping a lot from Google search pages, I would add a `Sys.sleep()` call whose delay is a random number drawn from a strange distribution. I've gotten booted from Google before for making my scraping too obvious (and I really only did ~150 scrapes). EDIT: Ah, I see @scoa linked to a good discussion of this. – Mark White May 26 '17 at 13:18
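
The random-delay idea from the comment above could look roughly like this. It is a minimal sketch: `polite_pause` is a hypothetical helper name, the 2 to 6 second range is an arbitrary assumption, and a uniform draw is used for simplicity where the comment suggests something less regular.

```r
# Hypothetical helper: sleep for a random interval between requests.
# The default range of 2 to 6 seconds is an arbitrary assumption.
polite_pause <- function(min_sec = 2, max_sec = 6) {
  delay <- runif(1, min = min_sec, max = max_sec)  # uniform random delay
  Sys.sleep(delay)
  invisible(delay)  # return the delay so callers can log it if they wish
}

# Short demonstration with a tiny range so it finishes quickly.
d <- polite_pause(0, 0.01)
```

In a scraping loop it would be called once per page, for example just before each `read_html()` call.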
