1

I want to scrape all the titles of the results from a Google search.

For example, if I google 'asus', I want to scrape all the titles on the first page.

My problem is that my result is empty.

The code is as below:

library(rvest)

url <- 'https://www.google.com/search?q=asus'
first_page <- read_html(url)
title <- html_nodes(first_page, 'h3.LC20lb.DKV0Md') %>% html_text()

I use 'h3.LC20lb.DKV0Md' because that is the selector I see when I inspect the page source in the browser [screenshot of the inspected page source].

Luke
  • Try the steps in [this tutorial](https://www.analyticsvidhya.com/blog/2017/03/beginners-guide-on-web-scraping-in-r-using-rvest-with-hands-on-knowledge/) – Rohit Mar 02 '20 at 09:20
  • Thanks, Rohit. I will study it! – Luke Mar 02 '20 at 15:49

1 Answer

3

The problem is that the class names in Google's search results are not constant, so you need to select by tag names instead of class names. I find this easier with XPath than with CSS selectors:

library(tidyverse)
library(rvest)

url = 'https://www.google.com/search?q=asus'
first_page <- read_html(url)
titles <- html_nodes(first_page, xpath = "//div/div/div/a/div[not(div)]") %>% 
          html_text()
titles <- titles[titles != ">"]
titles <- titles[titles != "View all"]
titles <- titles[nzchar(titles)]

df <- tibble(title  = titles[1:(length(titles)/2) * 2 - 1],
             url    = titles[1:(length(titles)/2) * 2])
df
#> # A tibble: 7 x 2
#>   title                                url                                      
#>   <chr>                                <chr>                                    
#> 1 ASUS United Kingdom                  https://www.asus.com › ...               
#> 2 Asus - Wikipedia                     https://en.wikipedia.org › wiki › Asus   
#> 3 Asus Store: Computers & Accessories~ https://www.amazon.co.uk › Asus-Computer~
#> 4 ASUS - Amazon.co.uk                  https://www.amazon.co.uk › stores › ASUS~
#> 5 ASUS RMA                             https://rma.asus-europe.eu               
#> 6 ASUS Subreddit                       https://www.reddit.com › ASUS            
#> 7 ASUS Deals | Laptops Direct          https://www.laptopsdirect.co.uk › asus

Created on 2020-03-02 by the reprex package (v0.3.0)
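
The alternating title/URL split above assumes that every result contributes exactly two text nodes, which may not hold for every query. As a rough sketch of a different technique (not the method used above), the title and the link can be read from the same anchor node so that they cannot drift out of step. The HTML below is a made-up stand-in, not Google's real markup, and the `//a[h3]` structure is an assumption that would need checking against the page `read_html()` actually receives:

```r
library(rvest)

# Hypothetical stand-in for a results page; the markup Google really
# serves to read_html() differs from this and changes over time.
html <- paste0(
  '<html><body>',
  '<div><a href="https://www.asus.com/"><h3 class="x1">ASUS United Kingdom</h3></a></div>',
  '<div><a href="https://en.wikipedia.org/wiki/Asus"><h3 class="x2">Asus - Wikipedia</h3></a></div>',
  '<div><a href="https://example.com/no-title">a link without a heading</a></div>',
  '</body></html>'
)
page <- read_html(html)

# A selector tied to class names copied from the browser matches nothing here.
by_class <- html_nodes(page, "h3.LC20lb.DKV0Md") %>% html_text()
length(by_class)  # 0

# Keep only links that contain an <h3>, then read the title and the URL
# from the same node, one row per link.
links <- html_nodes(page, xpath = "//a[h3]")
df <- data.frame(
  title = links %>% html_node("h3") %>% html_text(),
  url   = html_attr(links, "href"),
  stringsAsFactors = FALSE
)
df
```

Because each row is built from a single `<a>` node, a result with extra text lines does not shift later rows. The `href` values Google returns may be redirect links rather than the final address, so they may need further cleaning.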

Allan Cameron
  • Thanks, Allan! The solution seems to work. However, when I tried it in RStudio, I found that the url column may contain titles instead of URLs. Part of the result I got is below ([7] to [9] are titles rather than URLs): > x = df$url > x [6] "https://www.lenovo.com › Home › Laptops" [7] "New Lenovo Ideapad S145-15IWL 81MV000LUS 15.6 HD Intel I3 ..." [8] "81MV000LUS - Lenovo Ideapad S145-15iwl 81mv000lus Notebook ..." [9] "Lenovo Ideapad S145-15iwl 81mv000lus Notebook - 81MV000LUS" – Luke Mar 02 '20 at 15:53
  • @AllanCameron Hi, your answer was very useful, but I am not able to modify your solution for a similar problem. Could you please help me? https://stackoverflow.com/questions/73178126/how-to-retrieve-titles-from-google-search-using-rvest – user007 Jul 30 '22 at 18:27