This is a follow-up to this question:

How to retrieve titles from google search using rvest

This time I am trying to get the text below the titles in a Google search (circled in red):

[screenshot: Google search results, with the text below each title circled in red]

Due to my lack of knowledge in web design I do not know how to formulate the xpath to extract the text below titles.

The answer by @AllanCameron is very useful but I do not know how to modify it:

library(rvest)
library(tidyverse)

# Search URL
url <- 'https://www.google.com/search?q=Mario+Torres+Mexico'
# Read the first results page and extract the result titles
first_page <- read_html(url)
titles <- html_nodes(first_page, xpath = "//div/div/div/a/h3") %>%
  html_text()

Many thanks for your help!

user007

2 Answers

This can all be done without Selenium, using rvest. Unfortunately, Google works differently in different locales, so for example in my locale there is a consent page that has to be navigated before I can even send a request to Google.

It seems this is not required in the OP's locale, but for those of you in the UK, you might need to run the following code first for the rest to work:

library(rvest)
library(tidyverse)

url <- 'https://www.google.com/search?q=Mario+Torres+Mexico'

google_handle <- httr::handle('https://www.google.com')
httr::GET('https://www.google.com', handle = google_handle)
httr::POST(paste0('https://consent.google.com/save?continue=',
                  'https://www.google.com/',
                  '&gl=GB&m=0&pc=shp&x=5&src=2',
                  '&hl=en&bl=gws_20220801-0_RC1&uxe=eomtse&',
                  'set_eom=false&set_aps=true&set_sc=true'), 
           handle = google_handle)
url <- httr::GET(url, handle = google_handle)

For the OP and those without a Google consent page, the setup is simply:

library(rvest)
library(tidyverse)

url <- 'https://www.google.com/search?q=Mario+Torres+Mexico'

Next we define the xpaths we are going to use to extract the title (as in the previous Q&A) and the text below the title (pertinent to this question):

title <- "//div/div/div/a/h3"
text  <- paste0(title, "/parent::a/parent::div/following-sibling::div")
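
To see what the second xpath does, here is a minimal sketch that applies both expressions to a toy document. The HTML below is a hypothetical, heavily simplified stand-in for one search result (the `id` attributes and content are made up for illustration); real Google markup is more deeply nested and may change at any time.

```r
library(rvest)

# Hypothetical, simplified markup for a single search result.
# The ids are only labels for this illustration.
toy_html <- '
<html><body>
  <div id="outer">
    <div id="result">
      <div id="title-box">
        <a href="https://example.com"><h3>Example title</h3></a>
      </div>
      <div id="snippet">Example text below the title.</div>
    </div>
  </div>
</body></html>'

page <- read_html(toy_html)

# Same expressions as above, under different names to avoid clashes
title_xpath <- "//div/div/div/a/h3"
text_xpath  <- paste0(title_xpath,
                      "/parent::a/parent::div/following-sibling::div")

# h3 -> up to its parent <a> -> up to the enclosing <div> ->
# across to the <div> siblings that follow it
toy_titles <- page %>% html_nodes(xpath = title_xpath) %>% html_text()
toy_text   <- page %>% html_nodes(xpath = text_xpath) %>% html_text()
```

In this toy case `toy_titles` holds the heading and `toy_text` holds the contents of the sibling `div`, which is the same relationship the xpath relies on in the real page.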

Now we can just apply these xpaths to get the correct nodes and extract the text from them:

first_page <- read_html(url)

tibble(title = first_page %>% html_nodes(xpath = title) %>% html_text(),
       text = first_page %>% html_nodes(xpath = text) %>% html_text())
#> # A tibble: 9 x 2
#>   title                                text                                    
#>   <chr>                                <chr>                                   
#> 1 "Mario García Torres - Wikipedia"    "Mario García Torres (born 1975 in Monc~
#> 2 "Mario Torres (@mario_torres25) • I~ "Mario Torres. Oaxaca, México. Luz y co~
#> 3 "Mario Lopez Torres - A Furniture A~ "The Mario Lopez Torres boutiques are a~
#> 4 "Mario Torres - Player profile | Tr~ "Mario Torres. Unknown since: -. Mario ~
#> 5 "Mario García Torres | The Guggenhe~ "Mario García Torres was born in 1975 i~
#> 6 "Mario Torres - Founder - InfOhana ~ "Ve el perfil de Mario Torres en Linked~
#> 7 "3500+ \"Mario Torres\" profiles - ~ "View the profiles of professionals nam~
#> 8 "Mario Torres Lopez - 33 For Sale o~ "H 69 in. Dm 20.5 in. 1970s Tropical Vi~
#> 9 "Mario Lopez Torres's Woven River G~ "28 Jun 2022 · From grass harvesting to~
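
One caveat: building the two columns separately only works if every title has exactly one matching text node; otherwise the vectors differ in length. A possible way around this, sketched here on hypothetical toy markup rather than on real Google output, is to look up the text relative to each title with `html_node()` (singular), which returns one result per input node and a missing value where nothing matches.

```r
library(rvest)
library(tibble)

# Hypothetical markup: two results, the second one has no text below it
toy_html <- '
<html><body>
  <div><div>
    <div><a href="#"><h3>First title</h3></a></div>
    <div>First snippet</div>
  </div></div>
  <div><div>
    <div><a href="#"><h3>Second title</h3></a></div>
  </div></div>
</body></html>'

page <- read_html(toy_html)

title_nodes <- html_nodes(page, xpath = "//div/div/div/a/h3")

# Relative xpath, evaluated once per title node; [1] keeps only the
# first following sibling div
snippet_nodes <- html_node(
  title_nodes,
  xpath = "parent::a/parent::div/following-sibling::div[1]"
)

# Titles and text stay aligned row by row; a missing snippet becomes NA
paired <- tibble(title = html_text(title_nodes),
                 text  = html_text(snippet_nodes))
```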
Allan Cameron
  • Could you please explain how you figured out the second xpath for `text`? And is there a reason the subtext doesn't show up in the regular source code, but works when using `httr::GET` and this method? – dcsuka Aug 05 '22 at 14:41
  • @AllanCameron Fantastic as always. If you could give us a tip on how to write/find the xpath, that would be amazing. When I inspect the element I do not find the xpath as you show it, so I suspect there is a trick to it! Many thanks! – user007 Aug 05 '22 at 14:45
  • @dcsuka for these problems I generally look at the raw html as text, before it is parsed by `read_html`, and not in the browser, where it can be changed by javascript. Whether this is via `httr::GET` or any other method doesn't matter - it's the same html. I looked at the source code here and found that the desired text was inside the next div after the one containing the title. So the xpath I wrote navigates to the title, navigates back up to its parent div, then selects the next div along. – Allan Cameron Aug 05 '22 at 15:04
  • @AllanCameron Thanks Sir, always great solutions ! Bounty awarded! – Duck Aug 12 '22 at 12:01
  • @AllanCameron Hi. I have posted a similar question. If you could help me it would be great! Many thanks! https://stackoverflow.com/questions/73806861/how-to-retrieve-hyperlinks-in-google-search-using-rvest – user007 Sep 21 '22 at 20:57
  • @AllanCameron Hi sir, sorry for taking up your time. Months ago the solution you helped with worked fantastically, but today I tested it again and it is not returning any data. If you have any time, could you please check this question? Many thanks! https://stackoverflow.com/questions/74735642/issues-retrieving-links-from-google-search-using-rvest – user007 Dec 08 '22 at 22:39
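
The tip from the comments above (read the raw HTML as text and look at what surrounds the text you want) can be sketched in base R. `show_context()` is a hypothetical helper written for this illustration, and the HTML string is a made-up stand-in; for a real page the raw text could come from something like `httr::content(resp, as = "text")` on the response object.

```r
# Hypothetical helper: return the raw markup around a known piece of text,
# so the enclosing tags can be read off and turned into an xpath
show_context <- function(raw_html, needle, width = 60) {
  pos <- regexpr(needle, raw_html, fixed = TRUE)
  if (pos == -1) return(NA_character_)  # needle not present in the raw HTML
  substr(raw_html, max(1, pos - width), pos + nchar(needle) + width - 1)
}

# Made-up stand-in for the unparsed page source
raw_html <- paste0(
  '<div><div><div><a href="#"><h3>Example title</h3></a></div>',
  '<div>Example text below the title.</div></div></div>'
)

# Markup immediately before and after the text of interest
context <- show_context(raw_html, "Example text", width = 25)

# Text that only appears after JavaScript runs would not be found here
missing_context <- show_context(raw_html, "not in the page")
```

Reading `context` shows that the text sits in a `div` that directly follows the `div` wrapping the title link, which is the structure the xpath in the answer walks through.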
The subtext you refer to appears to be rendered in JavaScript, which makes it difficult to access with conventional read_html() methods.

Here is a script using RSelenium that retrieves the required results. You can also click the next-page element if you want to get more results.

library(wdman)
library(RSelenium)
library(tidyverse)

selServ <- selenium(
  port = 4446L,
  version = 'latest',
  chromever = '103.0.5060.134' # set to a version available on your machine
)

remDr <- remoteDriver(
  remoteServerAddr = 'localhost',
  port = 4446L,
  browserName = 'chrome'
)

remDr$open()
remDr$navigate("insert URL here")

text_elements <- remDr$findElements("xpath", "//div/div/div/div[2]/div/span")

sapply(text_elements, function(x) x$getElementText()) %>%
  unlist() %>%
  as_tibble_col("results") %>%
  filter(str_length(results) > 15)

# # A tibble: 6 × 1
#   results                                                                                                                        
#   <chr>                                                                                                                          
# 1 "The meaning of HI is —used especially as a greeting. How to use hi in a sentence."                                            
# 2 "Hi definition, (used as an exclamation of greeting); hello! See more."                                                        
# 3 "A friendly, informal, casual greeting said upon someone's arrival. quotations ▽synonyms △. Synonyms: hello, greetings, howdy.…
# 4 "Hi is defined as a standard greeting and is short for \"hello.\" An example of \"hi\" is what you say when you see someone. i…
# 5 "hi synonyms include: hola, hello, howdy, greetings, cheerio, whats crack-a-lackin, yo, how do you do, good morrow, guten tag,…
# 6 "Meaning of hi in English ... used as an informal greeting, usually to people who you know: Hi, there! Hi, how are you doing? …

dcsuka
  • Very, very useful. I will upvote. Let's wait a bit to see if somebody can provide another method; otherwise I will accept your answer! – user007 Aug 03 '22 at 14:31
  • Of course. I would also be very interested to see a method of scraping this which doesn't use anything heavier than `rvest`. – dcsuka Aug 03 '22 at 16:10