character (0) after scraping webpage in read_html

Question

I'm trying to scrape "1,335,000" from the screenshot below (the number is at the bottom of the screenshot). I wrote the following code in R.

library(rvest)  # provides read_html() and the %>% pipe

t2 <- read_html("https://fortune.com/company/amazon-com/fortune500/")
employee_number <- t2 %>% 
  rvest::html_nodes('body') %>% 
  xml2::xml_find_all("//*[contains(@class, 'info__value--2AHH7')]") %>% 
  rvest::html_text()

However, when I call "employee_number", it gives me "character(0)". Can anyone help me figure out why?

The first line of the body says this: "". It looks like you will need to use RSelenium. — Dave2e, Dec 04 '21 at 15:53
Thanks, @Dave2e, could you please elaborate on why Rselenium can solve the problem? — Xian Zhao, Dec 04 '21 at 15:58
@XianZhao: The page doesn't include that div until after the JavaScript runs. RSelenium runs the JavaScript (it drives a complete browser), whereas rvest just downloads the initial page. — user2554330, Dec 04 '21 at 17:17

score 1 · Answer 1 · answered Dec 04 '21 at 16:06

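As Dave2e pointed out, the page builds this content with JavaScript, so rvest on its own cannot see it. You can let RSelenium render the page first and then parse the resulting source with rvest: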

library(RSelenium)
library(rvest)  # read_html(), html_nodes(), html_text() and the %>% pipe

url <- "https://fortune.com/company/amazon-com/fortune500/"

# Launch a browser session and load the page
driver <- rsDriver(browser = "firefox")
remDr <- driver[["client"]]
remDr$navigate(url)

# Parse the rendered page source (after the JavaScript has run)
remDr$getPageSource()[[1]] %>% 
  read_html() %>% 
  html_nodes(xpath = '//*[@id="content"]/div[5]/div[1]/div[1]/div[12]/div[2]') %>% 
  html_text()
# [1] "1,335,000"
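
To see the cause of the empty result without opening a browser, here is a minimal offline sketch. The HTML string is a made-up stand-in for what the server sends before any JavaScript runs (the variable name `__PRELOADED_STATE__`, the `noscript` text and the value are assumptions for illustration, not copied from the real page). It shows that a class-based selector finds nothing, even though the number is present in the response.

```r
library(rvest)

# Hypothetical stand-in for the un-rendered HTML returned by the server.
raw_html <- '<html><body>
  <noscript>Please enable JavaScript</noscript>
  <div id="content"></div>
  <script id="preload">window.__PRELOADED_STATE__ = {"employees":"1335000"};</script>
</body></html>'

page <- read_html(raw_html)

# The element a browser would display has not been created yet,
# so the selection is empty and html_text() returns character(0).
rendered <- page %>% 
  html_elements(xpath = "//*[contains(@class, 'info__value')]") %>% 
  html_text()
length(rendered)

# The value is nevertheless in the response, inside the script tag.
script_text <- page %>% html_element("#preload") %>% html_text()
in_script <- grepl('"employees"', script_text, fixed = TRUE)
in_script
```

If a similar check on the real response shows the value only inside a script tag, the choice is between rendering the page (as above) and reading the embedded data directly.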

score 1 · Answer 2 · answered Dec 04 '21 at 18:44

The data is loaded dynamically from a script tag that is already present in the initial response, so there is no need for the expense of a browser. You can either extract the entire JavaScript object from the script, parse it as JSON with jsonlite, and pull out what you want, or, if you only need the employee count, extract it from the response text with a regex.

library(rvest)
library(stringr)
library(magrittr)
library(jsonlite)

page <- read_html('https://fortune.com/company/amazon-com/fortune500/')

# Extract the JavaScript object from the script tag and parse it as JSON
data <- page %>% 
  html_element('#preload') %>% 
  html_text() %>% 
  stringr::str_match("PRELOADED_STATE__ = (.*);") %>% 
  .[, 2] %>% 
  jsonlite::parse_json()

print(data$components$page$`/company/amazon-com/fortune500/`[[6]]$children[[4]]$children[[3]]$config$employees)

# Shorter version: pull the employee count straight out of the page text with a regex
print(page %>% html_text() %>% stringr::str_match('"employees":"(\\d+)?"') %>% 
  .[, 2] %>% as.integer() %>% format(big.mark = ","))
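
Both approaches in this answer can be tried offline on a small hand-written string, which may help when adapting the patterns. This is only a sketch: `script_text` and its nesting (`company$employees`) are invented for illustration, and the real page's object is much larger and nested differently. It uses base R regex functions plus jsonlite.

```r
library(jsonlite)

# Hypothetical script-tag text, not the real page content.
script_text <- 'window.__PRELOADED_STATE__ = {"company":{"name":"Example Corp","employees":"1335000"}};'

# Approach 1: capture the object literal after the assignment and parse it as JSON.
json_text <- sub("^.*?PRELOADED_STATE__ = (.*);\\s*$", "\\1", script_text, perl = TRUE)
state <- parse_json(json_text)
employees_json <- state$company$employees

# Approach 2: regex the single field out of the raw text.
# m[1] is the full match, m[2] is the captured run of digits.
m <- regmatches(script_text, regexec('"employees":"([0-9]+)"', script_text))[[1]]
employees_regex <- format(as.integer(m[2]), big.mark = ",")

print(employees_json)
print(employees_regex)
```

The JSON route gives access to every field in the object but depends on knowing the path to the value. The regex route is shorter but silently yields NA if the field name or quoting changes.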
