I'm trying to scrape the value "1,335,000" from the page shown in the screenshot below (the number is at the bottom of the screenshot). I wrote the following code in R.

library(rvest)  # provides read_html(), html_nodes(), html_text() and the %>% pipe

t2 <- read_html("https://fortune.com/company/amazon-com/fortune500/")
employee_number <- t2 %>% 
  rvest::html_nodes('body') %>% 
  xml2::xml_find_all("//*[contains(@class, 'info__value--2AHH7')]") %>% 
  rvest::html_text()

However, when I call "employee_number", it gives me "character(0)". Can anyone help me figure out why?

enter image description here

Xian Zhao
  • The first line of the body says this: "". It looks like you will need to use RSelenium. – Dave2e Dec 04 '21 at 15:53
  • Thanks, @Dave2e, could you please elaborate on why RSelenium can solve the problem? – Xian Zhao Dec 04 '21 at 15:58
  • @XianZhao: The page doesn't include that div until after the JavaScript runs. RSelenium runs the JavaScript (it drives a complete browser), whereas rvest only downloads the initial page. – user2554330 Dec 04 '21 at 17:17

2 Answers

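As Dave2e pointed out, the page builds this element with JavaScript, so the element is not in the HTML that rvest downloads. You can render the page in a real browser with RSelenium and parse the resulting source instead.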

library(RSelenium)
library(rvest)  # read_html(), html_nodes(), html_text() and the %>% pipe

url <- "https://fortune.com/company/amazon-com/fortune500/"

# launch a browser and open the page
driver <- rsDriver(browser = "firefox")
remDr <- driver[["client"]]
remDr$navigate(url)

# parse the rendered page source and pull the value out by XPath
remDr$getPageSource()[[1]] %>% 
  read_html() %>% 
  html_nodes(xpath = '//*[@id="content"]/div[5]/div[1]/div[1]/div[12]/div[2]') %>% 
  html_text()
[1] "1,335,000"
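
To see why the original rvest call returns character(0), here is a small offline sketch. The HTML string is a made-up stand-in for a JavaScript-rendered page, not the real Fortune markup: the class name only appears inside a script that a browser would execute, and rvest never executes it.

```r
library(rvest)

# Hypothetical stand-in for a JavaScript-rendered page: the static HTML holds
# only an empty container; a browser would run the script and add the div.
static_html <- '<html><body>
  <div id="content"></div>
  <script>
    var el = document.createElement("div");
    el.className = "info__value--2AHH7";
    el.textContent = "1,335,000";
    document.getElementById("content").appendChild(el);
  </script>
</body></html>'

page <- read_html(static_html)

# The script is parsed as plain text and never run, so no element in the
# parsed document has a class attribute matching the selector.
hits <- page %>%
  html_nodes(xpath = "//*[contains(@class, 'info__value--2AHH7')]") %>%
  html_text()

print(length(hits))
```

The result is an empty character vector, which is the same symptom as in the question. A browser driven by RSelenium would run the script first, so the div would exist by the time the page source is read.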
Nad Pat

The data is loaded dynamically from a script tag that is already present in the initial response, so there is no need for the overhead of a browser. You can either extract the entire JavaScript object from that script, pass it to jsonlite to parse as JSON, and then pick out what you want, or, if you are only after the employee count, pull it out of the response text with a regex.

library(rvest)
library(stringr)
library(magrittr)
library(jsonlite)

page <- read_html('https://fortune.com/company/amazon-com/fortune500/')

data <- page %>% html_element('#preload') %>% html_text() %>% 
  stringr::str_match(. , "PRELOADED_STATE__ = (.*);") %>% .[, 2] %>% jsonlite::parse_json()

print(data$components$page$`/company/amazon-com/fortune500/`[[6]]$children[[4]]$children[[3]]$config$employees)

# shorter version: regex the employee count straight out of the response text
print(page %>% html_text() %>% stringr::str_match('"employees":"(\\d+)?"') %>% .[, 2] %>%
  as.integer() %>% format(big.mark = ","))
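
Both routes can be tried offline on a toy string. This is only a sketch: `script_text` is a hypothetical, heavily simplified stand-in for the contents of the script tag, and the real object is much larger and nests the value far deeper.

```r
library(stringr)
library(jsonlite)

# Hypothetical, simplified script contents (an assumption, not the real page).
script_text <- 'window.__PRELOADED_STATE__ = {"config":{"employees":"1335000"}};'

# Route 1: capture the object literal after the assignment and parse it as JSON.
json_text <- str_match(script_text, "PRELOADED_STATE__ = (.*);")[, 2]
state <- parse_json(json_text)
employees_json <- state$config$employees

# Route 2: skip the JSON parsing and capture the digits directly.
employees_regex <- str_match(script_text, '"employees":"(\\d+)"')[, 2]

# Convert to integer and add thousands separators for display.
formatted <- format(as.integer(employees_regex), big.mark = ",")
print(formatted)
```

Parsing the JSON is the better fit when you need several fields from the object, while the regex is shorter but depends on the exact `"employees":"..."` spelling in the response.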
QHarr