
I feel this is supposed to be simple, but I have been struggling to get it right. I'm trying to extract the Employees number ("2,300,000") from this webpage: https://fortune.com/company/walmart/

I used the Chrome extension SelectorGadget to locate the number, which gave me the selector ".info__row--7f9lE:nth-child(13) .info__value--2AHH7".

```
library(RSelenium)
library(rvest)
library(netstat)

rs_driver_object<-rsDriver(browser='chrome',chromever='103.0.5060.53',verbose=FALSE, port=free_port())
remDr<-rs_driver_object$client
remDr$navigate('https://fortune.com/company/walmart/')
Employees<-remDr$findElement(using = 'xpath','//h3[@class="info__row--7f9lE:nth-child(13) .info__value--2AHH7"]')
Employees
```

This throws an error:

> Selenium message: no such element: Unable to locate element

I have also tried:
```
Employees<-remDr$findElement(using = 'class name','info__value--2AHH7')
```
But it does not return the data in the form I want.


Can someone point out the problem? I'd really appreciate it!

Update: I modified the code as suggested by Frodo below to apply it to multiple webpages and save the statistics as a dataframe. But I still encountered an error.

```
library(RSelenium)
library(rvest)
library(netstat)

rs_driver_object <- rsDriver(browser = 'chrome', chromever = '103.0.5060.53', verbose = FALSE, port = netstat::free_port())
remDr <- rs_driver_object$client

Data <- data.frame("url" = c("https://fortune.com/company/walmart/",
                             "https://fortune.com/company/amazon-com/",
                             "https://fortune.com/company/apple/",
                             "https://fortune.com/company/cvs-health/",
                             "https://fortune.com/company/jpmorgan-chase/",
                             "https://fortune.com/company/verizon/",
                             "https://fortune.com/company/ford-motor/",
                             "https://fortune.com/company/general-motors/",
                             "https://fortune.com/company/anthem/",
                             "https://fortune.com/company/centene/",
                             "https://fortune.com/company/fannie-mae/",
                             "https://fortune.com/company/comcast/",
                             "https://fortune.com/company/chevron/",
                             "https://fortune.com/company/dell-technologies/",
                             "https://fortune.com/company/bank-of-america-corp/",
                             "https://fortune.com/company/target/"))

# Placeholder column for the employee counts (html_text() returns character)
Data$numEmp <- NA_character_

for (i in seq_along(Data$url)) {
  remDr$navigate(url = Data$url[i])
  pgSrc <- remDr$getPageSource()
  pgCnt <- read_html(pgSrc[[1]])
  Data$numEmp[i] <- pgCnt %>%
    html_nodes(xpath = "//div[text()='Employees']/following-sibling::div") %>%
    html_text(trim = TRUE)
}
Data$numEmp
```

> Selenium message: unknown error: unexpected command response (Session info: chrome=103.0.5060.114) Build info: version: '4.0.0-alpha-2', revision: 'f148142cf8', time: '2019-07-01T21:30:10' System info: host: 'DESKTOP-VCCIL8P', ip: '192.168.1.249', os.name: 'Windows 10', os.arch: 'amd64', os.version: '10.0', java.version: '1.8.0_311' Driver info: driver.version: unknown
>
> Error: Summary: UnknownError Detail: An unknown server-side error occurred while processing the command. class: org.openqa.selenium.WebDriverException Further Details: run errorDetails method

Can someone please take another look?

– Xian Zhao

2 Answers


Does it have to be with RSelenium only? In my experience, the most flexible approach is to use RSelenium to navigate to the required pages (where findElement helps you find boxes to enter text into or buttons to click) and then use rvest to extract what you need from the page.

Start with

```
rs_driver_object <- rsDriver(browser = 'chrome', chromever = '103.0.5060.53', verbose = FALSE, port = netstat::free_port())
remDr <- rs_driver_object$client
remDr$navigate('https://fortune.com/company/walmart/')
page_source <- remDr$getPageSource()
pg <- xml2::read_html(page_source[[1]])
```

How you then go about it depends on how specific you want the solution to be with respect to this exact page. Here is one way:

```
rvest::html_elements(pg, "div.info__row--7f9lE") |>
  rvest::html_text2()
```

or

```
rvest::html_elements(pg, "div:nth-child(13) > div.info__value--2AHH7") |>
  rvest::html_text2()
```

or

```
rvest::html_elements(pg, "div.info__row--7f9lE")[11] |>
  rvest::html_children()
```

or

```
rvest::html_elements(pg, '.info__row--7f9lE:nth-child(13) .info__value--2AHH7') |>
  rvest::html_text2()
```

et cetera. What you do in the rvest part would depend on how general you want the selection/extraction process to be.
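For example, to make the lookup independent of both the row's position and the hashed class suffixes, you could key on the row labels themselves. This is just a sketch, not from the original page docs: it assumes each info__row--7f9lE div contains exactly two child divs (label first, value second), which is how the page was structured when this question was asked.

```
# Sketch: build a named vector of label/value pairs, then look up "Employees".
# Assumes each info row holds exactly two divs: a label and a value.
rows   <- rvest::html_elements(pg, "div.info__row--7f9lE")
labels <- rvest::html_element(rows, "div:first-child") |> rvest::html_text2()
values <- rvest::html_element(rows, "div:last-child") |> rvest::html_text2()
setNames(values, labels)[["Employees"]]
```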

– Steve G. Jones

Use RSelenium to load up the webpage and get the page source:

```
remDr$navigate(url = 'https://fortune.com/company/walmart/')
pgSrc <- remDr$getPageSource()
```

Use rvest to read the contents of the webpage:

```
pgCnt <- read_html(pgSrc[[1]])
```

Further, use the rvest::html_nodes and rvest::html_text functions to extract the text using relevant XPath selectors (this Chrome extension should help):

```
reqTxt <- pgCnt %>%
  html_nodes(xpath = "//div[text()='Employees']/following-sibling::div") %>%
  html_text(trim = TRUE)
```

Output of reqTxt:

```
> reqTxt
[1] "2,300,000"
```

UPDATE

The error Selenium message: unknown error: unexpected command response seems to occur specifically with version 103 of ChromeDriver. More info here. One of the answers there suggested a simple wait of 5 seconds before and after the driver navigates to the URL. I have also used tryCatch inside a while loop to keep the code running until the page loads. This seems to work.

```
# Function to fetch the employee count, retrying until navigation succeeds
getEmployees <- function(myURL) {
  pagestatus <<- 0
  while (pagestatus == 0) {
    tryCatch(
      expr = {
        remDr$navigate(url = myURL)
        pagestatus <<- 1
      },
      error = function(e) {
        pagestatus <<- 0
      }
    )
  }
  pgSrc <- remDr$getPageSource()
  pgCnt <- read_html(pgSrc[[1]])
  return(pgCnt %>%
           html_nodes(xpath = "//div[text()='Employees']/following-sibling::div") %>%
           html_text(trim = TRUE))
}
```

Apply this function to all of the URLs in your dataframe:

```
for (i in 1:nrow(Data)) {
  Sys.sleep(5)
  Data[i, 2] <- getEmployees(Data[i, 1])
  Sys.sleep(5)
}
```

Now when we look at the output of the second column:

```
> Data[, 2]
 [1] "2,300,000" "1,608,000" "154,000"   "258,000"   "271,025"   "118,400"  
 [7] "183,000"   "157,000"   "98,200"    "72,500"    "7,400"     "189,000"  
[13] "42,595"    "133,000"   "208,248"   "450,000"  
```
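If you then want the counts as numbers rather than strings, one small optional follow-up (not part of the scraping itself) is to strip the thousands separators and convert:

```
# Optional: drop the commas and convert the counts to numeric
Data$numEmp <- as.numeric(gsub(",", "", Data$numEmp))
```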
– Frodo
  • Thanks, Frodo. I was modifying your code to scrape multiple webpages, but I'm quite new to loops. Could you please take another look at my code? I updated it. Thanks in advance. – Xian Zhao Jul 17 '22 at 20:18