
For the entries from this link, I need to click each entry and then crawl the URL of the Excel file's path shown in the bottom-left part of the page:


How could I achieve that using web scraping packages in R such as rvest, etc.? Sincere thanks in advance.

library(rvest)

# Start by reading a HTML page with read_html():
common_list <- read_html("http://www.csrc.gov.cn/csrc/c100121/common_list.shtml")
common_list %>%
  # extract all link (<a>) nodes
  rvest::html_nodes("a") %>%
  # extract the link text
  rvest::html_text() -> webtxt
# inspect
head(webtxt)

First, my question is: how can I correctly set html_nodes to get the URL of each web page?


Update:

> driver
$client
[1] "No sessionInfo. Client browser is mostly likely not opened."

$server
PROCESS 'file105483d2b3a.bat', running, pid 37512.
> remDr
$remoteServerAddr
[1] "localhost"

$port
[1] 4567

$browserName
[1] "chrome"

$version
[1] ""

$platform
[1] "ANY"

$javascript
[1] TRUE

$nativeEvents
[1] TRUE

$extraCapabilities
list()

When I run remDr$navigate(url):

Error in checkError(res) : 
  Undefined error in httr call. httr output: length(url) == 1 is not TRUE

1 Answer


Using rvest to get the links:

library(rvest)
library(dplyr)
library(RSelenium)

# listing page from the question
url <- "http://www.csrc.gov.cn/csrc/c100121/common_list.shtml"

link <- url %>%
  read_html() %>%
  html_nodes('.mt10')

# the second .mt10 node contains the list of entries
link <- link[[2]] %>%
  html_nodes("a") %>%
  html_attr('href') %>%
  paste0('http://www.csrc.gov.cn', .)

 [1] "http://www.csrc.gov.cn/csrc/c101921/c1758587/content.shtml"                         
 [2] "http://www.csrc.gov.cn/csrc/c101921/c1714636/content.shtml"                         
 [3] "http://www.csrc.gov.cn/csrc/c101921/c1664367/content.shtml"                         
 [4] "http://www.csrc.gov.cn/csrc/c101921/c1657437/content.shtml"                         
 [5] "http://www.csrc.gov.cn/csrc/c101921/c1657426/content.shtml"     
       

We can use RSelenium to loop over the links and download the Excel files. It took me over a minute to completely load a single webpage, so I will demonstrate here using a single link.

url <- "http://www.csrc.gov.cn/csrc/c101921/c1758587/content.shtml"
# launch the browser
driver <- rsDriver(browser = c("chrome"))
remDr <- driver[["client"]]

# click on the excel file path
remDr$navigate(url)
remDr$findElement('xpath', '//*[@id="files"]/a')$clickElement()
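
To loop over all the links instead of a single one, a minimal sketch is below. It assumes every content page exposes the attachment link under the same `//*[@id="files"]/a` xpath and that the fixed `Sys.sleep()` pause is long enough for each page to finish loading; instead of clicking, it stores the file name and URL as the dataframe columns `excel_filename` and `url`.

# loop over every link, collecting the Excel file's name and URL
results <- data.frame(excel_filename = character(), url = character())

for (l in link) {
  remDr$navigate(l)
  Sys.sleep(60)  # the pages load slowly; adjust as needed
  elem <- remDr$findElement('xpath', '//*[@id="files"]/a')
  results <- rbind(results, data.frame(
    excel_filename = elem$getElementText()[[1]],
    url = elem$getElementAttribute("href")[[1]]
  ))
}

results
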
  • Many thanks for your help. Please note I don't need to download the Excel files; looping over the pages and storing the results as a dataframe with the column names `excel_filename` and `url` would be perfect. – ah bon Jan 11 '22 at 07:10
  • @NadPat to add to your answer, there is a POST API that can be used. You can send a POST request to `http://www.csrc.gov.cn/getManuscriptData`. The body of the POST request takes a form-encoded string with two arguments, `mId` and `status`. `status` is always 4; `mId` is the character sequence in "cxxxxxxx", 1758587 in this case. It returns a JSON string. – ekoam Jan 11 '22 at 07:15
  • Thanks for posting this insightful comment @ekoam, would you mind sharing your solution? I know almost nothing about web crawling in R. – ah bon Jan 11 '22 at 07:23
  • @ahbon This is already a good (and robust) solution to your problem. If you don't really understand how to send a POST request, then you should probably not do it, since it is a bit technical. If you understand POST requests well enough, the comment above already provides enough information to work with. In R, you can send a POST request using `httr::POST`. The documentation is [here](https://www.rdocumentation.org/packages/httr/versions/1.4.2/topics/POST). – ekoam Jan 11 '22 at 07:33
  • OK, I'll look at this solution in detail; thanks for sharing your ideas and solutions. – ah bon Jan 11 '22 at 07:38
  • I tested, it raises an error: `Undefined error in httr call. httr output: length(url) == 1 is not TRUE`. – ah bon Jan 11 '22 at 08:39
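
A minimal sketch of the POST approach described by @ekoam above, using `httr::POST` and `jsonlite`. The endpoint, the form fields `mId` and `status`, and the value 1758587 come from the comments; the structure of the returned JSON is not documented here, so the sketch only parses and inspects it rather than assuming any field names.

library(httr)
library(jsonlite)

# mId is the numeric part of the "cxxxxxxx" segment of the content URL
mId <- "1758587"

res <- POST(
  "http://www.csrc.gov.cn/getManuscriptData",
  body = list(mId = mId, status = "4"),
  encode = "form"  # send as application/x-www-form-urlencoded
)

# parse the JSON string returned by the endpoint and inspect its structure
# before extracting the excel file name and url
manuscript <- fromJSON(content(res, as = "text", encoding = "UTF-8"))
str(manuscript)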