I had a look at the question "Inputting NA where there are missing values when scraping with rvest", which has a great answer.

Goal: Achieve the same result with xpath.

The example there uses CSS selectors:

xx <- read_html("https://channel9.msdn.com/Events/useR-international-R-User-conferences/useR-International-R-User-2017-Conference?sort=status&direction=desc&page=14")   
xx %>% html_nodes(xpath = "/html/body/main/section[2]/div/article") %>%  
       map_df(~list(title = html_nodes(.x, css = 'header h3 a') %>% 
        html_text() %>% {if(length(.) == 0) NA else .},    # replace length-0 elements with NA
        length = html_nodes(.x, css = 'a time') %>% 
        html_text() %>%  {if(length(.) == 0) NA else .}))

Question: How can it be done with xpath?

I think the XPath should actually be:

'/header/h3/a'

What I tried:

## XPath
xx <- read_html("https://channel9.msdn.com/Events/useR-international-R-User-conferences/useR-International-R-User-2017-Conference?sort=status&direction=desc&page=14")   
xx %>% html_nodes(xpath = "/html/body/main/section[2]/div/article") %>%  
  map_df(~list(title = html_nodes(.x, xpath = '/header/h3/a') %>% 
                 html_text() %>% {if(length(.) == 0) NA else .},    # replace length-0 elements with NA
               length = html_nodes(.x, xpath = '/a/time') %>% 
                 html_text() %>%  {if(length(.) == 0) NA else .}))
Tlatwork

1 Answer

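Your XPath should be header/h3/a, not /header/h3/a. A leading slash makes the path absolute, so it is evaluated from the root of the document again rather than from the current node (here, each article). The same applies to the second expression: use a/time, not /a/time.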

library(rvest)   # read_html(), html_nodes(), html_text(), %>%
library(purrr)   # map_df()

xx <- read_html("https://channel9.msdn.com/Events/useR-international-R-User-conferences/useR-International-R-User-2017-Conference?sort=status&direction=desc&page=14")
xx %>% html_nodes(xpath = "/html/body/main/section[2]/div/article") %>%  
  map_df(~list(title = html_nodes(.x, xpath = 'header/h3/a') %>% 
                 html_text() %>% {if(length(.) == 0) NA else .},    # replace length-0 elements with NA
               length = html_nodes(.x, xpath = 'a/time') %>% 
                 html_text() %>%  {if(length(.) == 0) NA else .}))

#   title                                                                        length  
#   <chr>                                                                        <chr>   
# 1 " Introduction to Natural Language Processing with R II"                     01:15:00
# 2 " Introduction to Natural Language Processing with R"                        01:22:13
# 3 " Solving iteration problems with purrr II"                                  01:22:49
# 4 " Solving iteration problems with purrr"                                     01:32:23
# 5 Markov-Switching GARCH Models in R: The MSGARCH Package                      15:55   
# 6 Interactive bullwhip effect exploration using SCperf and Shiny               16:02   
# 7 Actuarial and statistical aspects of reinsurance in R                        14:15   
# 8 Transformation Forests                                                       16:19   
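
To see the absolute vs. relative distinction without hitting the network, here is a minimal sketch on a made-up two-article snippet. The HTML and the variable names (`page`, `articles`, `abs_hits`, `rel_title`, `lengths`) are assumptions for illustration, not the real page. The last step uses `html_node()` (singular) as an alternative to the `length(.) == 0` check; as far as I know it returns one result per input node and yields `NA` where nothing matches.

```r
library(rvest)

# Hypothetical markup: the second article has no <time> element
page <- read_html('
  <article><header><h3><a>First talk</a></h3></header><a><time>10:00</time></a></article>
  <article><header><h3><a>Second talk</a></h3></header></article>
')

articles <- html_nodes(page, xpath = "//article")

# Absolute path: starts at the document root, so nothing matches
abs_hits <- html_nodes(articles[[1]], xpath = "/header/h3/a")
length(abs_hits)   # 0

# Relative path: starts at the current <article> node
rel_title <- html_text(html_nodes(articles[[1]], xpath = "header/h3/a"))
rel_title          # "First talk"

# html_node() keeps one slot per article, with NA for the missing <time>
lengths <- html_text(html_node(articles, xpath = "a/time"))
lengths
```

If you want descendants at any depth rather than direct children, `.//time` (with the leading dot) stays relative to the current node, whereas `//time` would search the whole document.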
MrFlick