Different number of nodes

Question

I want to scrape airline reviews from airlinequality.com, where ratings for different aspects of a flight are available. When writing a flight review, not all fields are mandatory. As a result, different reviews have different numbers of elements, which my current code can't handle.

For example, I want to get reviews from this page: http://www.airlinequality.com/airline-reviews/austrian-airlines/page/1/

There are 10 reviews with a Seat Comfort rating, but Inflight Entertainment is available in only 8. In the end, this creates two vectors of different lengths, which can't be merged.

My code:

library(rvest)
library(stringr)

review_html_temp = read_html("http://www.airlinequality.com/airline-reviews/austrian-airlines/page/1/")

review_seat_comfort = review_html_temp %>%
  html_nodes(xpath = './/table[@class = "review-ratings"]//td[@class = "review-rating-header seat_comfort"]/following-sibling::td/span[@class = "star fill"][last()]') %>%
  html_text() %>%
  str_replace_all(pattern = "[\r\n\t]" , "")

review_entertainment = review_html_temp %>%
  html_nodes(xpath = './/table[@class = "review-ratings"]//td[@class = "review-rating-header inflight_entertainment"]/following-sibling::td//span[@class = "star fill"][last()]') %>%
  html_text() %>%
  str_replace_all(pattern = "[\r\n\t]" , "")

Is there a way to fill the entertainment value with " " or NA when the node is not present, so that I get a value for all 10 reviews? The final result would look like:

seat_comfort: "4" "5" "3" "3" "1" "4" "4" "3" "3" "3"
entertainment_system: "5" "1" NA "1" "1" "3" NA "3" "5" "1"

The answer will be very similar to what I learned here: https://stackoverflow.com/questions/41708685/equivalent-of-which-in-scraping — MichaelChirico, Dec 03 '17 at 14:41
I filed an issue since this seems to be a common goal in `rvest` and I don't think the documentation does a good job of communicating this solution: https://github.com/hadley/rvest/issues/206 — MichaelChirico, Dec 03 '17 at 15:27

MichaelChirico · Answer 1 · 2017-12-03T15:14:41.733

The key is that html_nodes(...) %>% html_node(...) will return an entry corresponding to each node returned by html_nodes, provided the path passed to html_node is relative. IIUC this means html_node treats each returned node as its own root and returns a single node for each root (in particular, returning NA for nodes where the subsequent call goes unmatched); starting the html_node path with // resets the search to the overall page root. I'm not 100% sure of this interpretation, but in practice it means the following can work. (NB: I had to download the page as HTML, since the site loads dynamically (for me at least) and isn't read by a simple read_html.)

URL = '~/Desktop/airlines.html'
#get to table; we end at tbody here instead of tr
#  since we only want one entry for each "table" on the
#  page (i.e., for each review); if we add tr there,
#  the html_nodes call will give us an element for
#  _each row of each table_.
tbl = read_html(URL) %>% 
  html_nodes(xpath = '//table[@class="review-ratings"]/tbody')
#note the %s where we'll substitute the particular element we want
star_xp = paste0('tr/td[@class="%s"]/following-sibling::',
                 'td[@class="review-rating-stars stars"]',
                 '/span[@class="star fill"][last()]') 

tbl %>% 
  html_node(xpath = sprintf(star_xp, "review-rating-header seat_comfort")) %>% 
  html_text
#  [1] NA  "4" "5" "3" "3" "1" "4" "4" "3" "3" "3"

This is pretty ugly, but follows the flow of extractions I'm accustomed to seeing. I guess the following would be more magrittr-y/easy on the eyes, though a bit nonlinear:

star_xp %>% sprintf("review-rating-header seat_comfort") %>%
  html_node(x = tbl, xpath = .) %>% html_text
#  [1] NA  "4" "5" "3" "3" "1" "4" "4" "3" "3" "3"

And for the other:

star_xp %>% sprintf("review-rating-header inflight_entertainment") %>%
  html_node(x = tbl, xpath = .) %>% html_text
#  [1] NA  NA  "5" "1" "1" "1" "3" "3" "5" NA  "1"
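
The NA-for-missing behaviour can be checked offline on a small inline document. This is only a sketch: the markup below is hypothetical (modelled on the class names in the XPath above, not copied from the site), and rating_row and review_table are helper names made up for the illustration.

```r
library(rvest)

# Hypothetical helper: one rating row with `n` filled stars out of 5.
rating_row <- function(key, n) {
  spans <- sprintf('<span class="%s">%d</span>',
                   ifelse(1:5 <= n, "star fill", "star"), 1:5)
  paste0('<tr><td class="review-rating-header ', key, '">', key, '</td>',
         '<td class="review-rating-stars stars">',
         paste(spans, collapse = ""), '</td></tr>')
}

# Hypothetical helper: wrap rows in one "review" table.
review_table <- function(...) {
  paste0('<table class="review-ratings"><tbody>', ..., '</tbody></table>')
}

# Two reviews; the second one has no entertainment rating.
page <- read_html(paste0(
  "<html><body>",
  review_table(rating_row("seat_comfort", 4),
               rating_row("inflight_entertainment", 5)),
  review_table(rating_row("seat_comfort", 3)),
  "</body></html>"
))

# One node per review.
tbl <- html_nodes(page, xpath = '//table[@class="review-ratings"]/tbody')

# Relative path (no leading //), evaluated once per element of tbl.
star_xp <- paste0('tr/td[@class="%s"]/following-sibling::',
                  'td[@class="review-rating-stars stars"]',
                  '/span[@class="star fill"][last()]')

seat <- tbl %>%
  html_node(xpath = sprintf(star_xp, "review-rating-header seat_comfort")) %>%
  html_text()
ent <- tbl %>%
  html_node(xpath = sprintf(star_xp, "review-rating-header inflight_entertainment")) %>%
  html_text()

# Both vectors have one entry per review, so they can be combined.
reviews <- data.frame(seat_comfort = seat, entertainment = ent,
                      stringsAsFactors = FALSE)

# For contrast: a page-wide search with html_nodes keeps only the matches.
flat <- html_nodes(page, xpath = paste0(
  "//", sprintf(star_xp, "review-rating-header inflight_entertainment")))
```

Here seat and ent both have length 2, with NA in ent for the second review, while flat contains only the single match, which is the mismatch in lengths the question ran into.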

When I run this code, I get no results for tbl: tbl {xml_nodeset (0)} — user3577904, Dec 03 '17 at 15:48
@user3577904 my guess is you didn't download the page. The page is dynamically generated, so `read_html("http://www.airlinequality.com/airline-reviews/austrian-airlines/page/1/")` doesn't actually return anything. How to do dynamic scraping is a separate issue. — MichaelChirico, Dec 04 '17 at 02:14

score 0 · Answer 2 · answered Dec 03 '17 at 19:46

IIUC, the desired output could be obtained by working with two assumptions. First, in every table, the seat_comfort class comes before the inflight_entertainment class. Second, any two consecutive nodes with the same class will need to be separated by an NA. Essentially, you cannot have two consecutive seat_comfort classes without an inflight_entertainment class in between them. In sum, your td tags should be something like:

<td class="review-rating-header seat_comfort">Seat Comfort</td>
<td class="review-rating-header inflight_entertainment">Inflight Entertainment</td>
<td class="review-rating-header seat_comfort">Seat Comfort</td>
<td class="review-rating-header inflight_entertainment">Inflight Entertainment</td>
...

However, the source of the provided page has a couple of repeating seat_comfort classes. Therefore, you may have to loop through all nodes with either inflight_entertainment or seat_comfort and fill in the gaps where there are two consecutive tags with the same class. The following is an illustration:

library(rvest)


URL <- 'http://www.airlinequality.com/airline-reviews/austrian-airlines/page/1/'
query <- './/table[@class="review-ratings"]//td[contains(@class, "review-rating-header seat_comfort") or contains(@class, "review-rating-header inflight_entertainment")]'
sub_query <- './/following-sibling::td//span[@class = "star fill"][last()]'

review_html_temp <- read_html(URL)

# Get all nodes with either seat_comfort or inflight_entertainment rows
all_nodes <- review_html_temp %>%
  html_nodes(xpath = query)

# Use the text at the first node as a variable to test conditions
current_string <- html_text(all_nodes[1])

# Get the first review value
first_review <- html_nodes(all_nodes[1], xpath = sub_query) %>%
  html_text()

# Loop through all nodes starting at the second node
# Check if the current node's text is the same as the global condition variable
# If so, prepend the review values with an NA
# Otherwise, return the review values
all_output <- lapply(all_nodes[2:length(all_nodes)], function(node) {
  node_text <- html_text(node)
  if (node_text == current_string) {
    current_string <<- node_text
    output <- html_nodes(node, xpath = sub_query) %>%
      html_text()
    c(NA, output)
  } else {
    current_string <<- node_text
    html_nodes(node, xpath = sub_query) %>%
      html_text()
  }
})

# Prepend the first review value to the output
all_output <- c(first_review, unlist(all_output))

# Select seat_comfort and inflight_entertainment
seat_comfort <- all_output[seq(1, length(all_output), 2)]
entertainment <- all_output[seq(2, length(all_output), 2)]

# Make a data.frame
data.frame(seat_comfort=seat_comfort,
           entertainment=entertainment,
           stringsAsFactors = FALSE)

The data frame defined above should look something like the following:

 seat_comfort entertainment
 4            <NA>         
 5            5            
 3            1            
 3            1            
 1            1            
 4            3            
 4            3            
 3            5            
 3            <NA>         
 3            1
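
The gap-filling idea can also be tried without any scraping. The sketch below is a vectorised variant, not the loop above: it assumes the header labels and star values have already been extracted in page order (the labels and values vectors here are made-up data), and it relies on the first assumption that every review starts with a seat_comfort row.

```r
# Made-up input: labels and star values in page order.
labels <- c("seat_comfort", "inflight_entertainment",
            "seat_comfort",
            "seat_comfort", "inflight_entertainment")
values <- c("4", "5", "3", "1", "1")

fields <- c("seat_comfort", "inflight_entertainment")

# Every seat_comfort label starts a new review.
review_id <- cumsum(labels == "seat_comfort")

# Start from an all-NA grid, then fill only the cells that were observed.
out <- matrix(NA_character_, nrow = max(review_id), ncol = length(fields),
              dimnames = list(NULL, fields))
out[cbind(review_id, match(labels, fields))] <- values

result <- as.data.frame(out, stringsAsFactors = FALSE)
```

In result, the second review keeps NA for inflight_entertainment because no value was ever written to that cell.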

I hope this helps.
