R html_node challenge, apply multiple html_node to extract same information, then combine the information

Question

I have run into a problem: the page layout is not standardized. I would like to extract the user names from the page.
However, some entries store the name in <a>, some store it in both <a> and <span>, and some store it only in <span>.

  library(rvest)

  url <- "https://stackoverflow.com/questions/12573816/what-is-an-undefined-reference-unresolved-external-symbol-error-and-how-do-i-fix"
  page <- read_html(url, encoding = "utf-8")

My idea was to extract the names from <a> into one vector and the names from <span> into another, then compare and combine the two. However, it is really hard to merge the two vectors into a single vector that contains all the names in the correct sequence.

  user_answeredquestion_a <- page %>%
    html_nodes(xpath = "//div[starts-with(@id, 'answer-' )]/div/div[2]/div[2]/div/div[last()]/div/div[last()]/a[last()]") %>%
    html_text()
  user_answeredquestion_a

  user_answeredquestion_span <- page %>%
    html_nodes(xpath = "//div[contains(@id, 'answer-' )]/div/div[2]/div[2]/div/div[last()]/div/div[last()]/span") %>%
    html_text()
  user_answeredquestion_span

The page contains 30 records, so the final user-name vector should have length 30. However, user_answeredquestion_span returned only 29 records because it missed Kastaneda. Similarly, user_answeredquestion_a returned 29 records because it missed user4272649.
This makes it really hard to compare and combine the two vectors into a new vector that holds all 30 records in the correct sequence.

  ### Which elements of x are missing from y:
  ###   x[!x %in% y]

  ### Missing from user_answeredquestion_span
  user_answeredquestion_a[!user_answeredquestion_a %in% user_answeredquestion_span]
  ### "Kastaneda"

  ### Missing from user_answeredquestion_a
  user_answeredquestion_span[!user_answeredquestion_span %in% user_answeredquestion_a]
  ### "user4272649"

I also tried using both XPaths together, but that returned 58 records, which does not make sense to me.

  ### Get the name from <a> or <span>
  user_answeredquestion_all <- page %>%
    html_nodes(xpath = "//div[starts-with(@id, 'answer-' )]/div/div[2]/div[2]/div/div[last()]/div/div[last()]/a[last()] | //div[contains(@id, 'answer-' )]/div/div[2]/div[2]/div/div[last()]/div/div[last()]/span") %>%
    html_text()
  user_answeredquestion_all
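
A likely reading of the 58: an XPath union (`|`) returns every node matched by either side, so 29 <a> nodes plus 29 <span> nodes give 58. It does not collapse the two nodes that belong to one record. A minimal sketch on toy markup follows; the `user-details` class and the names alice, bob, and carol are assumptions for illustration, not the real page.

```r
library(rvest)

# Hypothetical markup: three records, with the name in both tags, only <a>, or only <span>
page <- read_html('<html><body>
  <div class="user-details"><a>alice</a><span>alice</span></div>
  <div class="user-details"><a>bob</a></div>
  <div class="user-details"><span>carol</span></div>
</body></html>')

a_names <- page %>%
  html_nodes(xpath = "//div[@class='user-details']/a") %>%
  html_text()
span_names <- page %>%
  html_nodes(xpath = "//div[@class='user-details']/span") %>%
  html_text()
both <- page %>%
  html_nodes(xpath = "//div[@class='user-details']/a | //div[@class='user-details']/span") %>%
  html_text()

length(a_names)     # 2
length(span_names)  # 2
length(both)        # 4 = 2 + 2, although there are only 3 records
```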

May I know the proper way to handle an inconsistent page structure?
HTML screenshots below:

Kastaneda

user4272649

Other users, whose names are stored in both <a> and <span>

29 elements in the vector

Is the correct sequence important? It adds a layer of complexity when looking across a range of pages. – QHarr, Oct 28 '20 at 07:09

score 1 · Answer 1 · answered Oct 27 '20 at 13:08

Try this

library(rvest)
library(xml2)  # xml_remove() comes from xml2; attach it explicitly in case rvest does not

url <- "https://stackoverflow.com/questions/12573816/what-is-an-undefined-reference-unresolved-external-symbol-error-and-how-do-i-fix?page=1&tab=votes#tab-top"
path_to_flairs <- "//div[@class='-flair']"
path_to_answerers <- "//div[@class='grid fw-wrap ai-start jc-end gs8 gsy']/div[last()]/div/div[@class='user-details'][last()]/*[last()]"

page <- read_html(url)
# Remove user flairs (e.g. reputation and gold badges) so that the user name is always the last child
xml_remove(html_nodes(page, xpath = path_to_flairs))
page %>% html_nodes(xpath = path_to_answerers) %>% html_text()

Output

 [1] "Luchian Grigore" "Luchian Grigore" "Luchian Grigore" "Luchian Grigore" "Svalorzen"       "Kastaneda"       "Luchian Grigore" "sgryzko"         "Luchian Grigore"
[10] "Luchian Grigore" "Nima Soroush"    "πάντα ῥεῖ"       "Dula"            "Niall"           "Malvineous"      "user4272649"     "developerbmw"    "Niall"          
[19] "kiriloff"        "Plankalkül"      "Mike Kinghan"    "JDiMatteo"       "Niall"           "fafaro"          "Niall"           "Niall"           "Andreas H."     
[28] "Stypox"          "ead"             "Mike Kinghan"
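
To check the idea without hitting the network, the same two steps can be run on a small inline document. This is only a sketch: the markup below is hypothetical and merely imitates the three layouts described in the question (the flair contents and the names are made up).

```r
library(rvest)
library(xml2)  # xml_remove() comes from xml2

# Hypothetical markup: name in both tags, only in <a>, or only in <span>
page <- read_html('<html><body>
  <div class="user-details"><a>alice</a><span>alice</span><div class="-flair">1,234</div></div>
  <div class="user-details"><a>bob</a><div class="-flair">56</div></div>
  <div class="user-details"><span>carol</span></div>
</body></html>')

# Step 1: drop the flair blocks, which would otherwise be the last child
xml_remove(html_nodes(page, xpath = "//div[@class='-flair']"))

# Step 2: the last remaining child element of each record now holds the name
answerers <- page %>%
  html_nodes(xpath = "//div[@class='user-details']/*[last()]") %>%
  html_text()
answerers  # "alice" "bob" "carol"
```

Because `xml_remove()` modifies `page` in place, re-read the page if the flair nodes are needed later.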

ekoam


Hi ekoam, thanks for sharing your method. May I check that my understanding is correct? Your method first removes the class='-flair' nodes, so that under each class="user-details" parent the user name automatically becomes the last element. Regardless of whether the markup has only `<a>`, only `<span>`, or both `<a>` and `<span>` containing the user name, I just need to extract the last element with the XPath. The last element will always be the user name. Thanks. – JC46JC Oct 28 '20 at 08:25
It looks like a smart way to extract the data, thanks for sharing. By the way, I would like to know whether it is possible to add a condition to an XPath, so that the program can pick the correct path to get the data. I assume it is quite common for a website's layout not to be well structured and standardized. How can we overcome that challenge? Thanks in advance. – JC46JC Oct 28 '20 at 08:28
Your understanding is correct. For your second question, xpath does support if-else conditions. See [this](https://stackoverflow.com/a/971142/10802499). However, I am not sure whether or not rvest supports xpath 2.0. @JC46JC – ekoam Oct 28 '20 at 10:19
Doesn't always preserve order. Try with: _https://stackoverflow.com/questions/10714251/how-to-avoid-using-select-in-excel-vba_ – QHarr Oct 29 '20 at 05:48
@QHarr. I tried. Nothing wrong with the XPath. The order is preserved. Geoff Griswald appears after barneyos because `read_html` makes it so. If you simply try `write_html(read_html(url), "test.html")` and open "test.html" in a browser, you will also see those two names in that "wrong" order. Maybe it's a bug. It's also possible that stackoverflow uses javascript to reorder the page after loading it. Anyway, we can't solve it with an XPath. – ekoam Oct 29 '20 at 06:30
@QHarr Another interesting observation: if you are using a Chrome browser v86.0.4240.111, try reloading that page several times. You will see those two answers swap their positions sometimes. I guess it's really a javascript issue. – ekoam Oct 29 '20 at 06:37
Aha! Then that would make more sense!!!! I was using css of `.answercell [itemtype="http://schema.org/Person"] > *:first-child` and was utterly puzzled by that. The positioning of Vityata got me. – QHarr Oct 29 '20 at 18:32
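
On the if-else question raised in the comments: xml2, which rvest uses, is built on libxml2, and libxml2 implements XPath 1.0, which has no if/else expression. One alternative that stays within XPath 1.0 is to select one node per record first and then call `html_node()` (singular) on that set. It returns exactly one result per record, with NA where the tag is absent, so the vectors stay aligned. A minimal sketch on hypothetical markup (the `user-details` layout and the names are assumptions):

```r
library(rvest)

# Hypothetical markup: name in both tags, only in <a>, or only in <span>
page <- read_html('<html><body>
  <div class="user-details"><a>alice</a><span>alice</span></div>
  <div class="user-details"><a>bob</a></div>
  <div class="user-details"><span>carol</span></div>
</body></html>')

# One node per record
records <- html_nodes(page, xpath = "//div[@class='user-details']")

# html_node() yields one entry per record; html_text() gives NA for a missing tag
from_a    <- records %>% html_node(xpath = "./a") %>% html_text()
from_span <- records %>% html_node(xpath = "./span") %>% html_text()

# Prefer <a>, fall back to <span>; positions match the order of `records`
user_names <- ifelse(is.na(from_a), from_span, from_a)
user_names  # "alice" "bob" "carol"
```

This keeps the fallback logic in R rather than in the XPath, and the result has the same length and order as `records`.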
