I'm facing a challenge: the website layout is not standardized. I would like to extract the name from the page.
However, some pages store the name in <a>, some store it in both <a> and <span>, and some store it in <span>.
library(rvest)
url="https://stackoverflow.com/questions/12573816/what-is-an-undefined-reference-unresolved-external-symbol-error-and-how-do-i-fix"
page = read_html(url,encoding = "utf-8")
So I was thinking of extracting the names from <a> into one vector and the names from <span> into another vector, then comparing and combining the two. However, it is really hard to merge the two vectors into a single vector that contains all the information in the correct sequence.
user_answeredquestion_a = page %>% html_nodes(xpath="//div[starts-with(@id, 'answer-' )]/div/div[2]/div[2]/div/div[last()]/div/div[last()]/a[last()]") %>%
html_text()
user_answeredquestion_a
user_answeredquestion_span = page %>% html_nodes(xpath="//div[contains(@id, 'answer-' )]/div/div[2]/div[2]/div/div[last()]/div/div[last()]/span") %>%html_text()
user_answeredquestion_span
The page should contain 30 records, so the final user name vector should have length 30. However, user_answeredquestion_span returned only 29 records because it missed the record Kastaneda.
Similarly, user_answeredquestion_a returned 29 records; it missed the record user4272649.
In this case, it is really hard to compare and combine these two vectors into a new vector that has the correct sequence and contains all 30 records.
###which elements are missing from y with respect to x
# x[!x %in% y]
### missing from user_answeredquestion_span
user_answeredquestion_a[!user_answeredquestion_a %in% user_answeredquestion_span]
##"Kastaneda"
### missing from user_answeredquestion_a
user_answeredquestion_span[!user_answeredquestion_span %in% user_answeredquestion_a]
### "user4272649"
I also tried using both XPaths together, but that returns 58 records, which does not make sense.
### To get name from <a> or <span>
user_answeredquestion_all = page %>% html_nodes(xpath="//div[starts-with(@id, 'answer-' )]/div/div[2]/div[2]/div/div[last()]/div/div[last()]/a[last()] | //div[contains(@id, 'answer-' )]/div/div[2]/div[2]/div/div[last()]/div/div[last()]/span") %>%
html_text()
user_answeredquestion_all
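A possible explanation for the 58, offered as a guess rather than something verified on the live page: the XPath `|` operator returns the union of both node sets, so any record that has both an <a> and a <span> contributes two nodes, and 29 + 29 = 58. The sketch below reproduces the effect on a small made-up HTML fragment; the markup and the name `demo_page` are hypothetical and much simpler than the real page.

```r
library(rvest)

# Hypothetical miniature page: one record with both tags, one with only <a>,
# and one with only <span>.
demo_page <- read_html('
<div id="answer-1"><div><a>alice</a><span>alice</span></div></div>
<div id="answer-2"><div><a>Kastaneda</a></div></div>
<div id="answer-3"><div><span>user4272649</span></div></div>')

# Union of the <a> matches and the <span> matches.
both <- html_nodes(
  demo_page,
  xpath = "//div[starts-with(@id, 'answer-')]/div/a[last()] | //div[starts-with(@id, 'answer-')]/div/span"
)

# Three records, but four nodes: the record with both tags is counted twice.
length(both)
html_text(both)
```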
May I know what is the proper way to handle inconsistent page structure?
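One approach that might work, sketched under assumptions: select the record containers first, then look inside each container with `html_node()` (singular). As far as I understand rvest, `html_node()` returns exactly one result per input node, and `html_text()` gives `NA` where nothing matched, so both vectors keep the same length and order. The HTML fragment, the relative XPaths, and the names `demo_page`, `name_a`, `name_span`, and `user_names` are all hypothetical; the real page would need its own relative paths.

```r
library(rvest)

# Hypothetical miniature page with the three layouts described above.
demo_page <- read_html('
<div id="answer-1"><div><a>alice</a><span>alice</span></div></div>
<div id="answer-2"><div><a>Kastaneda</a></div></div>
<div id="answer-3"><div><span>user4272649</span></div></div>')

# Step 1: one node per record.
answers <- html_nodes(demo_page, xpath = "//div[starts-with(@id, 'answer-')]")

# Step 2: search relative to each record; a missing match yields NA.
name_a    <- answers %>% html_node(xpath = "./div/a[last()]") %>% html_text()
name_span <- answers %>% html_node(xpath = "./div/span") %>% html_text()

# Step 3: prefer the <a> text and fall back to the <span> text.
user_names <- ifelse(is.na(name_a), name_span, name_a)
user_names
```

Because each record produces exactly one entry, no after-the-fact comparison or reordering of the two vectors is needed.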
[HTML screenshot, labeled: Kastaneda]