How to scrape hyperlinks with R and keep them clickable in the output file?

Question

I am an R beginner trying to look through the top 500/1000 questions on Stack Overflow, sorted by votes or by frequency.

I need a data.frame with two variables that contain the question and the hyperlink of this question, respectively.

The webpage is here, and I need hyperlinks like this:

        <h3><a href="/questions/5963269/how-to-make-a-great-r-reproducible-example" class="question-hyperlink">How to make a great R reproducible example</a></h3>

The output should look like this:

         question                                    link
    1    how-to-make-a-great-r-reproducible-example  <questions/5963269/how-to-make-a-great-r-reproducible-example>
    2 ...
    3 ...

Sorry for the confusing question. I would like hyperlinks that open the webpage when clicked, so I build the full URL with

web <- data.frame(sapply(answer$link, function(x) {paste("https://stackoverflow.com",x, sep = "")}))

or

web <- data.frame(sapply(df[2], function(x) {paste("https://stackoverflow.com",x, sep = "")}))

web[1]
https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example

I want it to work like a normal link, e.g. "here" (How to make a great R reproducible example).

I wrote the result to a txt or CSV file, but the links do not open the page when I click them.

Could you help me improve this? Thanks again.

@DaveT @QHarr

You will have to prepend "https://stackoverflow.com" to each link for it to be clickable. But .txt and .csv files don't understand what a hyperlink is. If you want the hyperlinks to be active, you need to save the information to a more modern file type such as Word, Excel, or maybe even RTF. – Dave2e, Oct 02 '19 at 13:31
I tried writing them to Excel with the openxlsx package, but it did not work; maybe I did it wrong. I think the main problem is that the links are converted to plain strings when they are scraped, so they lose their hyperlink property. – Jiang Liang, Oct 02 '19 at 13:56
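
For the Excel route discussed in these comments, here is a minimal sketch rather than a definitive solution. It relies on openxlsx writing a column as clickable links when that column carries the class "hyperlink". The toy `answer` data frame, the sheet name, and the output path are assumptions standing in for the scraped data.

```r
library(openxlsx)

# Toy stand-in for the scraped data frame (assumption: full URLs already built)
answer <- data.frame(
  question = c("How to make a great R reproducible example",
               "Drop data frame columns by name"),
  link = c("https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example",
           "https://stackoverflow.com/questions/4605206/drop-data-frame-columns-by-name"),
  stringsAsFactors = FALSE
)

# Columns with class "hyperlink" are written as clickable links by openxlsx
class(answer$link) <- "hyperlink"

wb <- createWorkbook()
addWorksheet(wb, "questions")          # hypothetical sheet name
writeData(wb, "questions", x = answer)

out_file <- file.path(tempdir(), "questions.xlsx")
saveWorkbook(wb, out_file, overwrite = TRUE)
```

Opening the resulting file in Excel or LibreOffice should show the `link` column as active links. A plain character column, by contrast, is written as ordinary text, which may be why the earlier attempt did not behave as expected.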

Dave2e · Accepted Answer · 2019-10-02T13:34:11.917

This is a straightforward problem using the rvest package. The principle is to read the page, extract the desired nodes using CSS selectors, and then extract the requested information from those nodes.
The tricky part is isolating the links that belong to the questions and none of the others. In this case I needed a CSS selector 3-4 levels deep to separate them cleanly.

See the comments in the code for step-by-step instructions.

library(rvest)

url <- "https://stackoverflow.com/questions/tagged/r?tab=votes&page=1&pagesize=50"

# Read the page
page <- read_html(url)

# Get the hyperlink nodes: the 'a' tag under an 'h3' tag, under a 'div' tag of
# class 'summary', under a 'div' tag of class 'question-summary'
nodes <- html_nodes(page, "div.question-summary div.summary h3 a")

# Get the question text
question <- html_text(nodes)
# Get the link, prefixed with the site root so it is a full URL
link <- paste0("https://stackoverflow.com", html_attr(nodes, "href"))

answer <- data.frame(question, link)
head(answer)
                                                          question                                                                                                      link
1                       How to make a great R reproducible example                    https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example
2                    How to sort a dataframe by multiple column(s)                   https://stackoverflow.com/questions/1296646/how-to-sort-a-dataframe-by-multiple-columns
3      How to join (merge) data frames (inner, outer, left, right)          https://stackoverflow.com/questions/1299871/how-to-join-merge-data-frames-inner-outer-left-right
4 Grouping functions (tapply, by, aggregate) and the *apply family   https://stackoverflow.com/questions/3505701/grouping-functions-tapply-by-aggregate-and-the-apply-family
5                                  Drop data frame columns by name                               https://stackoverflow.com/questions/4605206/drop-data-frame-columns-by-name
6  Remove rows with all or some NAs (missing values) in data.frame https://stackoverflow.com/questions/4862178/remove-rows-with-all-or-some-nas-missing-values-in-data-frame

Thanks. As I explained in the revised question, do you know how to output a clickable hyperlink from R? @DaveT – Jiang Liang, Oct 02 '19 at 13:31

QHarr · Answer 2 · 2019-10-02T13:35:53.997

You should really use the API that Stack Exchange provides. However, you can do this with a single CSS selector that gathers the a tags by class attribute, then extract the text and the href with tidyverse functionality and assemble them into a tibble.

library(tidyverse)
library(rvest)

nodes <- read_html("https://stackoverflow.com/questions/tagged/r?tab=votes&page=1&pagesize=50") %>%
  html_nodes("[class=question-hyperlink]")

# Build one tibble row per node: the question text and its absolute link
df <- map_df(nodes, ~ {
  questions <- .x %>% html_text()
  links <- paste0("https://stackoverflow.com", .x %>% html_attr("href"))
  tibble(questions, links)
})
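
The API route recommended at the top of this answer could look roughly like the sketch below. It assumes version 2.3 of the Stack Exchange API and the httr and jsonlite packages. `build_questions_url` and `fetch_questions` are hypothetical helper names introduced for illustration, not functions from any package.

```r
# Build the request URL for the most-voted questions under a tag.
# Assumption: the /2.3/questions endpoint with these query parameters.
build_questions_url <- function(tag = "r", page = 1, pagesize = 50) {
  params <- c(order = "desc", sort = "votes", tagged = tag,
              site = "stackoverflow", page = page, pagesize = pagesize)
  paste0("https://api.stackexchange.com/2.3/questions?",
         paste(names(params), params, sep = "=", collapse = "&"))
}

# Fetch one page and keep only the title and the absolute link.
# Needs network access plus the httr and jsonlite packages.
fetch_questions <- function(tag = "r", page = 1, pagesize = 50) {
  resp <- httr::GET(build_questions_url(tag, page, pagesize))
  httr::stop_for_status(resp)
  parsed <- jsonlite::fromJSON(httr::content(resp, as = "text", encoding = "UTF-8"))
  data.frame(question = parsed$items$title,
             link = parsed$items$link,
             stringsAsFactors = FALSE)
}

# Usage (not run here because it needs network access):
# answer <- fetch_questions()
# head(answer)
```

The API returns absolute links, so no prefix has to be pasted on. Titles come back HTML-escaped, so they may need decoding before display.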

edited Oct 02 '19 at 13:35

answered Oct 02 '19 at 06:33


Thanks. As I explained in the revised question, do you know how to output a clickable hyperlink from R? @QHarr – Jiang Liang Oct 02 '19 at 13:30
I mean a link that is clickable and opens the webpage. I tried writing them to Excel, but it did not work; maybe I did it wrong. – Jiang Liang Oct 02 '19 at 13:50
I don't know how. You can easily write links as text to a CSV, but you need to click into each cell to turn them into actual hyperlinks. – QHarr Oct 02 '19 at 14:13
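
If the output does not have to be a spreadsheet, one possible alternative is to write an HTML file, where links are clickable by design. The sketch below uses only base R. The toy `answer` data frame, the `escape_html` helper, and the output path are assumptions made for illustration.

```r
# Toy stand-in for the scraped data frame (assumption: full URLs already built)
answer <- data.frame(
  question = c("How to make a great R reproducible example",
               "Drop data frame columns by name"),
  link = c("https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example",
           "https://stackoverflow.com/questions/4605206/drop-data-frame-columns-by-name"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: escape characters that are special in HTML
escape_html <- function(x) {
  x <- gsub("&", "&amp;", x, fixed = TRUE)
  x <- gsub("<", "&lt;", x, fixed = TRUE)
  x <- gsub(">", "&gt;", x, fixed = TRUE)
  gsub('"', "&quot;", x, fixed = TRUE)
}

# One list item per question, with the title as the link text
items <- sprintf('<li><a href="%s">%s</a></li>',
                 escape_html(answer$link), escape_html(answer$question))

html <- c("<!DOCTYPE html>",
          "<html><head><meta charset=\"utf-8\"><title>Top R questions</title></head>",
          "<body><ol>", items, "</ol></body></html>")

out_file <- file.path(tempdir(), "questions.html")
writeLines(html, out_file)

# browseURL(out_file)  # uncomment to open the page in the default browser
```

The file opens in any browser, and each question title links to its page.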
