<a class="image teaser-image ng-star-inserted" target="_self" href="/politik/inland/neuwahlen-2022-welche-szenarien-jetzt-realistisch-sind/401773131">

I just want to extract the "href" (for example, from the HTML tag above) so I can concatenate it with the domain name of the website, "https://kurier.at", and web scrape all articles on the home page.

I tried the following code

library(rvest)
library(lubridate)


kurier_wbpg <- read_html("https://kurier.at")

# I just want the "a" tags whose "target" attribute is "_self"

articleLinks <- kurier_wbpg %>%
  html_elements("a") %>%
  html_elements(css = "tag[attribute=_self]") %>%
  html_attr("href") %>%
  paste("https://kurier.at", ., sep = "")

When I execute the above code block up to the html_attr("href") part, the result I get is

character(0)

I think something is wrong with how I am selecting the HTML element. Can someone help me with this?

2 Answers


You need to narrow down your CSS to the second teaser block image, which you can do by using the naming conventions of the classes. You can use url_absolute() to add the domain.

library(rvest)
library(magrittr)

url <- 'https://kurier.at/'
result <- read_html(url) %>% 
  html_element('.teasers-2 .image') %>% 
  html_attr('href') %>% 
  url_absolute(url)
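
For reference, url_absolute() comes from xml2 (rvest re-exports it) and resolves a relative href against the base URL, so the concatenation from the question is handled for you:

url_absolute("/politik/inland", "https://kurier.at/")
#> [1] "https://kurier.at/politik/inland"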

Same principle to get all teasers:

results <- read_html(url) %>% 
  html_elements('.teaser .image') %>% 
  html_attr('href') %>% 
  url_absolute(url)

Not sure if you want the bottom block of 5 included. If so, you can again use classes:

articles <- read_html(url) %>% 
  html_elements('.teaser-title') %>% 
  html_attr('href') %>% 
  url_absolute(url)
– QHarr
  • Thank you very much for the point about using the naming conventions of the classes; I am looking into that now. I used `.teaser-title` and 9 article links appeared, but when I checked with the "SelectorGadget" Chrome extension there should be 132 article links with the selector `.teaser-title`. – manoj rasika Oct 18 '21 at 12:23
  • The additional ones are added dynamically when JavaScript runs in the browser (see the sketch after these comments). – QHarr Oct 18 '21 at 20:37
  • Thank you again for the further explanation; I will try this method. This way I can extract URLs under different tags, but finally I will have to merge them into a single list of URLs in order to feed them to a for loop which I use to extract the date of each article. – manoj rasika Oct 19 '21 at 08:50
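
Since the remaining teasers are injected by JavaScript, plain read_html() will never see them. Below is a minimal sketch of one workaround, assuming the chromote package is installed; the fixed three-second wait is an assumption, and any headless-browser approach would work equally well:

library(rvest)
library(chromote)

b <- ChromoteSession$new()
b$Page$navigate("https://kurier.at/")
b$Page$loadEventFired()   # block until the page's load event fires
Sys.sleep(3)              # crude extra wait for dynamically injected teasers

# Pull the rendered DOM out of the browser and parse it as usual
html <- b$Runtime$evaluate("document.documentElement.outerHTML")$result$value

articles <- read_html(html) %>%
  html_elements(".teaser-title") %>%
  html_attr("href") %>%
  url_absolute("https://kurier.at/")

b$close()

To build the single list of URLs mentioned in the comment above, unique(c(results, articles)) merges the vectors collected from different selectors and drops duplicates before the for loop over the article pages.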

It works with XPath:

library(rvest)

kurier_wbpg <- read_html("https://kurier.at")

articleLinks  <- kurier_wbpg %>% 
  html_elements("a") %>%
  html_elements(xpath = '//*[@target="_self"]') %>%
  html_attr('href') %>%
  paste0("https://kurier.at",.)

articleLinks

# [1] "https://kurier.at/plus"
# [2] "https://kurier.at/coronavirus"
# [3] "https://kurier.at/politik"
# [4] "https://kurier.at/politik/inland"
# [5] "https://kurier.at/politik/ausland"
#...
#...
– Ronak Shah
  • Thank you very much, Ronak, for your support. I tried the XPath and got links to the articles, but also links that are not relevant to articles, such as items 109, 110, 112, and 114 in the results: `[109] "https://kurier.at/chronik"`, `[110] "https://kurier.at/chronik/oberoesterreich"`, `[112] "https://kurier.at/wirtschaft"`, `[114] "https://kurier.at/"`. Now I will have to look for a way to clean this up and keep just the article links (see the sketch below). I understand from your answer how to grab elements with the attribute target="_self". Thanks again. – manoj rasika Oct 19 '21 at 08:40
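
One possible cleanup, assuming kurier.at article URLs end in a numeric id (as in the example /401773131 at the top of the question) while section pages like /chronik do not — this pattern is an assumption, so verify it against your own results:

# Keep only links that end in a numeric article id (assumed URL pattern)
articleOnly <- articleLinks[grepl("/\\d+$", articleLinks)]
articleOnly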