My course assignment is to scrape data from news media and analyse it. This is my first experience of scraping with R, and I have been stuck for several weeks on obtaining the data, working through various guides, all of which end in either limited output or an error.
First, I tried a guide from Analytics Vidhya, and this is the clearest code I have produced so far. I started by scraping a single page from the newspaper's archive:
library(rvest)
library(xml2)
library(dplyr)

url <- 'https://en.trend.az/archive/2021-11-03'
html <- read_html(url)

# 144 articles on this page according to SelectorGadget
headline_html <- html_nodes(html, '.category-article .article-title')
headline <- html_text(headline_html)
# print(headline)
length(headline)
I have tried similar code with other CSS selectors, but I could never obtain more than 9 results.
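For example, a quick side-by-side count of how many nodes a few candidate selectors match looks like this (the selectors below are only examples of the kind of thing I tried, not an exhaustive list):

# count how many nodes each candidate selector matches on the page loaded above
# (these selectors are just examples of ones I experimented with)
selectors <- c('.category-article .article-title',
               '.category-article',
               '.article-title')
sapply(selectors, function(s) length(html_nodes(html, s)))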
I thought the problem might be with the URL, so I decided to scrape a set of subpages covering several days of the archive instead.
This is code based on an answer on Stack Overflow:
library(stringr)

all_df <- list()
arch_date <- seq(as.Date("2021-11-03"), as.Date("2021-11-13"), by = "days")

for (i in arch_date) {
  url_fonq <- str_c('https://en.trend.az', "/archive/", arch_date)
  webpage_fonq <- read_html(url_fonq)
  head(webpage_fonq)

  headline_html <- html_nodes(webpage_fonq, '.category-article .article-title')
  headline <- html_text(headline_html)
  head(headline)
  headline <- str_trim(headline)
  head(headline)
  length(headline)

  # ... (similar commands for the other nodes omitted here)

  fonq.df <- data.frame(Num = row_number,
                        Date = date,
                        Time = time,
                        Title = headline,
                        Category = cat)
  all_df <- bind_rows(all_df, fonq.df)
}
and this is the error that I could not fix:
Error: `x` must be a string of length 1
7. stop("`x` must be a string of length 1", call. = FALSE)
6. read_xml.character(x, encoding = encoding, ..., as_html = TRUE, options = options)
5. read_xml(x, encoding = encoding, ..., as_html = TRUE, options = options)
4. withCallingHandlers(expr, warning = function(w) if (inherits(w, classes)) tryInvokeRestart("muffleWarning"))
3. suppressWarnings(read_xml(x, encoding = encoding, ..., as_html = TRUE, options = options))
2. read_html.default(url_fonq)
1. read_html(url_fonq)
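In case it helps, a stripped-down version of the loop body, with nothing else in it, already hits the same error, so as far as I can tell none of the later lines are involved:

library(rvest)
library(stringr)

arch_date <- seq(as.Date("2021-11-03"), as.Date("2021-11-13"), by = "days")
# str_c() is vectorised over the date sequence, so url_fonq holds one URL per day
url_fonq <- str_c('https://en.trend.az', "/archive/", arch_date)
# this is the call where the "`x` must be a string of length 1" error is raised
webpage_fonq <- read_html(url_fonq)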
Before that, I had tried a more detailed but ambiguous beginners' guide from DataCamp, which also ended in an unresolved error.
library(rvest)
library(tidyverse)  # stringr, purrr, readr, dplyr and tibble are used below

url <- 'https://en.trend.az/archive/2021-11-03'
headline_html <- read_html(url)
get_headline <- function(html){
  html %>%
    # The relevant tag
    html_nodes('.category-article .article-title') %>%
    html_text() %>%
    # Trim additional white space (important)
    str_trim() %>%
    # Convert the list into a vector
    unlist()
}
# ... (similar helper functions for the other nodes, e.g. get_time(), are omitted here)
get_data_table <- function(html, company_name){
  headline <- get_headline(html)
  time <- get_time(html)
  combined_data <- tibble(Abstract = headline,
                          Date = time)
  combined_data %>%
    mutate(Trend.AZ = company_name) %>%
    select(Trend.AZ, Abstract, Date)
}
get_data_from_url <- function(url, company_name){
  html <- read_html(url)
  get_data_table(html, company_name)
}
scrape_write_table <- function(url, company_name){
  url <- "https://en.trend.az"
  arch_date <- seq(as.Date("2021-10-01"), as.Date("2021-11-01"), by = "days")
  list_of_url <- str_c(url, "/archive/", arch_date)
  list_of_url %>%
    map(get_data_from_url, company_name) %>%
    bind_rows() %>%
    write_tsv(str_c(company_name, '.tsv'))
}
scrape_write_table(url, 'Trend.AZ')
# !!! The error was after here !!!
trend_az_tbl <- read_tsv('Trend.AZ.tsv')
tail(trend_az_tbl, 11)
The error:
Error in html_elements(...) : object 'tmp' not found
15. html_elements(...)
14. html_nodes(., ".category-article .article-date")
13. *tmp* %>% html_nodes(".category-article .article-date")
12. get_time(html)
11. get_data_table(html, company_name)
10. .f(.x[[i]], ...)
9. map(., get_data_from_url, company_name)
8. list2(...)
7. bind_rows(.)
6. is.data.frame(x)
5. stopifnot(is.data.frame(x))
4. write_delim(x, file, delim = "\t", na = na, append = append, col_names = col_names, quote = quote, escape = escape, eol = eol, num_threads = num_threads, progress = progress)
3. write_tsv(., str_c(company_name, ".tsv"))
2. list_of_url %>% map(get_data_from_url, company_name) %>% bind_rows() %>% write_tsv(str_c(company_name, ".tsv"))
1. scrape_write_table(url, "Trend.AZ")
I would be extremely grateful for any comments or suggestions on any of these three attempts. I am in a hurry to move on to the analysis part of the project so that I can produce a report by the end of the course.