I hope you are having a good day.
I'm trying to scrape Trustpilot-reviews in the sports-section.
I want four columns with number of reviews, trustscore, subcategories and companynames. There are 43 pages it should iterate over, with 20 companies in each page. After an iteration the data should be placed underneath the previous data. This can be cleaned up afterwards using filtering though. The important part, and what I suspect is my problem is getting everything put together at the end.
The code as-is produce the error "Error in .subset2(x, i, exact = exact) : subscript out of bounds"
If you know anything about this, some pointers on how the code can be corrected would be appreciated.
Here is the code I'm having trouble with:
Trustpilot_company_data <- data.frame()
page_urls = sprintf('https://dk.trustpilot.com/categories/sports?page=%s&status=all', 2:43)
page_urls = c(page_urls, 'https://dk.trustpilot.com/categories/sports?status=all')
for (i in 1:length(page_urls)) {
session <- html_session(page_urls[i])
trustscore_data_html <- html_nodes(session,'.styles_textRating__19_fv')
trustscore_data <- html_text(trustscore_data_html)
trustscore_data <- gsub("anmeldelser","",trustscore_data)
trustscore_data <- gsub("TrustScore","",trustscore_data)
trustscore_data <- as.data.frame(trustscore_data)
trustscore_data <- separate(trustscore_data, col="trustscore_data", sep="·", into=c("antal anmeldelser", "trustscore"))
number_of_reviews<- trustscore_data$`antal anmeldelser`
Trustpilot_company_data[[i]]$number_of_reviews <- trimws(number_of_reviews, whitespace = "[\\h\\v]") %>%
as.numeric(number_of_reviews)
trustscores <- trustscore_data$trustscore
Trustpilot_company_data[[i]]$trustscores <- trimws(trustscores, whitespace = "[\\h\\v]") %>%
as.numeric(trustscores)
subcategories_data_html <- html_nodes(session,'.styles_categories__c4nU-')
subcategories_data <- html_text(subcategories_data_html)
Trustpilot_company_data[[i]]$subcategories_data <- gsub("·",",",subcategories_data)
company_name_data_html <- html_nodes(session,'.styles_businessTitle__1IANo')
Trustpilot_company_data[[i]]$company_name_data <- html_text(company_name_data_html)
Trustpilot_company_data[[i]]$company_name_data <- rep(i,length(Trustpilot_company_data[[i]]$company_name_data))
}
Best regards Anders