
I hope you are having a good day.

I'm trying to scrape Trustpilot reviews in the sports category.

I want four columns: number of reviews, trustscore, subcategories and company names. The code should iterate over 43 pages, with 20 companies on each page. After each iteration the new data should be placed underneath the previous data. It can be cleaned up afterwards with filtering. The important part, and what I suspect is my problem, is getting everything put together at the end.

The code as-is produces the error "Error in .subset2(x, i, exact = exact) : subscript out of bounds"

If you know anything about this, some pointers on how the code can be corrected would be appreciated.

Here is the code I'm having trouble with:

Trustpilot_company_data <- data.frame()
page_urls = sprintf('https://dk.trustpilot.com/categories/sports?page=%s&status=all', 2:43)
page_urls = c(page_urls, 'https://dk.trustpilot.com/categories/sports?status=all')
for (i in 1:length(page_urls)) {
  
  session <- html_session(page_urls[i])

  trustscore_data_html <- html_nodes(session,'.styles_textRating__19_fv')
  trustscore_data <- html_text(trustscore_data_html)
  trustscore_data <- gsub("anmeldelser","",trustscore_data) 
  trustscore_data <- gsub("TrustScore","",trustscore_data)
  trustscore_data <- as.data.frame(trustscore_data)
  trustscore_data <- separate(trustscore_data, col="trustscore_data", sep="·", into=c("antal anmeldelser", "trustscore"))
 
  number_of_reviews <- trustscore_data$`antal anmeldelser`
  Trustpilot_company_data[[i]]$number_of_reviews <- trimws(number_of_reviews, whitespace = "[\\h\\v]") %>% 
    as.numeric(number_of_reviews)

  trustscores <- trustscore_data$trustscore
  Trustpilot_company_data[[i]]$trustscores <- trimws(trustscores, whitespace = "[\\h\\v]") %>% 
    as.numeric(trustscores)

  subcategories_data_html <- html_nodes(session,'.styles_categories__c4nU-')
  subcategories_data <- html_text(subcategories_data_html)

  Trustpilot_company_data[[i]]$subcategories_data <- gsub("·",",",subcategories_data)
  company_name_data_html <- html_nodes(session,'.styles_businessTitle__1IANo')
  Trustpilot_company_data[[i]]$company_name_data <- html_text(company_name_data_html)

  Trustpilot_company_data[[i]]$company_name_data <- rep(i,length(Trustpilot_company_data[[i]]$company_name_data))
}

Best regards Anders

Anders Jørgensen

1 Answer


There seem to be several things going on here.

First, as a rule, growing a data frame inside a loop is not good practice: every step copies everything accumulated so far, so it gets slower as the data grows.

Second, in this case you are adding the new elements one column at a time, which makes things more awkward for you, and you are indexing the data frame as if it were a list of per-page results. So, for example, this isn't going to work:

Trustpilot_company_data[[i]]$number_of_reviews <- trimws(number_of_reviews, whitespace = "[\\h\\v]")

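Trustpilot_company_data is a data frame, so it has rows and columns. To access a particular row and column with [] you write e.g. dat[5, 10] for the fifth row and tenth column of dat. Instead you are using [[i]], which is the syntax for accessing the elements of a list. On a data frame, [[i]] returns the i-th column, and the empty data frame you start with has no columns, which is where "subscript out of bounds" comes from. To address a single cell you'd need to write e.g.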

Trustpilot_company_data[i, "number_of_reviews"]

to access the thing you're trying to get at.
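A small base-R sketch of the difference, using a made-up data frame `dat` (the name and values are illustrative assumptions, not from the question). It shows that `[[` picks a whole column, that `[row, column]` picks a single cell, and that asking an empty data frame for its first column fails in the same way as the code in the question:

```r
# Hypothetical toy data frame, for illustration only
dat <- data.frame(number_of_reviews = c(120, 45, 9),
                  trustscore = c(4.5, 3.9, 4.8))

# [[i]] treats the data frame as a list of columns: this is the first column
first_column <- dat[[1]]

# [row, column] addresses a single cell: second row of "trustscore"
one_cell <- dat[2, "trustscore"]

# An empty data frame has no columns, so [[1]] raises an error;
# tryCatch() captures the message instead of stopping the script
empty <- data.frame()
err <- tryCatch(empty[[1]], error = function(e) conditionMessage(e))
print(err)
```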

Third, doing this one column at a time is a bad idea. If you're going to try to grow a data frame, assemble each new mini-data-frame completely first and then add it to the bottom with rbind(). E.g.,

df <- data.frame()

for(i in 1:5) {
  new_piece <- data.frame(a = i, 
                          b = i, 
                          c = i)  
  df <- rbind(df, new_piece)
}

But fourth and most important, don't grow data frames in this way in the first place. Instead, see for example this answer.
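One common alternative, sketched here under assumptions (it may or may not be what the linked answer recommends), is to collect one complete data frame per page in a list and bind them together once at the end. `scrape_page()` is a hypothetical stand-in for the per-page scraping step and returns made-up values, so the sketch runs without network access:

```r
# Hypothetical stand-in for the per-page scraping step; it returns one
# complete data frame per page (made-up values, no network access)
scrape_page <- function(i) {
  data.frame(page = i,
             company_name = paste0("company_", i, "_", 1:3),
             trustscore = c(4.1, 3.7, 4.9),
             stringsAsFactors = FALSE)
}

# Collect one data frame per page in a list ...
pieces <- lapply(1:5, scrape_page)

# ... and bind them all together once at the end
all_pages <- do.call(rbind, pieces)

print(dim(all_pages))
```

Keeping a `page` column in each piece makes it easy to see afterwards which page a row came from.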

Kieran