I have scraped review data in R but am having trouble splitting it into separate columns. I cannot get the code for columns 8:10 (the last lines of the code) to work. The code is below:

library(xml2)
library(rvest)
library(stringr)
library(tidyr)

reddit_wbpg <- read_html("https://www.tripadvisor.in/Hotel_Review-g304551-d3583700-Reviews-or10-Lemon_Tree_Premier_Delhi_Airport-New_Delhi_National_Capital_Territory_of_Delhi.html")


title <- reddit_wbpg %>%
  html_node("title") %>%
  html_text()  

reviews <- reddit_wbpg %>%
  html_nodes("q.location-review-review-list-parts-ExpandableReview__reviewText--gOmRC") %>%
  html_text()  

user_data1 <- reddit_wbpg %>%
  html_nodes("div.social-member-event-MemberEventOnObjectBlock__event_type--3njyv") %>%
  html_text()

user_data2 <- reddit_wbpg %>%
  html_nodes("div.social-member-MemberHeaderStats__event_info--30wFs") %>%
  html_text()

review_title <- reddit_wbpg %>%
  html_nodes("div.location-review-review-list-parts-ReviewTitle__reviewTitle--2GO9Z") %>%
  html_text()


scraping_data <- data.frame(page_title= title, review_title = review_title, reviews = reviews, user_data1 = user_data1,user_data2 = user_data2)

scraping_data <- cbind(scraping_data,"a","a","a","a","a")
colnames(scraping_data)[6:10] <- c("user_name", "date", "location", "contribution" , "helpful_votes")


scraping_data[,6:7] <-   str_split_fixed(scraping_data$user_data1, " wrote a review", 2)
scraping_data[,8] <- str_extract(scraping_data$user_data2,"^.+?(?=[0-9]+ [hc])")
scraping_data[,9] <- str_extract(scraping_data$user_data2,"[0-9]+(?= contributions)")
scraping_data[,10] <- str_extract(scraping_data$user_data2,"[0-9]+(?= helpful votes)") 

The output can be seen in the attached image:

[Image: Error in Row 1]

Aman
  • Per `r` tag (hover or click to see): Please provide [minimal and reproducible example(s)](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example/5965451) along with the desired output. Use `dput()` for data and specify all non-base packages with `library()` calls. Also, do not use [images](https://meta.stackoverflow.com/questions/285551/why-not-upload-images-of-code-on-so-when-asking-a-question/285557#285557). – Parfait Apr 29 '20 at 14:37

1 Answer

Here's one approach with `str_extract`, using both a negative and a positive lookahead:

scraping_data[,8] <- str_extract(scraping_data$user_data2,"^(?![0-9]+ (con|hel)).+?(?=[0-9]+ (con|hel)|$)")
scraping_data[,9] <- str_extract(scraping_data$user_data2,"[0-9]+(?= contribution)")
scraping_data[,10] <- str_extract(scraping_data$user_data2,"[0-9]+(?= helpful vote)")
scraping_data
#                               user_data1                                     user_data2 user_name date         location contribution helpful_votes
#1 mohd saqibsaqib wrote a review Mar 2020                 2 contributions2 helpful votes         a    a             <NA>            2             2
#2        hitesh k wrote a review Mar 2020                  4 contributions1 helpful vote         a    a             <NA>            4             1
#3          Basant wrote a review Mar 2020                                2 contributions         a    a             <NA>            2          <NA>
#4          RagP65 wrote a review Mar 2020 New Delhi, India9 contributions4 helpful votes         a    a New Delhi, India            9             4
#5          Mbosma wrote a review Mar 2020                                2 contributions         a    a             <NA>            2          <NA>
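To see why the location pattern needed the extra negative lookahead, here is a small sketch that runs the question's pattern and the revised one side by side. It uses four `user_data2` strings copied from the output above; the variable names `original_location` and `revised_location` are only for illustration.

```r
library(stringr)

# Sample strings copied from the user_data2 column in the output above
user_data2 <- c(
  "2 contributions2 helpful votes",
  "4 contributions1 helpful vote",
  "2 contributions",
  "New Delhi, India9 contributions4 helpful votes"
)

# Pattern from the question: `.+?` has to consume at least one character,
# so a row without a location returns the first count text instead of NA
original_location <- str_extract(user_data2, "^.+?(?=[0-9]+ [hc])")

# Revised pattern: the negative lookahead rejects strings that begin with
# a count, so rows without a location return NA
revised_location <- str_extract(
  user_data2,
  "^(?![0-9]+ (con|hel)).+?(?=[0-9]+ (con|hel)|$)"
)

print(data.frame(user_data2, original_location, revised_location))
```

With these inputs the question's pattern returns "2 contributions" and "4 contributions" for the first two rows, while the revised pattern returns NA for every row except the one that starts with "New Delhi, India".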
Ian Campbell
  • 90% of the job looks done, thanks. A couple of errors remain. 1) In row 4, "1 contribution" isn't captured (the code matches only "contributions", so "contribution" needs to be handled as well). 2) I am scraping multiple pages. I tried another page, https://www.tripadvisor.in/Hotel_Review-g304551-d3583700-Reviews-or10-Lemon_Tree_Premier_Delhi_Airport-New_Delhi_National_Capital_Territory_of_Delhi.html, and row 1 is not split properly in that case. Please help. – Aman Apr 29 '20 at 19:20
  • I edited my answer to fix issue 1 by removing the `s` from the look ahead. For issue 2, please edit your question with a reproducible example of the issue. – Ian Campbell Apr 29 '20 at 19:32
  • I have made edits in the code as suggested and attached a snapshot for your reference. – Aman Apr 29 '20 at 19:42
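
The fix for issue 1 discussed in the comments, dropping the `s` from the lookahead, can be checked on a few strings. This is a minimal sketch: the first input string is made up to represent a single contribution, and the variable names are only for illustration.

```r
library(stringr)

# "1 contribution" is a made-up singular case; the other two strings follow
# the shape of the user_data2 values shown in the answer
counts <- c(
  "1 contribution",
  "9 contributions4 helpful votes",
  "4 contributions1 helpful vote"
)

# Lookahead with the trailing "s": misses the singular form
plural_only <- str_extract(counts, "[0-9]+(?= contributions)")

# Lookahead without the "s": matches "contribution" and "contributions"
contribution <- str_extract(counts, "[0-9]+(?= contribution)")

# Same idea for "helpful vote" / "helpful votes"
helpful_votes <- str_extract(counts, "[0-9]+(?= helpful vote)")

print(data.frame(counts, plural_only, contribution, helpful_votes))
```

Because `[0-9]+` matches digits only, a count written with a thousands separator would need a wider character class; whether the site formats counts that way is not something this thread confirms.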