
I am relatively new to web scraping and I am interested in scraping textual data from an online social forum. I was able to scrape the text successfully, but I am unable to organize it and extract specific details from it.

Currently, my code is as follows:

library(tidyverse)
library(rvest)


# Scrape posts 
pages <- 1:32

hardwarezone_list <- list()

for (i in seq_along(pages)) {
  hardwarezone_link <- paste0(
    "https://forums.hardwarezone.com.sg/threads/glgt-you-can-see-that-f-b-jobs-are-really-not-on-top-of-the-minds-of-singaporeans.6404486/",
    "page-", i
  )
  hardwarezone_page <- read_html(hardwarezone_link)
  hardwarezone_list[[i]] <- hardwarezone_page %>%
    html_nodes(".bbWrapper") %>%
    html_text()
}

hardwarezone_table <- do.call(rbind, hardwarezone_list)
hardwarezone_table <- as.data.frame(hardwarezone_table)
#print data example

dput(hardwarezone_table[1:2,c(1,2)])

# output:
structure(list(V1 = c(" https://www.channelnewsasia.com/ne...bs-restaurant-association-13441340?cid=FBcna \n\"You can see that F&B jobs are really not on top of the minds of Singaporeans even when there's high unemployment,\" says a business owner.", 
"I guesss majority prefer to either send food or eat food .. not prepare the food. Haha"
), V2 = c("Recession and retrenchment only happen in EDMW ", 
"\n\t\n\t\t\n\t\t\t\n\t\t\t\ttokong said:\n\t\t\t\n\t\t\n\t\n\t\n\t\t\n\t\t\n\t\t\tno thanks, those people whose pop and mom are hawkers or have been hawkers will know. \nour parents will discourage us to become hawkers. better study hard and get a job.\nf and b jobs generate no values to your cv unless it is the end of the road for you.\nf and b pay is very jialat also. if the salary cannot feed your own family, why take the job?\nthose young punks who go into f and b either has the passion or enjoys the freedom of being not an employee\n\t\t\n\t\tClick to expand...\n\t\n\nyou will be shocked how much hawkers earn. even just those drink stall make kopi, teh kind and get soft drinks, ice from supplier and sell. don't mention bubble tea that one is considered quite artisanal.\nf&b has many positions, les amis executive chef also f&b, waitress also f&b, george quek also f&b. the value of CV is dependent on how a person wanna craft his career path, and not the industry."
)), row.names = 1:2, class = "data.frame")

However, ideally I would like each row/observation to contain the following information, rather than just the post text, which is all my code above collects:

username        post                    date         user status
tegridy_farm    why is that the case.   3/10/2022    banned
Mackey          why                     3/10/2022    Senior member
eric cartman    kyle is bad             3/10/2022    banned

I implemented the nice solution from the answer below, which worked perfectly:

hardwarezone_scraper <- function(page_number) {
    # our base URL
    hardwarezone_link <- "https://forums.hardwarezone.com.sg/threads/glgt-you-can-see-that-f-b-jobs-are-really-not-on-top-of-the-minds-of-singaporeans.6404486/page-{page_number}"

    # read the html for each post
    messages <- read_html(glue::glue(hardwarezone_link)) %>%
        html_nodes(".message-inner")

    # get the information we want
    usernames <- messages %>%
        html_nodes(".message-name") %>%
        html_text()

    user_status <- messages %>% 
        html_nodes(".message-userTitle") %>%
        html_text()

    post_date <- messages %>%
        html_nodes(".listInline") %>%
        html_nodes(".u-dt") %>%
        html_text() %>%
        # example is "Nov 4, 2020"
        parse_date(format = "%b %d, %Y")

    post <- messages %>%
        html_nodes(".bbWrapper") %>%
        html_text()
    # combine into a dataframe and return
    tibble(
        username = usernames,
        post = post,
        date = post_date,
        `user status` = user_status
    )
}
hardwarezone_scraper(1)

2 Answers


You could collect each of your desired attributes (username, post, date, user status) into a separate vector or list, hope that they all end up having the same length, and combine those into a data.frame.
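A minimal sketch of that vector-per-attribute approach, just for illustration (the CSS classes are taken from the thread's current markup and only page 1 is read):

library(rvest)
library(magrittr)

page_1 <- read_html("https://forums.hardwarezone.com.sg/threads/glgt-you-can-see-that-f-b-jobs-are-really-not-on-top-of-the-minds-of-singaporeans.6404486/page-1")

# one vector per attribute; this errors or misaligns if the lengths don't match up
data.frame(
  username = page_1 %>% html_elements(".message-name") %>% html_text(),
  post     = page_1 %>% html_elements(".bbWrapper") %>% html_text2()
)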

A somewhat more robust method is to iterate through every post of every page, collect all the desired details into a named list, and build a list of lists where each item is a post; it is then convenient to turn this structure into a tibble / data.frame:

library(dplyr)
library(purrr)
library(rvest)

url_ <- "https://forums.hardwarezone.com.sg/threads/glgt-you-can-see-that-f-b-jobs-are-really-not-on-top-of-the-minds-of-singaporeans.6404486/"
html <- read_html(url_)

# extract last pagenumber from first page
last_page <- html %>% 
  html_element("ul.pageNav-main > li:last-of-type") %>% 
  html_text() %>% 
  readr::parse_number()

# wrap read_html with slowly() to limit request rate
read_page <- \(page_no) paste0(url_, "page-", page_no) %>% read_html()
slow_read_page <- slowly(read_page)
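# optional tweak (my addition, not from the original answer): slowly() uses
# rate_delay() by default, i.e. roughly a one-second pause between calls;
# to be more polite to the server you can make the pause explicit, e.g.
# slow_read_page <- slowly(read_page, rate = rate_delay(2))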

# extract post details from a single page, return list of posts
parse_page <- function(html_page){
  html_elements(html_page, ".message--post") %>% 
      map(\(msg_post) 
          list( username = html_attr(msg_post, "data-author"),
                post   = html_element(msg_post, "div.message-content div.bbWrapper") %>% html_text2(),
                date   = html_element(msg_post, "time") %>% html_attr("datetime") %>% lubridate::as_datetime(),
                status = html_element(msg_post, ".message-userDetails > .userTitle") %>% html_text()
             )
          )
}

# 1st page is already loaded, 
# load rest of the pages,
# include first page in the list as a 1st item,
# apply parse_page on every loaded page,
# bind to tibble
2:last_page %>% 
  map(slow_read_page, .progress = T) %>% 
  append(list(html), after = 0) %>% 
  map(parse_page) %>% 
  bind_rows()

Resulting 171 × 4 tibble (9 pages, 20 posts per page), timestamps in UTC:

#> # A tibble: 171 × 4
#>    username          post                             date                status
#>    <chr>             <chr>                            <dttm>              <chr> 
#>  1 Tough times ahead "https://www.channelnewsasia.co… 2020-11-03 23:04:27 Banned
#>  2 HeadQuarters      "Recession and retrenchment onl… 2020-11-03 23:05:28 High …
#>  3 Chitchatguy       "We study so hard to get degree… 2020-11-03 23:06:54 Supre…
#>  4 buttbERry         "trend seems like:\n\nf&b not o… 2020-11-03 23:09:11 Senio…
#>  5 Sultana           "Those suitable went for grab..… 2020-11-03 23:11:29 High …
#>  6 jacko123          "yup, f&b business owner here.\… 2020-11-03 23:17:58 Maste…
#>  7 Clearnfc          "Of course. Grab food can earn … 2020-11-03 23:18:52 Banned
#>  8 ramlee            "jacko123 said:\nyup, f&b busin… 2020-11-03 23:19:21 High …
#>  9 Chitchatguy       "jacko123 said:\nyup, f&b busin… 2020-11-03 23:19:43 Supre…
#> 10 madcampus         "F&B all done by Malaysians lia… 2020-11-03 23:19:51 Arch-…
#> # ℹ 161 more rows

Created on 2023-08-04 with reprex v2.0.2
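An optional addition of mine (not part of the original answer): to keep the scraped posts around and save them, assign the same pipeline to a name and write it out; the file name here is just an example.

posts_df <- 2:last_page %>% 
  map(slow_read_page) %>% 
  append(list(html), after = 0) %>% 
  map(parse_page) %>% 
  bind_rows()

readr::write_csv(posts_df, "hardwarezone_posts.csv")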

margusl
  • Thanks, I ran the code and the first part runs well but I received the error message upon running the part after "# 1st page is already loaded" "Error in f(...) : unused argument (.progress = TRUE) " – maldini1990 Aug 04 '23 at 12:22
  • Most likely your purrr version (or your whole tidyverse installation) is a bit older; `map()` has known about `.progress` since purrr 1.0, released at the end of 2022. If you don't feel like updating, you can try removing `.progress`, though you might run into other version-related issues. – margusl Aug 04 '23 at 12:34

Here is how I would do it:

# load the required packages (and if you don't have them installed, then they are installed and loaded automatically)
pacman::p_load(tidyverse, rvest, glue)

hardwarezone_scraper <- function(page_number) {
    # our base URL
    hardwarezone_link <- "https://forums.hardwarezone.com.sg/threads/glgt-you-can-see-that-f-b-jobs-are-really-not-on-top-of-the-minds-of-singaporeans.6404486/page-{page_number}"

    # read the html for each post
    messages <- read_html(glue::glue(hardwarezone_link)) %>%
        html_nodes(".message-inner")

    # get the information we want
    usernames <- messages %>%
        html_nodes(".message-name") %>%
        html_text()

    user_status <- messages %>% 
        html_nodes(".message-userTitle") %>%
        html_text()

    post_date <- messages %>%
        html_nodes(".listInline") %>%
        html_nodes(".u-dt") %>%
        html_text() %>%
        # example is "Nov 4, 2020"
        parse_date(format = "%b %d, %Y")

    post <- messages %>%
        html_nodes(".bbWrapper") %>%
        html_text()
    
    # combine into a dataframe and return
    tibble(
        username = usernames,
        post = post,
        date = post_date,
        `user status` = user_status
    )
}
hardwarezone_scraper(1)

The trick for scraping (if there is a trick) is to use the Chrome (or Firefox) inspector to look at the elements of the webpage and find the identifiers for the elements you want; often similar things have the same class, as was the case here.
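If you're not sure whether a selector you found in the inspector is the right one, a quick sanity check (with the packages loaded above) is to count the matches on a single page and peek at the first few values; this is just an illustration using the same thread and class as in the function:

page <- read_html("https://forums.hardwarezone.com.sg/threads/glgt-you-can-see-that-f-b-jobs-are-really-not-on-top-of-the-minds-of-singaporeans.6404486/page-1")

# the class should match once per post on the page
page %>% html_nodes(".message-name") %>% length()

# and the text should look like usernames, not something else
page %>% html_nodes(".message-name") %>% html_text() %>% head(3)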

Mark
  • Thanks, this works well w/out errors, I tried saving all observations under one DF and added "|> hardwarezone_table<- as.data.frame(hardwarezone_table)" to your code but I received this error message: "Error: The pipe operator requires a function call as RHS " – maldini1990 Aug 04 '23 at 12:30
  • It sounds like you're having trouble with the native pipe, @nesta1992. I removed it and replaced it with the normal magrittr pipe! :-) – Mark Aug 04 '23 at 12:41
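For reference, a minimal sketch of what that could look like with the magrittr pipe; the 1:32 page range is just taken from the question's original loop, so adjust it to the thread's actual page count:

all_pages <- purrr::map_dfr(1:32, hardwarezone_scraper)
hardwarezone_table <- all_pages %>% as.data.frame()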