
I'm looking into a web crawler that will go through multiple indeed.com country URLs.

I have the first part of the code, which crawls through individual pages, below:

library(tidyverse)
library(rvest)
library(xml2)
library(dplyr)
library(stringr)

listings <- data.frame(title = character(),
                       company = character(),
                       stringsAsFactors = FALSE)

for (i in seq(0, 500, 10)) {
  url_ds <- paste0('https://www.indeed.com/jobs?q=data+analyst&l=&radius=25&start=', i)
  var <- read_html(url_ds)

  # job title (str_extract trims the whitespace Indeed pads around the text)
  title <- var %>%
    html_nodes('#resultsCol .jobtitle') %>%
    html_text() %>%
    str_extract("(\\w+.+)+")

  # company
  company <- var %>%
    html_nodes('#resultsCol .company') %>%
    html_text() %>%
    str_extract("(\\w+.+)+")

  listings <- rbind(listings, data.frame(title, company))
}

What I would like to do is also loop through an array of different country URLs at the beginning of "url_ds" above, using the url_basic_list below, and add a column for the actual country. Basically I would need to create a loop within a loop for a text string. What is the best way to do this?

url_basic_list <- c("http://www.indeed.com",
                    "http://www.indeed.com.hk",
                    "http://www.indeed.com.sg")

country <- c("USA",
             "Hong Kong",
             "Singapore")
indy anahh
1 Answer


Two suggestions:

  • Change your for loop to lapply. Iteratively adding rows to a data.frame starts out fine but gets slower and more memory-intensive with each pass through the loop: every rbind has to copy all of the accumulated contents in memory, so your memory needs are at least double the size of the frame. By using lapply, you build a list of data.frames, which is created and filled memory-efficiently (as much as R can do), and then we do a single rbind at the end on the whole dataset (see the minimal sketch after this list).

  • Functionize this, and pass the country code (cc) as a function argument.
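
As a toy illustration of the bind-once pattern (made-up columns, nothing to do with the scraper itself):

# grow-by-rbind: each iteration re-copies everything accumulated so far
out <- data.frame()
for (i in 1:5) out <- rbind(out, data.frame(i = i, sq = i^2))

# lapply + one rbind: small frames are built independently, bound once
out <- do.call(rbind, lapply(1:5, function(i) data.frame(i = i, sq = i^2)))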

get_indeed <- function(cc = "") {
  dotcc <- if (cc == "us") "" else paste0(".", cc)

  listings_list <- lapply(seq(0, 500, by = 10), function(i) {
    url_ds <- sprintf('https://www.indeed.com%s/jobs?q=data+analyst&l=&radius=25&start=%i', dotcc, i)
    var <- read_html(url_ds)

    #job title
    title <- var %>%
      html_nodes('#resultsCol .jobtitle') %>%
      html_text() %>%
      str_extract("(//w+,+)+")

    #company
    company <- var %>%
      html_nodes('#resultsCol .company') %>%
      html_text() %>%
      str_extract("(//w+,+)+")

    data.frame(company, title)
  })
  listings <- do.call(rbind, listings_list)
  listings$cc <- if (nzchar(cc)) cc else ""
  listings
}

From here, to "loop" through a series of countries, one might do

all_countries <- lapply(c("us", "hk", "sg"), get_indeed)
all_countries <- do.call(rbind, all_countries)

From here, all of your $cc values will be the two-letter codes, which is fine. To bring in the full names, I suggest you have a simple data.frame to map one to the other:

countries <- data.frame(
  cc = c("us", "hk", "sg"),
  country = c("USA", "Hong Kong", "Singapore")
)
all_countries <- merge(all_countries, countries, by = "cc")

And your data will now have both $cc (two-letter) and $country (full words).
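
Alternatively (just a sketch, reusing get_indeed() from above), if you would rather attach the full country name inside the loop instead of merging afterward, Map() walks the two vectors in parallel:

ccs       <- c("us", "hk", "sg")
full_name <- c("USA", "Hong Kong", "Singapore")

# Map() pairs ccs[i] with full_name[i]; each call returns one country's frame
all_countries <- do.call(rbind, Map(function(cc, country) {
  out <- get_indeed(cc)
  out$country <- country
  out
}, ccs, full_name))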

r2evans
  • Thanks, I'm getting an error: "Error in open.connection(x, 'rb')" – indy anahh Mar 03 '21 at 07:05
  • "SNI or certificate failed: SEC_E_WRONG_PRINCIPAL (0x80090322) - the target is incorrect" Any thoughts? – indy anahh Mar 03 '21 at 07:07
  • If you look up that error (it's not uncommon), a common cause is that your client (browser or `httr`) does not trust the certificate. If you go to `indeed.com.hk` *yourself* and examine its certificate, you should notice that the SSL certificate is meant for `*.parkingcrew.net`, suggesting that indeed.com.hk is either not being used properly or is being mis-routed. – r2evans Mar 03 '21 at 13:22
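
A defensive follow-up (my sketch, not from the comments above): wrapping each country's fetch in tryCatch() lets the rest of the run survive a host whose certificate fails, reusing get_indeed() from the answer:

safe_get_indeed <- function(cc = "") {
  tryCatch(
    get_indeed(cc),
    error = function(e) {
      message("skipping '", cc, "': ", conditionMessage(e))
      NULL  # do.call(rbind, ...) ignores NULL list entries
    }
  )
}

all_countries <- do.call(rbind, lapply(c("us", "hk", "sg"), safe_get_indeed))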