I'm building a web crawler that will go through multiple indeed.com country URLs.
I have the first part of the code, which crawls through the individual result pages, below:
library(tidyverse)
library(rvest)
library(xml2)
library(dplyr)
library(stringr)
listings <- data.frame(title = character(),
company = character(),
stringsAsFactors = FALSE)
for(i in seq(0,500,10)){
url_ds<-paste0('https://www.indeed.com/jobs?q=data+analyst&l=&radius=25&start=',i)
var <-read_html(url_ds)
#job title
title<- var %>%
html_nodes('#resultsCol .jobtitle') %>%
html_text() %>%
str_extract("(\\w+.+)+")
#company
company<- var %>%
html_nodes('#resultsCol .company') %>%
html_text() %>%
str_extract("(\\w+.+)+")
listings<-rbind(listings, as.data.frame(cbind(company,
title)))
}
What I would like to do is also loop through a vector of the different country base URLs, substituting each one for the start of "url_ds" above (using the url_basic_list below), and add a column for the corresponding country. Basically, I need a loop within a loop over a vector of strings. What is the best way to do so?
url_basic_list<-
c("http://www.indeed.com",
"http://www.indeed.com.hk",
"http://www.indeed.com.sg"
)
country<-
c("USA",
"Hong Kong",
"Singapore"
)
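To make the goal concrete, here is a minimal sketch of the loop within a loop described above, not a tested crawler. It assumes the Hong Kong and Singapore sites accept the same "/jobs?..." path and query string as the US site, and that the same CSS selectors apply there. The helper names build_url, scrape_page and crawl_all are hypothetical, and str_trim() stands in for the regex cleanup used earlier.

```r
library(rvest)
library(stringr)

url_basic_list <- c("http://www.indeed.com",
                    "http://www.indeed.com.hk",
                    "http://www.indeed.com.sg")
country <- c("USA", "Hong Kong", "Singapore")

# Hypothetical helper: build the search URL for one base URL and one page offset.
# Assumes every country site uses the same path and query parameters.
build_url <- function(base, start) {
  paste0(base, "/jobs?q=data+analyst&l=&radius=25&start=", start)
}

# Hypothetical helper: scrape one results page and tag each row with its country.
scrape_page <- function(url, country_name) {
  page <- read_html(url)
  title <- str_trim(html_text(html_nodes(page, "#resultsCol .jobtitle")))
  company <- str_trim(html_text(html_nodes(page, "#resultsCol .company")))
  # Assumption: titles and companies appear in the same order; truncate to the
  # shorter vector so the columns have equal length.
  n <- min(length(title), length(company))
  data.frame(title = title[seq_len(n)],
             company = company[seq_len(n)],
             country = rep(country_name, n),
             stringsAsFactors = FALSE)
}

# Outer loop over countries, inner loop over page offsets.
crawl_all <- function(starts = seq(0, 500, 10)) {
  results <- list()
  for (j in seq_along(url_basic_list)) {
    for (i in starts) {
      url_ds <- build_url(url_basic_list[j], i)
      results[[length(results) + 1]] <- scrape_page(url_ds, country[j])
    }
  }
  # Bind once at the end instead of growing a data frame inside the loop.
  do.call(rbind, results)
}
```

Calling listings <- crawl_all() would then issue one request per country and page offset. Indexing both vectors with the same j is what keeps each base URL paired with its country label.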