Let's start by loading the libraries we will need:
library(rvest)      # HTML download and parsing
library(tidyverse)  # pipes (%>%) and general data wrangling
library(stringr)    # string matching helpers
Then, we can open the desired page and extract all links:
# read the seed page and collect the href attribute of every <a> link on it
url <- "https://www.maritime-database.com/company.php?cid=66304"
webpage <- read_html(url)
urls <- webpage %>% html_nodes("a") %>% html_attr("href")
Let's take a look at what we uncovered...
> head(urls,100)
[1] "/" "/areas/"
[3] "/countries/" "/ports/"
[5] "/ports/topports.php" "/addcompany.php"
[7] "/aboutus.php" "/activity.php?aid=28"
[9] "/activity.php?aid=9" "/activity.php?aid=16"
[11] "/activity.php?aid=24" "/activity.php?aid=27"
[13] "/activity.php?aid=29" "/activity.php?aid=25"
[15] "/activity.php?aid=5" "/activity.php?aid=11"
[17] "/activity.php?aid=19" "/activity.php?aid=17"
[19] "/activity.php?aid=2" "/activity.php?aid=31"
[21] "/activity.php?aid=1" "/activity.php?aid=13"
[23] "/activity.php?aid=23" "/activity.php?aid=18"
[25] "/activity.php?aid=22" "/activity.php?aid=12"
[27] "/activity.php?aid=4" "/activity.php?aid=26"
[29] "/activity.php?aid=10" "/activity.php?aid=14"
[31] "/activity.php?aid=7" "/activity.php?aid=30"
[33] "/activity.php?aid=21" "/activity.php?aid=20"
[35] "/activity.php?aid=8" "/activity.php?aid=6"
[37] "/activity.php?aid=15" "/activity.php?aid=3"
[39] "/africa/" "/centralamerica/"
[41] "/northamerica/" "/southamerica/"
[43] "/asia/" "/caribbean/"
[45] "/europe/" "/middleeast/"
[47] "/oceania/" "company-contact.php?cid=66304"
[49] "http://www.quadrantplastics.com" "/company.php?cid=313402"
[51] "/company.php?cid=262400" "/company.php?cid=262912"
[53] "/company.php?cid=263168" "/company.php?cid=263424"
[55] "/company.php?cid=67072" "/company.php?cid=263680"
[57] "/company.php?cid=67328" "/company.php?cid=264192"
[59] "/company.php?cid=67840" "/company.php?cid=264448"
[61] "/company.php?cid=264704" "/company.php?cid=68352"
[63] "/company.php?cid=264960" "/company.php?cid=68608"
[65] "/company.php?cid=265216" "/company.php?cid=68864"
[67] "/company.php?cid=265472" "/company.php?cid=200192"
[69] "/company.php?cid=265728" "/company.php?cid=69376"
[71] "/company.php?cid=200448" "/company.php?cid=265984"
[73] "/company.php?cid=200704" "/company.php?cid=266240"
After some inspection, we find that we are only interested in URLs that start with /company.php.
Let's figure out how many of them there are and create a placeholder list for our results:
# count the hrefs that start with /company.php (na.rm guards against <a> tags without an href)
numcompanies <- sum(str_detect(urls, '^/company\\.php'), na.rm = TRUE)
mylist <- vector("list", numcompanies)
We find that there are 40034 company URLs we need to scrape. This will take a while...
> numcompanies
[1] 40034
Now it's just a matter of looping through the matching URLs one by one and saving the text.
i <- 0
for (u in urls) {
  # skip NA hrefs and anything that is not a company page
  if (!is.na(u) && str_detect(u, '^/company\\.php')) {
    Sys.sleep(1)  # pause between requests to avoid hammering the server
    i <- i + 1
    # hrefs are relative, so prepend the site's base URL
    companypage <- read_html(paste0('https://www.maritime-database.com', u))
    cat(paste('page nr', i, '; saved text from: ', u, '\n'))
    # the company details sit in elements with class "txt"
    text <- companypage %>%
      html_nodes('.txt') %>%
      html_text()
    names(mylist)[i] <- u
    mylist[[i]] <- text
  }
}
In the loop above, we have taken advantage of the observation that the info we want always has class="txt"
(see screenshot below).
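Before launching the full loop, we can preview that selector on the seed page we already parsed (a quick sanity check; this assumes the seed company page uses the same layout as the other company pages):
# preview the text held in elements with class "txt"
webpage %>% html_nodes('.txt') %>% html_text() %>% head()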
With the one-second pause between requests (plus the time to actually download each page), scraping all 40,034 pages will take at least 11 hours.
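The arithmetic behind that lower bound, using numcompanies from above:
# one second of sleep per request, converted to hours
numcompanies / 3600   # roughly 11.1 hours, before adding download time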
Also, keep the ethics of web scraping in mind: check the site's robots.txt and terms of use, and keep the request rate modest.
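Finally, since a run this long can easily be interrupted, it may be worth persisting the results to disk once the loop finishes; a minimal sketch using base R (the file name here is arbitrary):
# save the scraped text so an interruption doesn't cost another full run
saveRDS(mylist, "company_texts.rds")
# ...and restore it later with
mylist <- readRDS("company_texts.rds")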
