
I want to extract data from the different hyperlinks on this web page.

I was using the following code to extract the table from that page:

url <- "https://www.maritime-database.com/company.php?cid=66304"
webpage<-read_html(URL)

df <- 
webpage %>% 
html_node("table") %>% 
html_table(fill=TRUE)

With this code I was able to extract the table containing all the hyperlinks, but I have no idea how to extract the data from each of those hyperlinks.
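
For reference, a minimal sketch of how the link targets themselves could be pulled from the same page (assuming the hyperlinks are ordinary <a> tags inside the table; html_table() returns only the cell text, not the href attributes):

links <- webpage %>%
  html_nodes("table a") %>%   # assumes the links sit in table cells as <a> tags
  html_attr("href")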

For example, for this link I want to extract the data shown in the figure: [screenshot: data from the link provided in the example]



1 Answer


Let's start by loading the libraries we will need:

library(rvest)
library(tidyverse)
library(stringr)

Then, we can open the desired page and extract all links:

url <- "https://www.maritime-database.com/company.php?cid=66304"
webpage<-read_html(url)
urls <- webpage %>% html_nodes("a") %>% html_attr("href")

Let's take a look at what we uncovered...

> head(urls,100)
  [1] "/"                               "/areas/"
  [3] "/countries/"                     "/ports/"
  [5] "/ports/topports.php"             "/addcompany.php"
  [7] "/aboutus.php"                    "/activity.php?aid=28"
  [9] "/activity.php?aid=9"             "/activity.php?aid=16"
 [11] "/activity.php?aid=24"            "/activity.php?aid=27"
 [13] "/activity.php?aid=29"            "/activity.php?aid=25"
 [15] "/activity.php?aid=5"             "/activity.php?aid=11"
 [17] "/activity.php?aid=19"            "/activity.php?aid=17"
 [19] "/activity.php?aid=2"             "/activity.php?aid=31"
 [21] "/activity.php?aid=1"             "/activity.php?aid=13"
 [23] "/activity.php?aid=23"            "/activity.php?aid=18"
 [25] "/activity.php?aid=22"            "/activity.php?aid=12"
 [27] "/activity.php?aid=4"             "/activity.php?aid=26"
 [29] "/activity.php?aid=10"            "/activity.php?aid=14"
 [31] "/activity.php?aid=7"             "/activity.php?aid=30"
 [33] "/activity.php?aid=21"            "/activity.php?aid=20"
 [35] "/activity.php?aid=8"             "/activity.php?aid=6"
 [37] "/activity.php?aid=15"            "/activity.php?aid=3"
 [39] "/africa/"                        "/centralamerica/"
 [41] "/northamerica/"                  "/southamerica/"
 [43] "/asia/"                          "/caribbean/"
 [45] "/europe/"                        "/middleeast/"
 [47] "/oceania/"                       "company-contact.php?cid=66304"
 [49] "http://www.quadrantplastics.com" "/company.php?cid=313402"
 [51] "/company.php?cid=262400"         "/company.php?cid=262912"
 [53] "/company.php?cid=263168"         "/company.php?cid=263424"
 [55] "/company.php?cid=67072"          "/company.php?cid=263680"
 [57] "/company.php?cid=67328"          "/company.php?cid=264192"
 [59] "/company.php?cid=67840"          "/company.php?cid=264448"
 [61] "/company.php?cid=264704"         "/company.php?cid=68352"
 [63] "/company.php?cid=264960"         "/company.php?cid=68608"
 [65] "/company.php?cid=265216"         "/company.php?cid=68864"
 [67] "/company.php?cid=265472"         "/company.php?cid=200192"
 [69] "/company.php?cid=265728"         "/company.php?cid=69376"
 [71] "/company.php?cid=200448"         "/company.php?cid=265984"
 [73] "/company.php?cid=200704"         "/company.php?cid=266240"

After some inspection, we find that we are only interested in the URLs that start with /company.php.

Let's then figure out how many of them there are, and create a placeholder list for our results:

numcompanies <- length(which(!is.na(str_extract(urls, '/company.php'))))
mylist <- vector("list", numcompanies)

We find that there are 40034 company urls we need to scrape. This will take a while...

> numcompanies
[1] 40034
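
As a side note (just a sketch, not the approach used below), the same matching links could be pulled into their own vector up front with str_detect, which would let the loop iterate over the company links directly:

company_urls <- urls[str_detect(urls, '/company.php')]   # same pattern as above
length(company_urls)   # should equal numcompanies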

Now, it's just a matter of looping through each matching url one by one, and saving the text.

i <- 0
for (u in urls) {
  # only follow the company links
  if (!is.na(str_match(u, '/company.php'))) {
    Sys.sleep(1)   # throttle requests so we don't hammer the server
    i <- i + 1

    companypage <- read_html(paste0('https://www.maritime-database.com', u))
    cat(paste('page nr', i, '; saved text from: ', u, '\n'))

    # the company details all live in elements with class="txt"
    text <- companypage %>%
      html_nodes('.txt') %>%
      html_text()

    names(mylist)[i] <- u
    mylist[[i]] <- text
  }
}

In the loop above, we have taken advantage of the observation that the info we want always has class="txt" (see screenshot below).

Assuming that opening a page takes around 1 second, scraping all pages will take approximately 11 hours.
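
Since a run that long can easily hit a timeout or a malformed page, it may be worth wrapping the read_html() call in tryCatch so a single failed request does not abort the whole loop (a sketch, not part of the original answer; it would replace the read_html line inside the loop):

# skip pages that fail to load instead of stopping the run
companypage <- tryCatch(
  read_html(paste0('https://www.maritime-database.com', u)),
  error = function(e) NULL
)
if (is.null(companypage)) next   # move on to the next url
# (you would then want to increment i only after a successful read)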

Also, keep in mind the ethics of web scraping.

[screenshot: company page source showing the info inside elements with class="txt"]

  • Thanks a lot for your answer. Is there any way to convert the extracted data to .tsv or .csv? – xyz May 19 '20 at 10:46
  • Hi! Glad to have helped! If you find my answer satisfactory, could I ask you to approve it? – Otto Kässi May 19 '20 at 11:00
  • Re: writing output to a text file -- there are various SO answers on that. See, e.g., https://stackoverflow.com/questions/19330949/r-how-to-save-lists-into-csv or https://stackoverflow.com/questions/48120782/r-write-list-to-csv-line-by-line – Otto Kässi May 19 '20 at 11:01
  • I have added an additional Sys.sleep(1) to my answer to throttle down the number of requests made to the website a bit – Otto Kässi May 19 '20 at 12:01
  • Can we directly save the extracted URL data as tsv or csv? – xyz May 19 '20 at 13:07
  • For the urls: write.csv(names(mylist), file='urls.txt'); for the contents: write.csv(unlist(mylist), file='contents.txt') – Otto Kässi May 19 '20 at 13:17
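
Expanding the two one-liners from the comments, the scraped results could be written out as a single two-column file (a sketch; collapsing each page's text vector into one string with paste() is an assumption about the desired output format):

# one row per company page, its text fields collapsed into one string
results <- data.frame(
  url  = names(mylist),
  text = sapply(mylist, paste, collapse = ' | '),
  stringsAsFactors = FALSE
)
write.csv(results, file = 'companies.csv', row.names = FALSE)
# tab-separated instead:
# write.table(results, file = 'companies.tsv', sep = '\t', row.names = FALSE)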