How to extract data from HTML into R

Question

I have a link to a page that contains a table. The first thing I tried was looking for a download button, and unfortunately there isn't one. Then I tried to use the XML package in R to fetch the data between specific nodes and build a data frame myself.

To do this I need to know which node (or HTML tag) I want to extract. So I right-clicked the page in the browser and found the tag that contains the table I want.

The content of the table starts at the fieldset with id="result". In the browser we can also see that the first row of the table is <li class="vesselResultEntry removeBackground">.

But when I used R to download this HTML, I found that all the <li> tags belonging to the table are gone, replaced by <li class="toRemove"/>. Here is my R code:

library(XML)
url <- "http://www.fao.org/figis/vrmf/finder/search/#stats"
webpage <- readLines(url)
htmlpage <- htmlParse(webpage, asText = TRUE)
data <- xpathSApply(htmlpage, "//ul[@id='searchResultsContainer']")
data

# <ul id="searchResultsContainer" class="clean resultsContainer"><li class="toRemove"></li></ul>

All the code does is check whether I can fetch the content of a specific tag. Clearly the rows I want are not in the object (webpage) I saved.

So my questions are:

Is there a way to download the table I want by any means (ideally in R)?

Is there some kind of protection on this website that prevents me from downloading the whole HTML as a text file and fetching the data from it?

Any suggestions would be much appreciated.

Seems like a duplicate of http://stackoverflow.com/questions/23028760/download-a-file-from-https-using-download-file — Ouroborus, Jan 19 '16 at 03:59
Look into using xPath, which is a language-independent way to query an XML structure. By the way, you never told us what you're actually after here. — Tim Biegeleisen, Jan 19 '16 at 04:04
I'm actually trying to download the whole table content you see at that link. If it is not downloadable, I would like to fetch the data by specifying the tag names. But it seems that when I save the HTML as a text file, the tags that contain the rows of that table are gone. — Lambo, Jan 19 '16 at 04:09
@Ouroborus, thanks for the example, but the link I provided here is not a link to a shared CSV file. It is just a link to a web page, so I'm not sure that approach works. — Lambo, Jan 19 '16 at 04:12

score 2 · Accepted Answer · answered Jan 19 '16 at 04:13

The page you're trying to fetch is assembled dynamically on the browser side on load. The content you get by directly fetching the url does not contain the data you see when you view the DOM. That data is loaded later from a separate URL.
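As a rough check of this, the markup that R received can be parsed on its own and searched for result rows. The sketch below reuses the XML package from the question and the output string the question reports. The names static_html, li_classes and rows are only illustrative.

```r
library(XML)

# Markup the server returned, copied from the output shown in the question.
static_html <- '<ul id="searchResultsContainer" class="clean resultsContainer"><li class="toRemove"></li></ul>'
doc <- htmlParse(static_html, asText = TRUE)

# Class attribute of every <li> directly inside the results container.
li_classes <- xpathSApply(doc, "//ul[@id='searchResultsContainer']/li",
                          xmlGetAttr, "class")
li_classes

# Rows seen in the browser carry the class "vesselResultEntry".
rows <- getNodeSet(doc, "//li[contains(@class, 'vesselResultEntry')]")
length(rows)
```

The only <li> found is the placeholder, and no node matches the row class, which is consistent with the rows being added by a script after the page loads.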

I took a look and the URL in question is:

http://www.fao.org/figis/vrmf/finder/services/public/vessels/search?c=true&gd=true&nof=false&not=false&nol=false&ps=30&o=0&user=NOT_SET

I'm not sure what most of the query string means, but it's clear that ps is "page size" and o is "offset". Page size seems to cap at 200; any larger value is forced back to 30. The URL returns JSON, so you'll need some way to parse that. The data embedded in the responses says there are 231047 entries, so you'll have to make multiple requests to get it all.
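A minimal sketch of that paging in R, under a few assumptions: that o counts records rather than pages, that the endpoint still responds (the URL dates from 2016), and that the jsonlite package is installed. The helpers page_url and fetch_page are hypothetical names, and the shape of the JSON is not described here, so it should be inspected before relying on any field.

```r
# Base URL and fixed query parameters copied from the URL above.
base_url <- "http://www.fao.org/figis/vrmf/finder/services/public/vessels/search"

# Hypothetical helper: URL for one page of results.
# sprintf("%d") keeps large offsets out of scientific notation.
page_url <- function(offset, page_size = 200L) {
  sprintf("%s?c=true&gd=true&nof=false&not=false&nol=false&ps=%d&o=%d&user=NOT_SET",
          base_url, as.integer(page_size), as.integer(offset))
}

total     <- 231047L  # entry count reported above
page_size <- 200L     # largest page size that appears to be honoured
offsets   <- seq(0L, total - 1L, by = page_size)
length(offsets)       # number of requests needed to cover every entry

# Hypothetical helper: download and parse one page (needs network access).
fetch_page <- function(offset) {
  jsonlite::fromJSON(page_url(offset, page_size))
}

# Not run: look at the structure of one page before assuming field names.
# first <- fetch_page(offsets[1])
# str(first, max.level = 2)
```

If the full set were fetched this way, a pause between requests (for example Sys.sleep(1) inside the loop) would keep the load on the server low.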

Data providers usually do not appreciate people scraping their data like that. You might want to look around for a downloadable version.

Thanks @Ouroborus. It's really good to know there is a way to see the data behind this website. But I think I will stop exploring here; 231047 entries is just too many. — Lambo, Jan 20 '16 at 02:02
