
I have a problem similar to Scraping a web page, links on a page, and forming a table with R. I would have posted this as a comment on that question, but I don't have enough reputation yet.

I have the following code:

library(rvest)

## Import web page
FAO_Countries <- read_html("http://www.fao.org/countryprofiles/en/")

## Extract the URLs I am interested in, found with 'selectorgadget'
FAO_Countries_urls <- FAO_Countries %>% 
  html_nodes(".linkcountry") %>% 
  html_attr("href")

## Extract the link texts I am interested in, found with 'selectorgadget'
FAO_Countries_links <- FAO_Countries %>%
  html_nodes(".linkcountry") %>% 
  html_text()

## Create a data frame from the two previous objects
FAO_Countries_data <- data.frame(FAO_Countries_links = FAO_Countries_links, 
                                 FAO_Countries_urls = FAO_Countries_urls, 
                                 stringsAsFactors = FALSE)

At this point, I would like to retrieve the text from the URLs I obtained, add it as a new column on the right, and do the same for other items I need. Nevertheless, when I run

FAO_Countries_data_text <- FAO_Countries_data$FAO_Countries_urls %>%
html_nodes("#foodSecurity-1") %>%
html_text()

I get the following error message:

Error in UseMethod("xml_find_all") : 
no applicable method for 'xml_find_all' applied to an object of class "character"

In other words, I cannot scrape the pages whose URLs are stored in the newly created data frame.
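The error arises because `html_nodes()` expects a parsed document (the result of `read_html()`), not a character vector of URL strings. A minimal sketch of the distinction, using an inline HTML snippet rather than the live FAO page:

```r
library(rvest)

# read_html() parses HTML into an xml_document; html_nodes() works on that
page <- read_html('<div id="foodSecurity-1"><h3><a>Food security</a></h3></div>')
page %>% html_nodes("#foodSecurity-1 h3 a") %>% html_text()
#> [1] "Food security"

# By contrast, a column of a data frame is just a character vector,
# which is what triggers "no applicable method for 'xml_find_all'
# applied to an object of class \"character\"":
class("/countryprofiles/index/en/?iso3=AFG")
#> [1] "character"
```

So each URL would first need to be turned into a full address (e.g. with `paste0("http://www.fao.org", ...)`) and passed through `read_html()` before `html_nodes()` can be applied.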

Now, I have a dataframe that appears as follows:

> head(FAO_Countries_data, n=3)
  FAO_Countries_links                  FAO_Countries_urls
1         Afghanistan /countryprofiles/index/en/?iso3=AFG
2             Albania /countryprofiles/index/en/?iso3=ALB
3             Algeria /countryprofiles/index/en/?iso3=DZA

I would like to expand this data frame by adding columns with information present in the various URLs, e.g.:

  FAO_Countries_links                  FAO_Countries_urls  Food_security
1         Afghanistan /countryprofiles/index/en/?iso3=AFG Family farming
  • You provide `html_nodes` with a character vector, but it expects either a document, a node set or a single node. It's unclear what you mean by _" the text from the urls"_, as you got the anchor texts already in `FAO_Countries_data$FAO_Countries_links` (?). – lukeA Dec 07 '16 at 12:20
  • Thanks for the reply. Nevertheless, I don't get why I provide html_nodes() with a character vector. – Ileeo Dec 07 '16 at 13:13
  • `FAO_Countries_data$FAO_Countries_urls` is a character vector (a bunch of strings), not a node set (a special xml object). Not much to get about it. So do you mind to tell what you mean by "the text from the urls"? Or, in other words, what do you want your final result to look like? – lukeA Dec 07 '16 at 13:33
  • Ok, now I have two columns in my dataframe. One for the link, the other for the urls. In any urls page there are some parts whose text I would like to extrapolate and put into a column close to the ones I already have. For instance, in any country there is a sector dedicated to food security whose text I would like to add to my dataset. I hope I made myself clear. Thanks a lot – Ileeo Dec 08 '16 at 08:18
  • Ok, so you want to `read_html` each of your links and extract further info from there. What would be the text for e.g. "#foodSecurity-1" in "http://www.fao.org/countryprofiles/index/en/?iso3=ECU"? – lukeA Dec 08 '16 at 10:55
  • The text would be - for each and every country - "#foodSecurity-1 a" (i.e. the Food security and safety paragraph) – Ileeo Dec 08 '16 at 11:11
  • What value would that be for e.g. for the ECU link? – lukeA Dec 08 '16 at 11:13
  • I guess the value would be `form` – Ileeo Dec 08 '16 at 12:59
  • If I open http://www.fao.org/countryprofiles/index/en/?iso3=ECU and view its source, then I see no "form" value. – lukeA Dec 08 '16 at 13:19
  • You might refer to `#foodSecurity-1 h3 a` – Ileeo Dec 08 '16 at 13:30
  • Voting to close, as this leads to nothing. Edit your post and add the desired output data frame (or part of it). – lukeA Dec 08 '16 at 13:59
  • You might want to consider using RSelenium as the pages load these data dynamically using javascript. – lukeA Dec 08 '16 at 14:22
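As the last comment notes, the food-security text is loaded by JavaScript, so a plain `read_html()` sees an empty container. A sketch of the RSelenium route suggested there; it requires a Selenium driver and browser on the machine, and the browser name and wait time are assumptions:

```r
library(RSelenium)
library(rvest)

# Start a Selenium-driven browser (assumes a local driver is available)
rD <- rsDriver(browser = "firefox", verbose = FALSE)
remDr <- rD$client

remDr$navigate("http://www.fao.org/countryprofiles/index/en/?iso3=ECU")
Sys.sleep(5)  # give the JavaScript time to render the content

# getPageSource() returns the DOM after the scripts have run, so the
# dynamically loaded sections are present for rvest to parse
rendered <- read_html(remDr$getPageSource()[[1]])
food_security <- rendered %>% html_nodes("#foodSecurity-1 h3 a") %>% html_text()

remDr$close()
rD$server$stop()
```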

1 Answer


Using the code below, I am able to extract the text of the "newsItems", "gsa-publications" and "projectsCountry" sections for 5 countries:

library(stringr)
library(rvest)
library(RDCOMClient)

## Import web page
FAO_Countries <- read_html("http://www.fao.org/countryprofiles/en/")
FAO_Countries_urls <- FAO_Countries %>% html_nodes(".linkcountry") %>% html_attr("href")
FAO_Countries_links <- FAO_Countries %>% html_nodes(".linkcountry") %>% html_text()
FAO_Countries_data <- data.frame(FAO_Countries_links = FAO_Countries_links, 
                                 FAO_Countries_urls = FAO_Countries_urls, stringsAsFactors = FALSE)

url <- paste0("http://www.fao.org", FAO_Countries_data$FAO_Countries_urls) 

## RDCOMClient drives Internet Explorer through COM (Windows only), so
## the JavaScript-rendered content is available once each page has loaded
IEApp <- COMCreate("InternetExplorer.Application")
IEApp[['Visible']] <- TRUE
list_News_Text <- list()
list_GSA_Publication <- list()
list_ProjectsCountry <- list()

for (i in 1:5)
{
  print(i)
  IEApp$Navigate(url[i])
  
  ## Wait for the page and its JavaScript to finish loading
  Sys.sleep(10)
  
  doc <- IEApp$Document()
  html_Content <- doc$documentElement()$innerText()
  
  ## Grab each section by its element id and keep its rendered text
  web_Obj <- doc$getElementByID("newsItems")
  list_News_Text[[i]] <- web_Obj$innerText()
  web_Obj <- doc$getElementByID("gsa-publications")
  list_GSA_Publication[[i]] <- web_Obj$innerText()
  web_Obj <- doc$getElementByID("projectsCountry")
  list_ProjectsCountry[[i]] <- web_Obj$innerText()
}

print(list_News_Text)

You can use a similar approach to extract other items from the different web pages.
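To merge the scraped text back into the original data frame, the lists can be flattened into columns. A small self-contained sketch with stand-in values (the real lists come from the loop above; the `safe_unlist` helper and the sample values are illustrative):

```r
# Stand-ins for the lists filled by the scraping loop; a NULL can appear
# when a section is missing on a page, so guard with a default before unlisting
list_News_Text <- list("news A", "news B", NULL, "news D", "news E")

safe_unlist <- function(x, default = NA_character_) {
  vapply(x, function(el) if (is.null(el)) default else el, character(1))
}

FAO_subset <- data.frame(
  FAO_Countries_links = c("Afghanistan", "Albania", "Algeria", "Andorra", "Angola"),
  stringsAsFactors = FALSE
)
# NULL entries become NA, keeping the column the same length as the data frame
FAO_subset$News_Text <- safe_unlist(list_News_Text)
```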

Emmanuel Hamel