library(rvest)

urls <- c("https://www.r-bloggers.com", "https://www.stackoverflow.com")

docsFor <- list()
for(url in urls){
  docsFor[[url]] <- read_html(url)
}
docsFor[[1]]

Question: How can I do the same with sapply/vapply?

Using sapply(urls, read_html) does not work, so I tried vapply. If I am correct, I would need something like externalptr(0), but I am not sure that exists.

This doesn't work, presumably because there is no externalptr():

docs <- vapply(urls, read_html, FUN.VALUE = list(externalptr(0), externalptr(0)))
docs[[1]]
Tlatwork
  • What about `docsFor <- lapply(urls, read_html)`? – Vitali Avagyan Sep 22 '19 at 15:10
  • *will not work* ... what is the error or undesired result? sapply/vapply are used to return simplified objects (atomic vectors, matrices, arrays), not non-simplified objects like data frames and lists. By the way, `read_html` comes from `xml2`, not `rvest`, and its return value is a complex type. – Parfait Sep 22 '19 at 15:18
  • Oops, yes, that works, thanks. I tried `docs <- vapply(urls, read_html, FUN.VALUE = list())`, but that didn't work out. Thanks again! – Tlatwork Sep 22 '19 at 15:18
  • `sapply` and `lapply` should be perfectly suited for this. You'll have to elaborate a bit more on what exactly the problem is with this approach as you see it :) – MichaelChirico Sep 22 '19 at 15:23
  • Understood, @Parfait. The answer from Vitali would be sufficient for me. – Tlatwork Sep 22 '19 at 15:31
  • @ThanksGuys, so I will add my comment as an answer then?! – Vitali Avagyan Sep 22 '19 at 15:58

2 Answers


In short, if you want the result to be a list, which is your case, use lapply instead of sapply. sapply is a wrapper around lapply that, by default, simplifies the result to a vector, matrix or array.

The same argument applies against vapply, since, as noted in the comments, it is meant for simplified objects.

So, the neatest solution in this case is:

docsFor <- lapply(urls, read_html)
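One difference from the original for loop is worth noting: the loop stores each document under its URL as the element name, while lapply on an unnamed character vector returns an unnamed list. The sketch below shows one way to keep the names. To stay runnable without network access it uses `fake_read_html`, a hypothetical stand-in (not part of xml2 or rvest) that returns a small classed list; with real pages you would pass `read_html` instead.

```r
# Hypothetical stand-in for read_html(): returns a small classed list,
# so the example runs offline. It is an assumption for illustration only.
fake_read_html <- function(url) {
  structure(list(node = paste0("node:", url), doc = paste0("doc:", url)),
            class = "fake_document")
}

urls <- c("https://www.r-bloggers.com", "https://www.stackoverflow.com")

# Plain lapply: the result is a list, but it carries no names.
docs_unnamed <- lapply(urls, fake_read_html)
names(docs_unnamed)

# Naming the input first makes lapply carry the URLs over as element names,
# which mirrors docsFor[[url]] <- ... in the for loop.
docs_named <- lapply(setNames(urls, urls), fake_read_html)
names(docs_named)
```

With named output, `docs_named[["https://www.r-bloggers.com"]]` works the same way as indexing the list built by the loop.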
Vitali Avagyan

Essentially, each member of the apply family by default returns either:

  • a simplified object (vector, matrix, array) where all elements are the same atomic type, such as logical, integer, double, complex, character or raw;
  • a non-simplified object (data frame, list) whose elements are not necessarily the same type and can include complex, classed objects.

To translate your for loop into an apply-family function, first ask what the input type and the desired output type are. Because read_html returns a special class object built on XML types, it does not fit into an atomic vector or matrix. Therefore, lapply is the most direct translation of the for loop here. However, its siblings can also work with changes to their defaults or inputs:

lapply

lapply(urls, read_html)

apply (requires at least a 2-dimension input such as matrix or array):

apply(matrix(urls), 1, read_html)

sapply (wrapper around lapply, but requires simplify = FALSE)

sapply(urls, read_html, simplify=FALSE)

by (object-oriented wrapper to tapply)

by(urls, urls, function(x) read_html(as.character(x)))

mapply (requires SIMPLIFY = FALSE, which makes it equivalent to its wrapper, Map)

mapply(read_html, urls, SIMPLIFY = FALSE)

Map(read_html, urls)

rapply (requires nested list transformation, with list output)

urls_list <- list(u1 = urls[1], u2 = urls[2])

rapply(urls_list, read_html, how="list")
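As a rough check that several of the variants above produce the same documents, the following sketch compares their results. It assumes a hypothetical stand-in, `fake_read_html`, in place of `read_html` so that it runs without network access; the helper `same_docs` is likewise only for illustration.

```r
# Hypothetical stand-in for read_html(): a classed list with two fields,
# loosely imitating the node/doc layout shown in this answer.
fake_read_html <- function(url) {
  structure(list(node = paste0("node:", url), doc = paste0("doc:", url)),
            class = "fake_document")
}

urls <- c("https://www.r-bloggers.com", "https://www.stackoverflow.com")

via_lapply <- lapply(urls, fake_read_html)
via_sapply <- sapply(urls, fake_read_html, simplify = FALSE)
via_mapply <- mapply(fake_read_html, urls, SIMPLIFY = FALSE)
via_Map    <- Map(fake_read_html, urls)

# Compare the list contents while ignoring the outer element names,
# since lapply leaves an unnamed character input unnamed.
same_docs <- function(a, b) identical(unname(a), unname(b))

all_equal <- same_docs(via_lapply, via_sapply) &&
  same_docs(via_lapply, via_mapply) &&
  same_docs(via_lapply, via_Map)
all_equal
```

The practical difference is naming: sapply, mapply and Map use the URLs as element names for a character input, while lapply does not.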

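The functions below will not give the desired result, because their defaults simplify the output. Instead of a list of documents you get a matrix (or a flattened list) of the underlying external pointers, which print as ?.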

sapply (default setting)

sapply(urls, read_html)

#      https://www.r-bloggers.com https://www.stackoverflow.com
# node ?                          ?                            
# doc  ?                          ?         

vapply (usually only returns simplified objects)

vapply(urls, read_html, vector(mode="list", length=2))

#      https://www.r-bloggers.com https://www.stackoverflow.com
# node ?                          ?                            
# doc  ?                          ?         

mapply (default setting)

mapply(read_html, urls)
#      https://www.r-bloggers.com https://www.stackoverflow.com
# node ?                          ?                            
# doc  ?                          ?         

rapply

rapply(urls_list, read_html)

# $u1.node
# <pointer: 0x8638eb0>

# $u1.doc
# <pointer: 0x6f79b30>

# $u2.node
# <pointer: 0x9c98930>

# $u2.doc
# <pointer: 0x9cb19a0>
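The matrix output above can be reproduced offline with a minimal sketch. It assumes a hypothetical `fake_read_html` that, like the documents shown here, returns a classed list with a `node` and a `doc` element. Because every result has length 2, sapply and vapply bind the results into a 2 x 2 list-matrix and the class is lost. The last call is an addition not found in the original answer: wrapping each result in `list()` gives vapply a length-one list value, so each document stays intact.

```r
# Hypothetical stand-in for read_html(); an assumption for illustration.
fake_read_html <- function(url) {
  structure(list(node = paste0("node:", url), doc = paste0("doc:", url)),
            class = "fake_document")
}

urls <- c("https://www.r-bloggers.com", "https://www.stackoverflow.com")

# Default simplification: equal-length results are bound into a matrix
# of mode "list", with one column per URL.
m_sapply <- sapply(urls, fake_read_html)
m_vapply <- vapply(urls, fake_read_html, FUN.VALUE = vector("list", 2))
dim(m_sapply)
rownames(m_sapply)

# Wrapping each result in list() makes every return value a list of
# length one, which vapply collects into a plain list of documents.
wrapped <- vapply(urls, function(u) list(fake_read_html(u)),
                  FUN.VALUE = list(NULL))
class(wrapped[[1]])
```

In practice lapply remains the simpler choice; the wrapped vapply form is mainly useful if you want vapply's check that each call returns exactly one element.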

See below SO post for further reading:

Grouping functions (tapply, by, aggregate) and the *apply family

Parfait