I have this for loop in an R script:

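library(rvest)  # html_session(), html_nodes(), html_attr(), and the %>% pipe
library(httr)   # config()

url <- "https://example.com"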
page <- html_session(url, config(ssl_verifypeer = FALSE))

links <- page %>% 
  html_nodes("td") %>% 
  html_nodes("tr") %>%
  html_nodes("a") %>% 
  html_attr("href")

base_names <- page %>%
  html_nodes("td") %>% 
  html_nodes("tr") %>%
  html_nodes("a") %>% 
  html_attr("href") %>%
  basename()

for (i in seq_along(links)) {

  site <- html_session(URLencode(
    paste0("https://example.com", links[i])),
    config(ssl_verifypeer = FALSE))

  writeBin(site$response$content, base_names[i])
} 

This loops through the links and downloads each text file to my working directory. I'm wondering whether I can put return somewhere so that it also returns each document.

The reason is that I'm executing my script in NiFi (using ExecuteProcess), and it isn't sending my scraped documents down the line. Instead, it just shows the head of my R script. I assume I would wrap the for loop in something like fun <- function(x) {}, but I'm not sure how to integrate the x into an already working scraper.

I need it to return documents down the flow, and not just this:

[screenshot of the NiFi output]

Processor config:

[screenshot of the processor config]

Even if you are not familiar with NiFi, help with the R part would be great. Thanks!

papelr
  • Instead of a `for` loop, you can use `out <- Map(function(link, bn) { site <- ...link...; writeBin(..., bn); return(site$response$content); }, links, base_names)`. (BTW: you're missing the definition of `base_names`.) – r2evans Feb 01 '19 at 17:53
  • Woops, fixed that names mistake @r2evans, my bad. What does the `bn` refer to in your comment? – papelr Feb 01 '19 at 17:55
  • `bn` is the arbitrary name I used as the second argument in the anonymous function; `Map` "zips" together the paired elements of `links` and `base_names`, assigning the first elements each to `link` and `bn` respectively within the function. If `lapply(links, function(link) {...})` works with just links, then `Map(function(link, bn) {...}, links, base_names)` is the equivalent with both. – r2evans Feb 01 '19 at 17:59
  • So `Map(myfunc, links, base_names)` unrolls to `myfunc(links[1], base_names[1])`, then `myfunc(links[2], base_names[2])`, etc. `Map` always returns a `list`, whereas `mapply` *may* return a vector (analogous to `lapply`-vs-`sapply`). – r2evans Feb 01 '19 at 18:01
  • I'm attempting to adapt your first comment (thanks for the logic of it, btw). I'm assuming I should get rid of the semicolons? Apologies for the dumb question. Not as familiar with `Map` as I should be. It's hard to picture with comment formatting. If you'd like to put in an answer, that would be helpful – papelr Feb 01 '19 at 18:06

1 Answer

If your intent is to both (1) save the output (with writeBin) and (2) return the values (in a list), then try this:

out <- Map(function(ln, bn) {
  site <- html_session(URLencode(
    paste0("https://example.com", ln)),
    config(ssl_verifypeer = FALSE))
  writeBin(site$response$content, bn)
  site$response$content
}, links, base_names)
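
One thing to be aware of: site$response$content is a raw vector (bytes), so each element of out prints as hex pairs rather than as text. Here is a minimal sketch of turning one element back into text. It uses a made-up byte vector in place of a real download (fake_content is hypothetical) and assumes the files are plain text:

```r
# Stand-in for site$response$content: the bytes of a small text file (hypothetical data)
fake_content <- charToRaw("line one\nline two\n")

# Raw vectors print as hex bytes, not as readable text
print(fake_content)

# Convert the bytes back into a single character string
txt <- rawToChar(fake_content)

# Write the text to standard output, e.g. for a caller that captures stdout
cat(txt)
```

If the calling process captures the script's standard output (which may be the case for NiFi's ExecuteProcess, though that is an assumption about the setup and not something tested here), calling cat(rawToChar(x)) on each element of out might be what sends the documents onward.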

The use of Map "zips" together the individual elements. In the base case, the following are equivalent (apart from Map using an unnamed character vector's values as the names of the result):

Map(myfunc, list1)
lapply(list1, myfunc)

But if you want to use same-index elements from multiple lists, you can do one of

lapply(seq_along(list1), function(i) myfunc(list1[[i]], list2[[i]], list3[[i]]))
Map(myfunc, list1, list2, list3)

where unrolling Map results effectively in:

myfunc(list1[[1]], list2[[1]], list3[[1]])
myfunc(list1[[2]], list2[[2]], list3[[2]])
# ...
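
To see that equivalence on something small, here is a sketch with a toy myfunc and three short lists (all of the names and values are made up for illustration):

```r
# Toy function standing in for myfunc (hypothetical)
myfunc <- function(a, b, c) paste(a, b, c)

list1 <- list("a", "b")
list2 <- list(1, 2)
list3 <- list(TRUE, FALSE)

# Zip the three lists together element by element
via_map <- Map(myfunc, list1, list2, list3)

# The same thing spelled out with an index
via_lapply <- lapply(seq_along(list1), function(i) {
  myfunc(list1[[i]], list2[[i]], list3[[i]])
})

# Compare the two results
print(identical(via_map, via_lapply))
```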

The biggest difference between lapply and Map here is that lapply iterates over only one vector, whereas Map accepts one or more (practically unlimited), zipping them together. Shorter arguments are recycled to the length of the longest, so in practice the lists should all be the same length or length 1. That makes it legitimate to do something like

Map(myfunc, list1, list2, "constant string")
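
A small sketch of that recycling, again with a hypothetical myfunc and made-up inputs:

```r
# Hypothetical helper that joins each pair and appends a shared suffix
myfunc <- function(x, y, suffix) paste0(x, "-", y, suffix)

list1 <- c("a", "b", "c")
list2 <- 1:3

# The length-1 third argument is reused for every call
res <- Map(myfunc, list1, list2, ".txt")

# Flatten the list to a plain character vector for inspection
print(unname(unlist(res)))
```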

Note: Map-versus-mapply is similar to lapply-vs-sapply. In both pairs, the first always returns a list, while the second simplifies the result to a vector (or a matrix, when every return value has the same length greater than 1) if and only if all return values have the same length; otherwise it, too, returns a list.
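
A quick sketch of that difference in return types, using two throwaway functions (same_len and vary_len are hypothetical names):

```r
same_len <- function(x, y) x + y      # every call returns a length-1 value
vary_len <- function(x, n) rep(x, n)  # return length depends on n

m1 <- Map(same_len, 1:3, 4:6)     # always a list
m2 <- mapply(same_len, 1:3, 4:6)  # simplified to a plain vector
m3 <- mapply(vary_len, 1:3, 1:3)  # lengths differ, so it stays a list

print(is.list(m1))
print(is.list(m2))
print(is.list(m3))
```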

r2evans
  • No `return` before the last `site$response$content` in this case? Otherwise, I think this is a great showcase of what to use instead of a loop. Where can I send you a beer? – papelr Feb 01 '19 at 18:54
  • `return` is implied here ... use `return(site$response$content)` if you prefer – r2evans Feb 01 '19 at 18:58
  • Got it.. if this returns a bunch of numbers, instead of document (scraping `.txt` files), is that wrong? (I really will send you beer money, I appreciate your patience) – papelr Feb 01 '19 at 19:06
  • is the return value different from the files created by `writeBin`? – r2evans Feb 01 '19 at 19:17
  • Actually figured it out... it was dumping files in a random NiFi directory within the container (only could be seen through terminal). What I'm doing isn't ideal...but the client wants... so ya know – papelr Feb 02 '19 at 18:28