I am looking at this great answer: https://stackoverflow.com/a/58211397/3502164.

The beginning of the solution includes:

library(httr)
library(xml2)

gr <- GET("https://nzffdms.niwa.co.nz/search")
doc <- read_html(content(gr, "text"))

xml_attr(xml_find_all(doc, ".//input[@name='search[_csrf_token]']"), "value")

Output is constant across multiple requests:

"59243d3a2....61f8f73136118f9"

My default approach so far would have been:

doc <- read_html("https://nzffdms.niwa.co.nz/search")
xml_attr(xml_find_all(doc, ".//input[@name='search[_csrf_token]']"), "value")

That result differs from the output above and changes across multiple requests.

Question:

What is the difference between:

  • read_html(url)
  • read_html(content(GET(url), "text"))

Why do the two return different values, and why does only the "GET" solution return the CSV in the linked question?

(I hope it's OK to structure this as three sub-questions.)

What I tried:

Going down the rabbit hole of function calls:

read_html
(ms <- methods("read_html"))
getAnywhere(ms[1])
xml2:::read_html
xml2:::read_html.default
#xml2:::read_html.response

read_xml
(ms <- methods("read_xml"))
getAnywhere(ms[1])

But that led to this question: Find the used method for R wrapper functions

Thoughts:

  • I don't see the GET request sending any headers or cookies that could explain the different responses.

  • From my understanding, both read_html(url) and read_html(content(GET(url), "text")) will return XML/HTML.

  • I am not sure whether this makes sense to check, but because I ran out of ideas, I checked whether some kind of caching is going on.

Code:

with_verbose(GET("https://nzffdms.niwa.co.nz/search"))
....
<- Expires: Thu, 19 Nov 1981 08:52:00 GMT
<- Cache-Control: no-store, no-cache, must-revalidate, post-check=0, pre-check=0

--> It does not look to me like caching is the explanation.

  • Looking at help("GET") gives an interesting section concerning a "conditional GET":

The semantics of the GET method change to a "conditional GET" if the request message includes an If-Modified-Since, If-Unmodified-Since, If-Match, If-None-Match, or If-Range header field. A conditional GET method requests that the entity be transferred only under the circumstances described by the conditional header field(s). The conditional GET method is intended to reduce unnecessary network usage by allowing cached entities to be refreshed without requiring multiple requests or transferring data already held by the client.

But as far as I can see with with_verbose(), none of If-Modified-Since, If-Unmodified-Since, If-Match, If-None-Match, or If-Range are set.

Tlatwork
  • woah...due to proxy (which i have already `set_config`), i get error for connection timeout for `read_html("http://httpbin.org/")` but not for `read_html(GET("http://httpbin.org/"))`...seems like a bug...back to your qn, i am not 100% sure but your final thought seems reasonable and maybe keep-connection alive is set in the first but not 2nd ...see also https://stackoverflow.com/questions/5207160/what-is-a-csrf-token-what-is-its-importance-and-how-does-it-work – chinsoon12 Oct 03 '19 at 23:59

1 Answer

The difference is that with repeated calls to httr::GET, the handle persists between calls. With xml2::read_html(), a new connection is made each time.

From the httr documentation:

The handle pool is used to automatically reuse Curl handles for the same scheme/host/port combination. This ensures that the http session is automatically reused, and cookies are maintained across requests to a site without user intervention.
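
The pooling described in that passage can be inspected without making any request. The following is a minimal sketch, assuming that httr's exported handle_find() returns the pooled handle for a URL; the variable names and the second host are only for illustration:

library(httr)

# Two URLs with the same scheme/host/port should map to one pooled handle
h1 <- handle_find("https://nzffdms.niwa.co.nz/search")
h2 <- handle_find("https://nzffdms.niwa.co.nz/")
same_pool <- identical(h1, h2)

# A URL on a different host should get a handle of its own
h3 <- handle_find("https://httpbin.org/")
other_host <- identical(h1, h3)

print(c(same_pool = same_pool, other_host = other_host))

Because the pool is keyed on scheme, host and port rather than on the full path, every GET to the same site goes through the same handle and so shares that handle's cookies.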

From the xml2 documentation, discussing the string parameter that is passed to read_html():

A string can be either a path, a url or literal xml. Urls will be converted into connections either using base::url or, if installed, curl::curl

So the answer is that read_html(GET(url)) is like refreshing your browser, whereas read_html(url) is like closing your browser and opening a new one. The server issues a unique session ID with the page it delivers: new session, new ID. You can demonstrate this by calling httr::handle_reset(url):

library(httr)
library(xml2)

# GET the page (note xml2 handles httr responses directly, don't need content("text"))
gr <- GET("https://nzffdms.niwa.co.nz/search")
doc <- read_html(gr)
print(xml_attr(xml_find_all(doc, ".//input[@name='search[_csrf_token]']"), "value"))

# A new GET using the same handle gets exactly the same response
gr <- GET("https://nzffdms.niwa.co.nz/search")
doc <- read_html(gr)
print(xml_attr(xml_find_all(doc, ".//input[@name='search[_csrf_token]']"), "value"))

# Now call GET again after resetting the handle
httr::handle_reset("https://nzffdms.niwa.co.nz/search")
gr <- GET("https://nzffdms.niwa.co.nz/search")
doc <- read_html(gr)
print(xml_attr(xml_find_all(doc, ".//input[@name='search[_csrf_token]']"), "value"))

In my case, sourcing the above code gives me:

[1] "ecd9be7c75559364a2a5568049c0313f"
[1] "ecd9be7c75559364a2a5568049c0313f"
[1] "d953ce7acc985adbf25eceb89841c713"
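
If you would rather not depend on the implicit pool, the same comparison can be made with explicit handles. This is only a sketch: csrf_token(), token_via() and compare_sessions() are hypothetical helper names, and it assumes the server keeps tying the token to the session cookie as it did above.

library(httr)
library(xml2)

# Hypothetical helper: extract the CSRF token from a parsed page
csrf_token <- function(doc) {
  xml_attr(xml_find_first(doc, ".//input[@name='search[_csrf_token]']"), "value")
}

# Hypothetical helper: request the search page through a specific handle
token_via <- function(h) {
  csrf_token(read_html(GET(handle = h, path = "search")))
}

# Two independent handles mean two independent cookie jars (sessions)
compare_sessions <- function(base = "https://nzffdms.niwa.co.nz") {
  h_a <- handle(base)
  h_b <- handle(base)
  c(a_first  = token_via(h_a),  # first request on handle A
    a_second = token_via(h_a),  # same handle, so the session cookie is sent again
    b_first  = token_via(h_b))  # fresh handle, no cookie yet
}

Calling compare_sessions() makes three requests. Under the assumption above, I would expect a_first and a_second to match and b_first to differ, which mirrors the handle_reset() experiment.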
Allan Cameron