
I'd like to use Chromote to gather the response bodies of the XHR calls made by a website, but I find the API a bit complex to master, especially the async pipeline.

I guess I need to first enable the Network functionality and then load the page (this I can do), but then I need to:

  • list all XHR calls
  • filter them by recognizing patterns in the request URL
  • access the response body of the selected resources

Can someone provide guidance or tutorial material in this regard?
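
Here is a minimal, untested sketch of what I think this would look like in Chromote itself, for reference. The event and method names come from the DevTools Protocol Network domain as exposed by Chromote's auto-generated API; the "api/" pattern and the URL are placeholders, and the callback_/wait_ argument names are my assumption from Chromote's conventions:

library(chromote)
library(jsonlite)

b <- ChromoteSession$new()
b$Network$enable()

xhr_bodies <- list()

# Register a persistent callback: fires once per finished response.
b$Network$responseReceived(callback_ = function(msg) {
  if (msg$type == "XHR" && grepl("api/", msg$response$url)) {
    # Fetch the body asynchronously now that the response has arrived.
    b$Network$getResponseBody(requestId = msg$requestId, wait_ = FALSE)$
      then(function(resp) {
        body <- if (isTRUE(resp$base64Encoded)) {
          rawToChar(base64_dec(resp$body))
        } else resp$body
        xhr_bodies[[msg$requestId]] <<- body
      })
  }
})

b$Page$navigate("https://example.com")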

UPDATE: OK, I switched to the crrri package and wrote a general function for the purpose. The only missing part is some logic to decide when to close the connection and return the results:

get_website_resources <- function(url, url_filter = '', type_filter = '') {
  # The filters are regular expressions passed to str_detect(); the default
  # '' matches everything (a bare '*' would not be a valid regex).
  library(crrri)
  library(dplyr)
  library(stringr)
  library(jsonlite)
  library(magrittr)

  chrome <- Chrome$new()

  # Collect results in an environment so the async callbacks can mutate it.
  out <- new.env()
  out$l <- list()

  client <- chrome$connect(callback = ~ NULL)

  Fetch <- client$Fetch
  Page <- client$Page

  # Pause every request at the Response stage so the body is available.
  Fetch$enable(patterns = list(list(urlPattern = "*", requestStage = "Response"))) %...>% {
    Fetch$requestPaused(callback = function(params) {

      if (str_detect(params$request$url, url_filter) &&
          str_detect(params$resourceType, type_filter)) {

        Fetch$getResponseBody(requestId = params$requestId) %...>% {
          resp <- .

          if (resp$body != '') {
            if (resp$base64Encoded) resp$body <- base64_dec(resp$body) %>% rawToChar()

            body <- list(list(
              url = params$request$url,
              response = resp
            )) %>% set_names(params$requestId)

            str(body)

            out$l <- append(out$l, body)
          }

        }
      }

      # Let the request proceed whether or not it was captured.
      Fetch$continueRequest(requestId = params$requestId)
    })
  } %...>% {
    Page$navigate(url)
  }

  # NB: this returns immediately, before the async pipeline has run.
  out$l
}
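
Note that, as written, the function returns out$l immediately, before the asynchronous %...>% pipeline has run, which is exactly the missing "when to stop" part. A toy illustration of the timing, using only promises and later:

library(promises)

later_value <- NULL
promise_resolve(42) %...>% { later_value <<- . }

print(later_value)                             # NULL: the callback hasn't run yet
while (!later::loop_empty()) later::run_now()  # drain the event loop
print(later_value)                             # 42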

1 Answer


Cracked it. Here's the final function. It uses crrri::perform_with_chrome, which forces synchronous behaviour, and runs the rest of the process inside a promise whose resolve callback is defined outside the promise itself; the callback is invoked either when a given number of resources has been collected or when a certain amount of time has passed:
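
The pattern in isolation looks like this (a minimal sketch, independent of crrri): the resolve callback is stashed in an enclosing environment so that code outside the promise body can settle the promise later:

library(promises)

resolver <- NULL

p <- promise(function(resolve, reject) {
  resolver <<- resolve  # capture resolve for later use
})

p %...>% print()  # runs once the promise is settled

# ... later, from any callback (event handler, timer, etc.):
resolver("done")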

get_website_resources <- function(url, url_filter = '', type_filter = '',
                                  wait_for = 20, n_of_resources = NULL,
                                  interactive = FALSE) {

    # The filters are regular expressions for str_detect(); the default ''
    # matches everything (a bare '*' would not be a valid regex).
    library(crrri)
    library(promises)
    library(stringr)   # str_detect()
    library(magrittr)  # %>% and set_names()

    crrri::perform_with_chrome(function(client) {
        Fetch <- client$Fetch
        Page <- client$Page

        if (interactive) client$inspect()

        # Environment shared with the async callbacks.
        out <- new.env()

        out$results <- list()
        out$resolve_function <- NULL

        out$pr <- promises::promise(function(resolve, reject) {
            # Stash the resolve callback so it can be invoked from outside
            # the promise body (on resource count reached or on timeout).
            out$resolve_function <- resolve

            # Pause every request at the Response stage so the body exists.
            Fetch$enable(patterns = list(list(urlPattern = "*", requestStage = "Response"))) %...>% {
                Fetch$requestPaused(callback = function(params) {

                    if (str_detect(params$request$url, url_filter) &&
                        str_detect(params$resourceType, type_filter)) {

                        Fetch$getResponseBody(requestId = params$requestId) %...>% {
                            resp <- .

                            if (resp$body != '') {
                                if (resp$base64Encoded) resp$body <- jsonlite::base64_dec(resp$body) %>% rawToChar()

                                body <- list(list(
                                    url = params$request$url,
                                    response = resp
                                )) %>% set_names(params$requestId)

                                out$results <- append(out$results, body)

                                # Resolve early once enough resources are collected.
                                if (!is.null(n_of_resources) && length(out$results) >= n_of_resources) {
                                    out$resolve_function(out$results)
                                }
                            }

                        }
                    }

                    # Let the request proceed whether or not it was captured.
                    Fetch$continueRequest(requestId = params$requestId)
                })
            } %...>% {
                Page$navigate(url)
            } %>% crrri::wait(wait_for) %>%
                # Fallback: resolve with whatever was collected after wait_for seconds.
                then(~ out$resolve_function(out$results))

        })

        out$pr$then(function(x) x)
    }, timeouts = max(wait_for + 3, 30), cleaning_timeout = max(wait_for + 3, 30))
}
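
A hypothetical call (the URL and filter patterns are placeholders; both filters are regular expressions, matched against the request URL and the resource type respectively):

res <- get_website_resources(
  'https://example.com',
  url_filter = 'api/',
  type_filter = 'XHR',
  wait_for = 10,
  n_of_resources = 5
)

str(res, max.level = 2)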
  • Very helpful, thanks for answering your own question. It helped me clean up a similar bit of code. Any reason why you chose to bypass the url filtering functionality of `Fetch$enable` and reimplement it later in the code? – D. Woods Nov 13 '20 at 03:35