
I am trying to scrape an HTML or JSON file from a site that lists economists around the world. Here is an example of the page I am trying to use: https://ideas.repec.org/f/pan296.html

More precisely, I am trying to scrape the data shown when clicking the "Export references" button, in JSON, HTML, or any other format. Here is what I do:

  library(rvest)  # provides html_session(), jump_to(), and %>%
  test <- html_session("https://ideas.repec.org/f/pan296.html") %>%
    jump_to("https://ideas.repec.org/cgi-bin/refs.cgi")
  test$response

The connection works, but the response body is empty:

Response [https://ideas.repec.org/cgi-bin/refs.cgi]
  Date: 2020-07-13 08:50
  Status: 200
  Content-Type: text/plain; charset=utf-8
<EMPTY BODY>

Any ideas?

Levon Ipdjian
  • That is a form that uses the HTTP `POST` method. You cannot use `jump_to`. – Aziz Jul 13 '20 at 09:13
  • Thank you for your answer. Do you have an idea of what I should do then? Something with `httr::POST`? – Levon Ipdjian Jul 13 '20 at 09:46
  • Usually, `POST` requests include a "request body" (where data is sent to the server). You can find the request content using the Network tab of Developer Tools in Firefox or Chrome. Then, you can reconstruct the same request in R and send it using `rvest:::request_POST` or `httr::POST` (they're the same). Search SO for "rvest post" and you will find many examples. – Aziz Jul 13 '20 at 09:54
  • scrap means throw away. You mean scrape – DisappointedByUnaccountableMod Feb 25 '21 at 20:09
  • Yes, thank you lol – Levon Ipdjian Feb 26 '21 at 10:37

1 Answer

As Aziz said, you have to observe the POST request in order to reconstruct it. In this case that can be tricky, because the request opens in a new tab. See this topic on how to observe a request that opens in a new tab: Chrome Dev Tools: How to trace network for a link that opens a new tab?
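If tracing the new tab is inconvenient, the form's fields can often be read straight from the page markup instead. Here is a minimal sketch of that idea. The HTML string is a hypothetical stand-in that I wrote for illustration (the real page's markup, field types and `handle` value may differ), so that the snippet runs offline. Replace it with `read_html("https://ideas.repec.org/f/pan296.html")` to inspect the live page.

```r
library(rvest)

# Hypothetical stand-in for the export form; NOT copied from the real page.
html <- '<form method="POST" action="/cgi-bin/refs.cgi" target="_blank">
  <input type="hidden" name="handle" value="EXAMPLE-HANDLE">
  <select name="output"><option value="0">HTML</option></select>
  <input type="submit" name="ref" value="Export references ">
</form>'

page <- read_html(html)

# The form's method and action tell you how and where the data is sent
form <- html_node(page, "form")
form_method <- html_attr(form, "method")
form_action <- html_attr(form, "action")

# Every named <input> contributes one field to the request body
inputs <- html_nodes(form, "input")
fields <- data.frame(
  name  = html_attr(inputs, "name"),
  value = html_attr(inputs, "value"),
  stringsAsFactors = FALSE
)

print(form_method)
print(form_action)
print(fields)
```

Note that `<select>` elements are not matched by the `input` selector, so a field such as `output` has to be looked up separately (for example with `html_nodes(form, "select")`).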

The code to get the export content:

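library(rvest)

url <- "https://ideas.repec.org/f/pan296.html"
pg <- html_session(url)

# Read the author handle from the form's `handle` input
handle_value <- pg %>%
  html_node(xpath = "//form/input[@name='handle']") %>%
  html_attr("value")

# Send the same fields that the "Export references" button submits.
# request_POST() is an unexported rvest function (hence `:::`).
pg <- pg %>% rvest:::request_POST(url = "https://ideas.repec.org/cgi-bin/refs.cgi",
                                  body = list("handle"= handle_value,
                                              "ref" = "Export references ",
                                              "output" = "0"))

pg$response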

(Change the `output` value to get a different output format; 0 is for HTML.)
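Since `rvest:::request_POST` is an internal function, the same request can also be sketched with the exported `httr::POST`. The helper names `build_export_body` and `fetch_export` below are my own, not part of any package, and I am assuming the server accepts httr's default body encoding. If it does not, passing `encode = "form"` to `POST` is worth trying.

```r
library(httr)
library(rvest)

# Build the form fields sent by the "Export references" button.
# (Hypothetical helper; field names are taken from the code above.)
build_export_body <- function(handle, output = "0") {
  list(handle = handle,
       ref    = "Export references ",
       output = as.character(output))
}

# Read the `handle` field from the author page, then POST the form.
# (Hypothetical helper; returns the response body as a single string.)
fetch_export <- function(author_url, output = "0",
                         endpoint = "https://ideas.repec.org/cgi-bin/refs.cgi") {
  page <- read_html(author_url)
  handle <- page %>%
    html_node(xpath = "//form/input[@name='handle']") %>%
    html_attr("value")
  if (is.na(handle)) stop("No 'handle' field found on ", author_url)

  resp <- POST(endpoint, body = build_export_body(handle, output))
  stop_for_status(resp)  # raise an R error on HTTP 4xx/5xx
  content(resp, as = "text", encoding = "UTF-8")
}

# Usage (needs network access):
# txt <- fetch_export("https://ideas.repec.org/f/pan296.html", output = "0")
# cat(substr(txt, 1, 500))
```

Keeping the body construction in its own function makes it easy to check the fields without touching the network, and to switch the `output` value per call.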


xwhitelight