
I am new to web scraping. I would like to pull out data from this website: https://bpstat.bportugal.pt/dados/explorer

I have managed to get a response using the GET() function from the httr package (though it does not succeed every time I run the code).

library(httr)
URL <- "https://bpstat.bportugal.pt/dados/explorer"
r <- GET(URL)
r
#> Response [https://bpstat.bportugal.pt/dados/explorer]
#>   Date: 2020-04-09 22:25
#>   Status: 200
#>   Content-Type: text/html; charset=utf-8
#>   Size: 3.36 kB
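Since the request does not succeed on every run, one option (a general sketch, not specific to this site) is to use httr's RETRY() instead of GET(), so transient failures are retried automatically, and to check the status before reading the body:

```r
library(httr)

URL <- "https://bpstat.bportugal.pt/dados/explorer"

# RETRY() re-issues the request (up to `times` attempts, with
# backoff between attempts) on connection errors and 5xx responses.
r <- RETRY("GET", URL, times = 5)
stop_for_status(r)  # raises an R error on any non-2xx status

# Read the body as text only once we know the request succeeded
html <- content(r, as = "text", encoding = "UTF-8")
```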

I would like to send a request that replicates these steps, which I currently perform manually:

  • Accept the cookies on the first page

  • In the top right corner, select EN for English

  • Filter by domains – External statistics – Balance of payments

  • External operations - Balance of payments – current and capital accounts – current account – Goods and services account (highlight the following selection):

  • Goods account; Services account; Manufacturing services on physical inputs; Maintenance and repair services; Transport services; Travel; Construction services; Insurance and pension services; Financial services; Charges for the use of intellectual property; Telecommunication, computer & information services; Other services provided by companies; Personal, cultural and recreational services; Government goods and services

  • Counterparty territory: All countries

  • Data type: Credit; Debit

  • Periodicity: Monthly

  • Unit of Measure: Millions of Euros

  • Select all series (click them so they are highlighted in dark blue; at the top of the page, click "Selected members" and then "go to associated series")

  • Go to the associated series (at the bottom of the screen, increase the number shown per page from 10 to 50)

  • Manually tick all boxes except for "seasonally adjusted"

  • Go to "Selection list", then select "See in Table"

  • Download the Excel file via the three vertical dots at the top ("visible data only")

I have seen a couple of examples, such as "Send a POST request using the httr R package", but I don't know what inputs I need to provide...

  • This looks like a job for [`rvest`](https://cran.r-project.org/web/packages/rvest/index.html). It is significantly better equipped for dealing with the tasks you've identified: select something, look in a table, download something. – r2evans Apr 09 '20 at 23:06
  • I actually use this package to read HTML pages and retrieve URLs of existing files. However, what I need is to query the page https://bpstat.bportugal.pt/dados/explorer and then read the data or download the associated Excel file. I do not know how to do that with rvest, nor with httr. – The-Dancing-Machine-Learning Apr 10 '20 at 10:52
  • I have tried to use rvest's functionality, using https://www.analyticsvidhya.com/blog/2017/03/beginners-guide-on-web-scraping-in-r-using-rvest-with-hands-on-knowledge/ as an example, but it does not seem to work in my case. – The-Dancing-Machine-Learning Apr 10 '20 at 11:42
  • I think you may need something like [`RSelenium`](https://cran.r-project.org/web/packages/RSelenium/index.html) if you want to automate selection of series, to be honest. There's enough javascript and other fancy things there that I don't think that `rvest` will suffice. Having said that, once you know all of the `series_id`s that you need, you can just form the URL like this (I downloaded a CSV with a half dozen selected): https://bpstat.bportugal.pt/api/observations/csv/?series_ids=12509268,12510231,12510153,12514786,12509543,12512606,12509469&language=EN, perhaps that can be repeated? – r2evans Apr 10 '20 at 14:56
  • Thank you, I did not realize the URL of the CSV contained the required information! – The-Dancing-Machine-Learning Apr 10 '20 at 22:49
  • It's a common mistake (I've made it) to think that fancy interfaces like that always require the use of `RSelenium`. Sometimes that's correct, but if you know how/where to look, often you find these shortcuts. – r2evans Apr 10 '20 at 23:30
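The CSV endpoint mentioned in the comments can be assembled in R. A minimal sketch, using the example series IDs from the comment above (substitute the IDs of the series you actually selected):

```r
# Example series IDs taken from the comment above; replace with your own.
series_ids <- c(12509268, 12510231, 12510153, 12514786,
                12509543, 12512606, 12509469)

# Assemble the CSV download URL in the same form as the comment's example
csv_url <- paste0(
  "https://bpstat.bportugal.pt/api/observations/csv/",
  "?series_ids=", paste(series_ids, collapse = ","),
  "&language=EN"
)
```

`read.csv(csv_url)` (or `httr::GET(csv_url)` followed by `content()`) should then download the observations directly, with no browser interaction.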

1 Answer


That website has a documented API at https://bpstat.bportugal.pt/data/docs/ which you can use to pull data instead of trying to scrape the pages.

The outputs are in JSON-stat format, and you can use the rjstat package (https://github.com/ajschumacher/rjstat) to make them easier to handle.
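To see how rjstat represents the data, here is a small self-contained sketch (not specific to this API) that round-trips a toy data frame through JSON-stat with `toJSONstat()`/`fromJSONstat()`; the same `fromJSONstat()` call would be applied to the text of an actual API response:

```r
# install.packages("rjstat")  # JSON-stat reader/writer, available on CRAN
library(rjstat)

# A tiny stand-in for an API response: a data frame whose `value`
# column holds the observations and whose other columns are dimensions.
df <- data.frame(account = c("Goods", "Services"),
                 value   = c(100, 200))

js  <- toJSONstat(df)    # serialize to JSON-stat text
out <- fromJSONstat(js)  # parse back into data-frame form
```

Depending on the rjstat version and the JSON-stat class of the input, `fromJSONstat()` may return a single data frame or a list of data frames (one per dataset).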