
There are two parts to my question, as I explored two methods in this exercise; however, I succeeded with neither. I would greatly appreciate it if someone could help me out.

[PART 1:]

I am attempting to scrape data from a page on the Singapore Exchange, https://www2.sgx.com/derivatives/negotiated-large-trade, where the data is stored in a table. I have some basic knowledge of scraping data with rvest. However, looking at the page with the Inspector in Chrome, the HTML hierarchy is much more complex than I expected. I can see that the data I want is hidden under <div class="table-container">, and here's what I've tried:

library(rvest)
library(httr)

SGXurl <- "https://www2.sgx.com/derivatives/negotiated-large-trade"
SGXdata <- read_html(SGXurl)
html_nodes(SGXdata, ".table-container")  # returns an empty node set

However, nothing is picked up by the code, and I doubt whether I'm using it correctly.
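In fact, a quick check (a minimal sketch, reusing the objects above) suggests the selector isn't anywhere in the static HTML that read_html receives, i.e. the table is probably rendered by JavaScript after the page loads:

length(html_nodes(SGXdata, ".table-container"))   # 0: the selector matches nothing
grepl("table-container", as.character(SGXdata))   # FALSE if the class never appears in the raw source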

[PART 2:]

Then I realised that there's a small "Download" button on the page which downloads exactly the data file I want, in .csv format. So I was thinking of writing some code to mimic the download button, and I found this question: Using R to "click" a download file button on a webpage. However, I'm unable to get it to work, even with some modifications to that code.

There are a few filters on the webpage; mostly I will be interested in downloading data for a particular business day while leaving the other filters blank, so I've tried writing the following function:

library(httr)
library(rvest)
library(purrr)
library(dplyr)

crawlSGXdata <- function(date){
  # mimic the Download button with a form POST, saving the response to disk
  res <- POST("https://www2.sgx.com/derivatives/negotiated-large-trade",
              body = NULL,
              encode = "form",
              write_disk("SGXdata.csv", overwrite = TRUE))
  read.csv("SGXdata.csv")
}

I intended to put the function input "date" into the "body" argument, but I was unable to figure out how to do that, so I started off with "body = NULL", assuming it does no filtering. However, the result is still unsatisfactory. The downloaded file is basically empty, apart from the following error:

Request Rejected
The requested URL was rejected. Please consult with your administrator.
Your support ID is: 16783946804070790400
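
For completeness, what I had in mind for the date filter was something like the sketch below; the form field name "businessdate" is purely my guess, since I couldn't work out the page's real form parameters:

library(httr)

crawlSGXdata <- function(date){   # date as a string, e.g. "20190708"
  res <- POST("https://www2.sgx.com/derivatives/negotiated-large-trade",
              body = list(businessdate = date),   # hypothetical field name, not confirmed
              encode = "form",
              write_disk("SGXdata.csv", overwrite = TRUE))
  read.csv("SGXdata.csv")
}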
1 Answer

The content is loaded dynamically from an API call that returns JSON. You can find this in the network tab of your browser's dev tools.

The following returns that content. I find the total number of pages of results, then loop over the remaining pages, combining the dataframe returned from each call into one final dataframe containing all results.

library(jsonlite)

url <- 'https://api.sgx.com/negotiatedlargetrades/v1.0?order=asc&orderby=contractcode&category=futures&businessdatestart=20190708&businessdateend=20190708&pagestart=0&pageSize=250'
r <- jsonlite::fromJSON(url)     # first page (pagestart=0)
num_pages <- r$meta$totalPages   # total page count reported by the API
df <- r$data
url2 <- 'https://api.sgx.com/negotiatedlargetrades/v1.0?order=asc&orderby=contractcode&category=futures&businessdatestart=20190708&businessdateend=20190708&pagestart=placeholder&pageSize=250'

if(num_pages > 1){
  for(i in seq(1, num_pages - 1)){   # pages are 0-indexed and page 0 is already loaded
    newUrl <- gsub("placeholder", i, url2)
    newdf <- jsonlite::fromJSON(newUrl)$data
    df <- rbind(df, newdf)
  }
}
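
If you want this as a function of the business day, as the question asked, the same approach can be wrapped up as below. This is a sketch under the same assumptions as above: the date goes in as a "yyyymmdd" string, and the category=futures filter from the URL is kept.

library(jsonlite)

crawlSGXdata <- function(date){   # date as a "yyyymmdd" string, e.g. "20190708"
  base <- paste0("https://api.sgx.com/negotiatedlargetrades/v1.0",
                 "?order=asc&orderby=contractcode&category=futures",
                 "&businessdatestart=", date,
                 "&businessdateend=", date,
                 "&pageSize=250&pagestart=")
  r <- jsonlite::fromJSON(paste0(base, "0"))   # first page
  df <- r$data
  num_pages <- r$meta$totalPages
  if(num_pages > 1){
    for(i in seq(1, num_pages - 1)){           # remaining 0-indexed pages
      df <- rbind(df, jsonlite::fromJSON(paste0(base, i))$data)
    }
  }
  df
}

df <- crawlSGXdata("20190708")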
  • this code looks pretty alien to me and I'm not really sure what JSON is; do you have any suggestions on where I should start learning this stuff? thanks – Jeremy Jul 09 '19 at 01:12
  • hi QHarr, I realize this code is only able to crawl the top 250 rows of data even if I change "pageSize=250" to a larger number; do you have any quick solution for this? thank you! – Jeremy Jul 09 '19 at 03:01
  • [This](https://en.wikipedia.org/wiki/JSON) explains JSON. It is basically a file format. And the obligatory link: https://www.json.org/ – QHarr Jul 09 '19 at 17:25
  • Hi QHarr, thanks for your comments, and I would definitely accept your answer. Sorry I'm a bit slow here, as I'm new to this site. Probably just one last question: how did you manage to find the API link "https://api.sgx.com/negotiatedlargetrades/v1.0?....."? I'm having trouble scraping data from another page, https://www2.sgx.com/derivatives/delayed-prices-futures?cc=FEF&category=iron-ore, and I would very much like to know how I can find the API link by myself. Thanks once again – Jeremy Jul 18 '19 at 02:49
  • Please see my descriptions [here](https://stackoverflow.com/a/56279841/6241235) and [here](https://stackoverflow.com/a/56924071/6241235) – QHarr Jul 18 '19 at 05:02