
For a while now, I have been using R and the RCurl package to automatically download information from webpages; I normally use simple functions like getURL(), getForm() and postForm(). I usually just find the HTML parameters and their possible values and fill them in. However, I came across a webpage that I think cannot be downloaded using these functions because I cannot find any parameters in the URL. I believe this is happening because the webpage is written in JavaScript, and I don't know how to deal with it. I am a mathematician with vast experience using R but with very basic knowledge of HTML and no knowledge at all of JavaScript.

I don't necessarily need to use R directly; I could use other software and then import the result into R. I have found a Mozilla application called MozRepl, but I was unable to make it work. I would appreciate it if someone with more experience could help me with a solution, whether using different software or the appropriate commands in R or MozRepl. If it is not possible to download the information directly into an R variable, it would be OK to save it to a text file.

The information I want to download is produced after selecting a date value at the following URL and then clicking the button called "Consultar TIIE". A table is produced with the variables "Posturas", "Montos" and "Participantes".

http://www.banxico.org.mx/tiieban/leeArgumentos.faces?BMXC_plazo=28&BMXC_semanas=4

I am doing this because my final objective is to put the information together into a dataframe.

2 Answers


There is no issue with JavaScript here. The JavaScript simply creates the calendar so you can pick the date to submit with the form. There are, however, issues with a lot else.

On the server side, it seems they are trying to detect non-browser attempts to pull the data. They also redirect once the form is correctly submitted, which causes an issue.

library(RCurl)
library(XML)

appDate <- "20130502"  # query date, YYYYMMDD
rURL <- "http://www.banxico.org.mx/tiieban/leeArgumentos.faces?BMXC_plazo=28&BMXC_semanas=4"
usera <- "Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:21.0) Gecko/20100101 Firefox/21.0"

# One handle for both requests: stores cookies, follows the redirect and sends browser-like headers
curl <- getCurlHandle(cookiefile = "", verbose = TRUE, useragent = usera,
                      followlocation = TRUE, autoreferer = TRUE, postredir = 2,
                      httpheader = c(Accept = "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
                                     "Accept-Encoding" = "gzip, deflate",
                                     "Accept-Language" = "en-US,en;q=0.5",
                                     Connection = "keep-alive"),
                      referer = "http://www.banxico.org.mx/tiieban/leeArgumentos.faces")

# Load the form page first so any cookies it sets are stored on the handle
txt <- getURLContent(rURL, curl = curl, verbose = TRUE)

# Form fields, with the names and the button value already encoded
fParams <- c("leeArgumentos%3Afecha" = appDate,
             "leeArgumentos%3Asubmit" = "Consultar+TIIE",
             "leeArgumentos" = "leeArgumentos")

# Submit the form; with binary = TRUE the response is a raw vector
res <- postForm(rURL, .params = fParams, style = "post", curl = curl, binary = TRUE)
xRes <- htmlParse(rawToChar(res))
# The third table in the response holds the results
readHTMLTable(getNodeSet(xRes, "//*/table")[[3]])

  Posturas Montos                      Participantes
1   4.3100    350 Banco Credit Suisse (México), S.A.
2   4.3245    350                 Banco Inbursa S.A.
3   4.3200    350                   Banco Invex S.A.
4   4.3375    350     Banco Mercantil del Norte S.A.
5   4.3350    350      Banco Nacional de México S.A.
6   4.3250    350                   HSBC México S.A.
7   4.3300    350          ScotiaBank Inverlat, S.A.
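
Since the stated goal is a single data frame, tables like the one above could be stacked once one has been scraped per date. The following is only a sketch: tidy_tiie is a hypothetical helper, not part of RCurl or XML, and sample_tbl is a stand-in built from two rows of the output above and reused for a second, made-up date purely for illustration. It assumes the scraped columns arrive as text, which is why they are converted explicitly.

```r
# Hypothetical helper: tag one scraped table with its query date and
# convert the numeric columns from text to numbers.
tidy_tiie <- function(tbl, date_string) {
  data.frame(
    Fecha         = as.Date(date_string, format = "%Y%m%d"),
    Posturas      = as.numeric(as.character(tbl$Posturas)),
    Montos        = as.numeric(as.character(tbl$Montos)),
    Participantes = as.character(tbl$Participantes),
    stringsAsFactors = FALSE
  )
}

# Stand-in for scraped tables: two rows copied from the output above.
sample_tbl <- data.frame(
  Posturas      = c("4.3245", "4.3200"),
  Montos        = c("350", "350"),
  Participantes = c("Banco Inbursa S.A.", "Banco Invex S.A."),
  stringsAsFactors = FALSE
)

# Placeholder list keyed by date; the second entry just reuses the sample rows.
tables <- list("20130502" = sample_tbl, "20130503" = sample_tbl)

# Stack every per-date table into one data frame.
tiie <- do.call(rbind, Map(tidy_tiie, tables, names(tables)))
rownames(tiie) <- NULL
str(tiie)
```

In real use, each element of tables would be the result of the readHTMLTable() call for one value of appDate.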

There are many things going on. The form parameters need encoding: leeArgumentos:fecha needs to be leeArgumentos%3Afecha, for example. The user agent is probably being checked, as are the referrer string and various other headers.
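
As a small illustration of the encoding step, the encoded names can be derived rather than typed by hand. This is a sketch using URLencode() from base R's utils package; form_encode is a hypothetical helper name, and replacing "%20" with "+" is the usual convention for form-encoded spaces.

```r
# Hypothetical helper: percent-encode reserved characters, then write
# spaces as "+" as is conventional for form-encoded data.
form_encode <- function(x) {
  gsub("%20", "+", utils::URLencode(x, reserved = TRUE), fixed = TRUE)
}

encoded_name  <- form_encode("leeArgumentos:fecha")
encoded_value <- form_encode("Consultar TIIE")

print(encoded_name)   # "leeArgumentos%3Afecha"
print(encoded_value)  # "Consultar+TIIE"
```

These match the strings hard-coded in fParams above.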

user1609452

This does look like a JavaScript problem rather than something directly related to web scraping in R.

There are a variety of approaches to this issue; you might take a look at "Scraping Javascript generated data" and also the suggestions at "Language for web scraping JAVASCRIPT content".

The example you point to appears to run a custom script, show_calendar2, defined here: http://www.banxico.org.mx/tiieban/scripts/ts_picker2.js

cboettig