
I'm working on a project with the World Bank analyzing their procurement processes.

The WB maintains websites for each of their projects, containing links and data for the associated contracts issued (example). Contract-related data is available under the procurement tab.

I'd like to pull a project's contract information from this site, but the links and associated data are generated by embedded JavaScript, and the URLs of the pages displaying contract awards and other data don't seem to follow a discernible schema (example).

Is there any way I can scrape the browser-rendered data in the first example using R?

MrT
  • Sorry, I'm a bit confused. How do you get from the first link shared to the second link? And, I assume that it is the data at the second link you want to scrape, right? I'm not clear on the actual question you are trying to ask here. – A5C1D2H2I1M1N2O1R2T1 Mar 11 '13 at 04:24
  • To get to the contract award (2nd link) from the project page (1st link), you need to go to the PROCUREMENT tab and then to the Contract Awards submenu. The contract award example is the first entry in the table. I already have something to scrape the data on the 2nd link; what I'm looking for is a way to find the 2nd link from the 1st (this is the Javascript generated bit). – MrT Mar 11 '13 at 14:44

2 Answers

5

The main page calls a JavaScript function:

javascript:callTabContent('p','P090644','','en','procurement','procurementId');

The key piece here is the project ID, P090644. Together with the required language code, en, it is passed as a parameter to a form at http://www.worldbank.org/p2e/procurement.html.

This form call can be replicated with a URL: http://www.worldbank.org/p2e/procurement.html?lang=en&projId=P090644.
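As a rough illustration of the mapping from the JavaScript call to that URL, here is a minimal base-R sketch. The helper names parse_tab_call and build_tab_url are hypothetical, and the argument names are borrowed from the callTabContent signature in the site's script.js; it assumes the link always has six single-quoted arguments in that order.

```r
# Hypothetical helper: pull the six quoted arguments out of a callTabContent(...) link.
parse_tab_call <- function(js_call) {
  # Keep only the text between the outer parentheses
  args_str <- sub("^.*callTabContent\\((.*)\\).*$", "\\1", js_call)
  # Extract each single-quoted argument, including empty ones
  quoted <- regmatches(args_str, gregexpr("'[^']*'", args_str))[[1]]
  tab_args <- gsub("'", "", quoted)
  # Assumed argument order, taken from the function signature in script.js
  names(tab_args) <- c("tabparam", "projIdParam", "contextPath",
                       "langCd", "htmlId", "anchorTagId")
  tab_args
}

# Hypothetical helper: build the GET URL that mirrors the form call.
build_tab_url <- function(tab_args, base = "http://www.worldbank.org") {
  paste0(base, tab_args[["contextPath"]], "/p2e/", tab_args[["htmlId"]], ".html",
         "?lang=", tab_args[["langCd"]], "&projId=", tab_args[["projIdParam"]])
}

js <- "javascript:callTabContent('p','P090644','','en','procurement','procurementId');"
tab_args <- parse_tab_call(js)
tab_url <- build_tab_url(tab_args)
print(tab_url)
```

The same two helpers should work for the other tabs on the project page, provided their links use the same function and argument order.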

Code to extract the relevant project description URLs follows:

projID <- "P090644"
projDetails <- paste0("http://www.worldbank.org/p2e/procurement.html?lang=en&projId=", projID)

library(XML)

# Parse the procurement tab and collect every href inside the contract awards table
pdData <- htmlParse(projDetails)
pdDescriptions <- xpathSApply(pdData, '//*/table[@id="contractawards"]//*/@href')

# > pdDescriptions
#                                                                  href
# "http://search.worldbank.org/wcontractawards/procdetails/OP00005718"
#                                                                  href
# "http://search.worldbank.org/wcontractawards/procdetails/OP00005702"
#                                                                  href
# "http://search.worldbank.org/wcontractawards/procdetails/OP00005709"
#                                                                  href
# "http://search.worldbank.org/wcontractawards/procdetails/OP00005715"

Note that the page also provides Excel links, which may be of use to you. They may contain the data you intend to scrape from the description links:

# URLs of the three Excel exports for this project
procNotice <- paste0("http://search.worldbank.org/wprocnotices/projectdetails/", projID, ".xls")
conAward <- paste0("http://search.worldbank.org/wcontractawards/projectdetails/", projID, ".xls")
conData <- paste0("http://search.worldbank.org/wcontractdata/projectdetails/", projID, ".xls")

library(gdata)  # read.xls() needs a working Perl installation

# Read each export into a data frame
pnData <- read.xls(procNotice)
caData <- read.xls(conAward)
cdData <- read.xls(conData)
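If you need the same three files for more than one project, the URL construction can be wrapped in a small function. This is only a sketch: xls_urls is a hypothetical helper name, and it assumes every project follows the same URL pattern as P090644.

```r
# Hypothetical helper: build the three Excel export URLs for one project ID.
xls_urls <- function(projID) {
  base <- "http://search.worldbank.org/"
  kinds <- c(procNotice = "wprocnotices",
             conAward   = "wcontractawards",
             conData    = "wcontractdata")
  # Returns a named character vector, one URL per export type
  vapply(kinds,
         function(k) paste0(base, k, "/projectdetails/", projID, ".xls"),
         character(1))
}

urls <- xls_urls("P090644")
print(urls)

# Reading them requires gdata and network access, so it is left commented out:
# xlsData <- lapply(urls, gdata::read.xls)
```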

UPDATE:

To find what is being posted, we can examine what happens when the JavaScript function is called. Using Firebug or a similar tool, we can intercept the request, whose header starts:

POST /p2e/procurement.html HTTP/1.1
Host: www.worldbank.org

and has parameters:

lang=en
projId=P090644

Alternatively, we can examine the JavaScript at http://siteresources.worldbank.org/cached/extapps/cver116/p2e/js/script.js and look at the function callTabContent:

function callTabContent(tabparam, projIdParam, contextPath, langCd, htmlId, anchorTagId) {
    if (tabparam == 'n' || tabparam == 'h') {
        $.ajax( {
            type : "POST",
            url : contextPath + "/p2e/"+htmlId+".html",
            data : "projId=" + projIdParam + "&lang=" + langCd,
            success : function(msg) {
                if(tabparam=="n"){
                    $("#newsfeed").replaceWith(msg);
                } else{
                    $("#cycle").replaceWith(msg);
                }
                stickNotes();
            }
        });
    } else {
        $.ajax( {
            type : "POST",
            url : contextPath + "/p2e/"+htmlId+".html",
            data : "projId=" + projIdParam + "&lang=" + langCd,
            success : function(msg) {
                $("#tabContent").replaceWith(msg);
                $('#map_container').hide();
                changeAlternateColors();
                $("#tab_menu a").removeClass("selected");
                $('#'+anchorTagId).addClass("selected");                
                stickNotes();
            }
        });
    }
}

Examining the function, we can see that it simply POSTs the relevant parameters to the form URL and then updates the web page with the response.
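The POST itself can also be reproduced from R rather than relying on the GET shortcut. The following is a sketch only: it assumes the httr and XML packages are installed, that the endpoint still accepts form-encoded projId and lang fields as in the intercepted request, and the helper names tab_form_body and fetch_tab are hypothetical.

```r
# Hypothetical helper: the form fields sent by callTabContent
tab_form_body <- function(projId, lang = "en") {
  list(projId = projId, lang = lang)
}

# Hypothetical helper: POST the form and parse the returned HTML fragment.
# Assumes httr and XML are installed; nothing is requested until this is called.
fetch_tab <- function(projId, lang = "en", htmlId = "procurement",
                      base = "http://www.worldbank.org") {
  url <- paste0(base, "/p2e/", htmlId, ".html")
  resp <- httr::POST(url, body = tab_form_body(projId, lang), encode = "form")
  httr::stop_for_status(resp)  # fail loudly on HTTP errors
  XML::htmlParse(httr::content(resp, as = "text", encoding = "UTF-8"),
                 asText = TRUE)
}

# Example call (needs network access, so it is left commented out):
# pdData <- fetch_tab("P090644")
```

The parsed document can then be queried with xpathSApply in the same way as the result of htmlParse on the GET URL.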

user1609452
  • Could you edit your answer to explain how you figured out how to replicate the form in HTML? I can see that being useful in the future. – MrT Mar 12 '13 at 14:29
-1

I am not sure I have understood every detail of your problem, but I do know that CasperJS works well for JavaScript-generated content.

You can have a look at it here: http://casperjs.org/

It's written in JavaScript and has a number of useful functions, which are well documented at the link above.

I used it recently for a personal project, and it can be set up easily with a few lines of code.

Give it a go! Hope that helps.