
I'm trying to scrape a poorly designed governmental website which uses POST requests triggered from JavaScript for navigation (I'm trying to navigate the calendar).

I'm trying to do this the elegant way, with jsdom and jQuery in node (and possibly jsdom-simulant), but I'm not sure I understand how I'm supposed to fire the events within the simulator, given that the events themselves are supposed to go back to jsdom and trigger a new HTTP POST request.

I don't expect you guys to write the code for me; I only need a couple of pointers on the structure, the philosophy, or an existing code base which does something similar.

Bogdan Stăncescu

1 Answer


Regarding the scraping part: the navigation is a POST request that sends form-URL-encoded data. Two fields seem to be necessary in the payload:

  • __EVENTTARGET=ctl00$B_Center$VoturiPlen1$calVOT
  • __EVENTARGUMENT=XXXX (with XXXX some value)
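
To make the encoding concrete, here is a minimal sketch that builds this payload in Node with the built-in URLSearchParams. The helper name buildPayload is hypothetical, and 6668 is only a sample argument value; the field names and the target are the ones listed above.

```javascript
// Hypothetical helper: builds the form-URL-encoded body for the calendar postback.
// The field names and the __EVENTTARGET value come from the observed request.
function buildPayload(index) {
  const params = new URLSearchParams({
    __EVENTTARGET: 'ctl00$B_Center$VoturiPlen1$calVOT',
    __EVENTARGUMENT: String(index),
  });
  // toString() percent-encodes reserved characters, so "$" becomes "%24".
  return params.toString();
}

console.log(buildPayload(6668));
// __EVENTTARGET=ctl00%24B_Center%24VoturiPlen1%24calVOT&__EVENTARGUMENT=6668
```

The resulting string is what would be sent as the request body with a Content-Type of application/x-www-form-urlencoded.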

The __EVENTARGUMENT value increments by one each day. For instance, on 04/04/2018 it is 6668, and on 05/04/2018 it would be 6669. For the oldest available date, 01/01/1998, the index is -730, so the index can be calculated as the number of days between 01/01/1998 and the target date, minus 730 (equivalently, the number of days since 01/01/2000).
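
As a worked check of that rule, here is a small dependency-free sketch. The function name calendarIndex is hypothetical, and it assumes the "days since 01/01/1998, minus 730" relationship holds for every date, which was only inferred from a few observed values.

```javascript
// Milliseconds in one day; working in UTC avoids daylight-saving drift.
const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Hypothetical helper: maps a calendar date to the __EVENTARGUMENT value,
// assuming the "days since 1998-01-01, minus 730" rule inferred from the site.
function calendarIndex(year, month, day) {
  const origin = Date.UTC(1998, 0, 1);            // months are 0-based in Date.UTC
  const target = Date.UTC(year, month - 1, day);
  return Math.round((target - origin) / MS_PER_DAY) - 730;
}

console.log(calendarIndex(2018, 4, 4)); // 6668, the value observed for 04/04/2018
console.log(calendarIndex(1998, 1, 1)); // -730, the value observed for 01/01/1998
```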

Using curl and dateutils in a shell:

target_date="2018-04-04"
index=$(($(dateutils.ddiff 1998-01-01 "$target_date") - 730))

curl 'https://www.senat.ro/Voturiplen.aspx' \
     -H 'User-Agent: Mozilla' \
     -H 'Content-Type: application/x-www-form-urlencoded; charset=UTF-8' \
     --data "__EVENTTARGET=ctl00%24B_Center%24VoturiPlen1%24calVOT&__EVENTARGUMENT=$index"

And piping the response through the pup HTML parser to extract the results table:

curl 'https://www.senat.ro/Voturiplen.aspx' \
     -H 'User-Agent: Mozilla' \
     -H 'Content-Type: application/x-www-form-urlencoded; charset=UTF-8' \
     --data "__EVENTTARGET=ctl00%24B_Center%24VoturiPlen1%24calVOT&__EVENTARGUMENT=$index" | \
     pup 'table#ctl00_B_Center_VoturiPlen1_GridVoturi'

In Node.js you can use request, moment and jsdom:

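const request = require('request');
const moment = require('moment');
const { JSDOM } = require('jsdom');

// Calendar index: days between 01/01/1998 and the target date, minus 730
const target = moment('21/12/2017', 'DD/MM/YYYY');
const origin = moment('01/01/1998', 'DD/MM/YYYY');
const index = target.diff(origin, 'days') - 730;

request.post({
    url: 'https://www.senat.ro/Voturiplen.aspx',
    // "form" sends the fields as application/x-www-form-urlencoded
    form: {
        "__EVENTTARGET": "ctl00$B_Center$VoturiPlen1$calVOT",
        "__EVENTARGUMENT": index
    },
    headers: {
        'User-Agent': 'Mozilla'
    }
},
function(err, httpResponse, body) {
    if (err) {
        console.error(err);
        return;
    }
    // Parse the returned page and pick out the results table
    const dom = new JSDOM(body);
    const table = dom.window.document.querySelector("#ctl00_B_Center_VoturiPlen1_GridVoturi");
    // Guard against a response that does not contain the table
    console.log(table ? table.textContent : 'results table not found');
});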

Check this post for computing the date difference with moment.

Bertrand Martel
  • Wow, I'm totally impressed, you really dived into this thing! Thank you, that's an amazing answer! I was kind of hoping there would be a reasonably easy way to put all the bricks together and get essentially a headless browser, because that would've allowed me to scrape any web page, regardless of its Byzantine navigation logic; if I find no other solution, I will have to go along with a solution similar to what you described. Thank you again for taking the time to dig so deep into this! – Bogdan Stăncescu Jul 20 '18 at 10:03
  • I'm not sure, but I think using DOM manipulation (simulating clicks on links) would be more complex here, since you have to click on the year and on the month before clicking the day (unless you are already in the right month and year). So you would have to deal with state (e.g. click the year link, wait for the page load, click the month link, wait for the page load, click the day link and finally parse the result table), which complicates the algorithm. Note that I may be wrong, these are just some thoughts I had this afternoon – Bertrand Martel Jul 20 '18 at 17:16
  • I ended up implementing your solution in C#, where I'm more comfortable these days – I was planning on using node because of its tentative ability to execute javascript natively, but since that fell through... For the record, the entire approach has to be stateful anyway, because the implementation in ASP.Net is totally moronic, it uses the very fragile (and unRESTful) [VIEWSTATE](https://www.google.com/search?q=asp.net+viewstate) variable to describe the entire state of the interface, including pagination. [AngleSharp](https://anglesharp.github.io/) to the rescue. – Bogdan Stăncescu Jul 21 '18 at 16:39
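
For anyone staying in Node, here is a rough sketch of the stateful round trip described in the last comment: collect the hidden inputs from the previous response and send them back with the next postback. The helper names extractHiddenFields and buildPostback are hypothetical. __VIEWSTATE and __EVENTVALIDATION are the usual ASP.NET WebForms hidden fields, but which of them this particular site requires is an assumption, and the regex-based parsing is only meant for simple, well-formed markup.

```javascript
// Hypothetical helper: collects name/value pairs of <input type="hidden"> tags.
// Regex parsing is a shortcut for this sketch; it does not decode HTML entities.
function extractHiddenFields(html) {
  const fields = {};
  for (const [tag] of html.matchAll(/<input\b[^>]*>/gi)) {
    if (!/\btype\s*=\s*["']hidden["']/i.test(tag)) continue; // skip visible inputs
    const name = /\bname\s*=\s*["']([^"']*)["']/i.exec(tag);
    const value = /\bvalue\s*=\s*["']([^"']*)["']/i.exec(tag);
    if (name) fields[name[1]] = value ? value[1] : '';
  }
  return fields;
}

// Hypothetical helper: builds the next postback body from the previous page,
// echoing the hidden state back and setting the event target and argument.
function buildPostback(previousHtml, eventTarget, eventArgument) {
  const fields = extractHiddenFields(previousHtml);
  fields.__EVENTTARGET = eventTarget;
  fields.__EVENTARGUMENT = String(eventArgument);
  return new URLSearchParams(fields).toString();
}

// Usage with a made-up page fragment (not taken from the real site):
const sample = '<input type="hidden" name="__VIEWSTATE" value="abc123=" />';
console.log(buildPostback(sample, 'ctl00$B_Center$VoturiPlen1$calVOT', 6668));
```

Each response then becomes the previousHtml for the following request, which is what makes the approach stateful.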