
I need to fetch data programmatically from JSF sites.

Here's an example: https://dataminer.pjm.com/dataminerui/pages/public/lmp.jsf

To get data, enter any Start Date and End Date and click Export CSV at the top right. (It generates a fair amount of data, so pick a 1-day range.)

In the Network tab of Chrome, I see the following request headers and form data:

Request Headers
Accept:text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
Accept-Encoding:gzip, deflate
Accept-Language:en-US,en;q=0.8,ko;q=0.6,zh;q=0.4
Cache-Control:max-age=0
Connection:keep-alive
Content-Length:425
Content-Type:application/x-www-form-urlencoded
Cookie:JSESSIONID=gixQBXBESRofyqLpiH2hlYg8; dataminer=1369707692.36895.0000; __utma=109610308.1662709339.1456530705.1456530705.1456530705.1; __utmc=109610308; __utmz=109610308.1456530705.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none); JSESSIONID=8sx6CTIQhpPAAO5+4xcGGGlb; WT_FPC=id=xxx.xxx.xxx.xx-3069233008.30503152:lv=1456533141859:ss=1456530705581
Host:dataminer.pjm.com
Origin:https://dataminer.pjm.com
Referer:https://dataminer.pjm.com/dataminerui/pages/public/lmp.jsf
Upgrade-Insecure-Requests:1
User-Agent:Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.86 Safari/537.36

Form Data
frmCriteria:frmCriteria
frmCriteria:calStartDate_input:01/01/2016
frmCriteria:calStopDate_input:01/02/2016
frmCriteria:mnuMarket_input:REALTIME
frmCriteria:mnuMarket_focus:
frmCriteria:mnuFreq_input:Daily
frmCriteria:mnuFreq_focus:
frmCriteria:mnuPnodes_input:All
frmCriteria:mnuPnodes_focus:
javax.faces.ViewState:8578362602192686517:-1021667131748875106
frmCriteria:j_idt78:frmCriteria:j_idt78

I see all my form data in this request. It seems like I should be able to programmatically download this CSV by submitting the right request (using Python's requests library).

I've tried lots of ways of regenerating this header and form data, but can't seem to produce the CSV download.

Edit: I've tried the following. I know very little about the structure of HTTP requests and responses, and cookies, so this could be comically bad. I get a 500 on the POST.

import requests


headers = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Encoding': 'gzip, deflate',
    'Accept-Language': 'en-US,en;q=0.8,ko;q=0.6,zh;q=0.4',
    'Cache-Control': 'max-age=0',
    'Connection': 'keep-alive',
    'Content-Length': 425,
    'Content-Type': 'application/x-www-form-urlencoded',
    'Host': 'dataminer.pjm.com',
    'Origin': 'https://dataminer.pjm.com',
    'Referer': 'https://dataminer.pjm.com/dataminerui/pages/public/lmp.jsf',
    'Upgrade-Insecure-Requests': 1,
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.86 Safari/537.36'
}


data = {
    'frmCriteria': 'frmCriteria',
    'frmCriteria': 'calStartDate_input:01/01/2016',
    'frmCriteria': 'calStopDate_input:01/02/2016',
    'frmCriteria': 'mnuMarket_input:REALTIME',
    'frmCriteria': 'mnuMarket_focus:',
    'frmCriteria': 'mnuFreq_input:Daily',
    'frmCriteria': 'mnuFreq_focus:',
    'frmCriteria': 'mnuPnodes_input:All',
    'frmCriteria': 'mnuPnodes_focus:',
    'javax.faces.ViewState': '8578362602192686517:-1021667131748875106',
    'frmCriteria:j_idt78': 'frmCriteria:j_idt78'
}


url = 'https://dataminer.pjm.com/dataminerui/pages/public/lmp.jsf'


with requests.Session() as s:
    get_response = s.get(url)
    post_response = s.post(url, headers=headers, data=data)

How can I use the requests library to fetch the CSV?

capitalistcuttle
  • @KlausD. Just added my code. It's a very simplistic attempt. – capitalistcuttle Feb 27 '16 at 04:02
  • The duplicate answers the technical problem (just maintain the HTTP session and don't hardcode IDs and `ViewState`; they are not reusable across requests/sessions). As for the functional requirement, you'd better ask the site owner/admin whether a web service API is available for the task you're trying to accomplish. A decent Java EE website offers, next to JSF for the HTML frontend, also JAX-RS for a REST frontend. – BalusC Feb 27 '16 at 09:20

1 Answer


You may well not be able to unless you go through everything that leads up to the page in question. JSF pages tend to store a lot of state within the web session, so you may not be able to simply POST some static payload (like you're doing) and expect it to work.

A perfect example is that ViewState parameter. That value may very well change every single time, so the value you're using could be completely invalid.
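
As a rough illustration of not hardcoding that value, here is a minimal sketch that reads the current ViewState out of a page's HTML with the standard library. It assumes the page renders the value as an `<input>` element named `javax.faces.ViewState` (the name shown in the question's form data); `ViewStateParser` and `extract_view_state` are hypothetical helper names, not part of any library.

```python
from html.parser import HTMLParser


class ViewStateParser(HTMLParser):
    """Remembers the value of the first <input name="javax.faces.ViewState">."""

    def __init__(self):
        super().__init__()
        self.view_state = None

    def handle_starttag(self, tag, attrs):
        # Ignore everything except input elements, and stop after the first hit.
        if tag != "input" or self.view_state is not None:
            return
        attributes = dict(attrs)
        if attributes.get("name") == "javax.faces.ViewState":
            self.view_state = attributes.get("value")


def extract_view_state(html_text):
    """Return the ViewState value found in html_text, or None if absent."""
    parser = ViewStateParser()
    parser.feed(html_text)
    return parser.view_state
```

The idea would be to call `extract_view_state` on the body of a fresh GET of the page and send that value back in the POST, rather than reusing one copied from the browser.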

So, instead of going straight to whatever request you're trying to do, you may well have to "walk the pages" that got you there.

Track all of the requests it takes to get there, see what changes from step to step and session to session, and see if you can work out the minimum number of steps (ideally just 1 or 2) to pull it off.
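
For this particular page, the shortest possible walk might be two steps: GET the page within a session, then POST the form with the ViewState that GET returned. The sketch below assumes that is enough, which is untested against the real site. The field names are copied from the form data in the question; the button ID `frmCriteria:j_idt78` is auto-generated by JSF and may differ between deployments. `build_form_data` and `export_csv` are hypothetical names, and the regular expression assumes the `name` attribute appears before `value` in the hidden input.

```python
import html
import re

# Assumption: the hidden input lists name="..." before value="...".
VIEW_STATE_RE = re.compile(
    r'name="javax\.faces\.ViewState"[^>]*?\bvalue="([^"]*)"'
)


def build_form_data(view_state, start_date, stop_date):
    """Build the POST body; each colon-containing name is a single key."""
    return {
        "frmCriteria": "frmCriteria",
        "frmCriteria:calStartDate_input": start_date,
        "frmCriteria:calStopDate_input": stop_date,
        "frmCriteria:mnuMarket_input": "REALTIME",
        "frmCriteria:mnuMarket_focus": "",
        "frmCriteria:mnuFreq_input": "Daily",
        "frmCriteria:mnuFreq_focus": "",
        "frmCriteria:mnuPnodes_input": "All",
        "frmCriteria:mnuPnodes_focus": "",
        "javax.faces.ViewState": view_state,
        # Auto-generated ID of the export button; may change over time.
        "frmCriteria:j_idt78": "frmCriteria:j_idt78",
    }


def export_csv(session, url, start_date, stop_date):
    """GET the page for a fresh ViewState, then POST the form in the same session."""
    page = session.get(url)
    page.raise_for_status()
    match = VIEW_STATE_RE.search(page.text)
    if match is None:
        raise ValueError("no javax.faces.ViewState field found in the page")
    view_state = html.unescape(match.group(1))
    response = session.post(
        url, data=build_form_data(view_state, start_date, stop_date)
    )
    response.raise_for_status()
    return response.content
```

With the requests library, `session` would be a `requests.Session()`, which carries the `JSESSIONID` cookie from the GET over to the POST and computes `Content-Length` and `Content-Type` itself, so neither needs to be set by hand. If the POST still fails, comparing it against the browser's request in the Network tab should show which step or field is missing.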

Will Hartung
  • Thanks, I'll try that. Does the way I'm reverse-engineering the header and form data look sane, though? Particularly things like 'frmCriteria:j_idt78': 'frmCriteria:j_idt78', where there's a colon inside the field name. – capitalistcuttle Feb 27 '16 at 04:27