2

This is my first time on stackexchange, facing a problem I haven't been able to find a solution to with numerous hours spent on Google. Apologies in advance for the lack of more images, stackexchange will not allow me to more photos until my reputation is higher.

I need to get results from a .do page:

http://webapps2.rrc.state.tx.us/EWA/productionQueryAction.do

The url remains static when a large query e.g

https://i.stack.imgur.com/eYrkA.png

is submitted - I cannot alter my query by changing the url.

Once submitted, it takes a long time to load, and after loading, a table of results is presented.

I then require all the data to be on one page so I select 'View All' as the page size from the drop down menu on the top left of the table to obtain the following page (which has many more rows than displayed here)

https://i.stack.imgur.com/SCCi6.png

Finally, to then import this data, I open a spreadsheet and copy-paste it all in.

Even for one county/query/submission, this process is extremely tedious, and I have to do this monthly for about 20 counties, then a further 10 districts. Due to the long load time associated with the return of each query, this is a task that easily takes 2-3 hours to complete.

As such, I am hoping someone here has some kind of solution to automate/ease the process, or even simply advice as to where I can find information to create a solution myself. If I could directly query the database import the data into a spreadsheet, my problem would be solved. The load times are irrelevant if the entire process of querying the server can be automated as it could just run in the background without user input.

I apologise for the long and non-specific question, and any response will be greatly appreciated.

Thank you very much.

Born
  • 93
  • 3
  • 9
  • create cron job that is run via cli to get the details and save it to an excel monthly – Anusha Dec 17 '13 at 11:10
  • example of [cronjob](http://www.thesitewizard.com/general/set-cron-job.shtml), [export to excel](http://stackoverflow.com/questions/15699301/export-mysql-data-to-excel-in-php) – Anusha Dec 17 '13 at 11:14
  • @Anusha: Thanks for the advice and links. Do you think such a cron job would work for a javascript/dojo source? – Born Dec 19 '13 at 10:16
  • did you try to contact the site owner and ask for a public API? another option may be to try automation via http://docs.seleniumhq.org/ - it may be difficult though because of the AJAX requests – cypherabe Dec 19 '13 at 10:41

2 Answers2

1

Well, you can actually check the network calls that are executed in order to get the CSV file. I just did that for you, but it contains a lot of parameters which you need to send in. Below are the actual parameters you need to send in:

searchArgs.orderByColumnName=
&searchArgs.initialViewArgHndlr.inputValue=County
&searchArgs.startMonthArgHndlr.inputValue=01
&searchArgs.startYearArgHndlr.inputValue=1993
&searchArgs.endMonthArgHndlr.inputValue=01
&searchArgs.endYearArgHndlr.inputValue=2013
&searchArgs.oilOrGasArgHndlr.inputValue=
&searchArgs.gasWellNoHndlr.inputValue=
&searchArgs.searchTypeHndlr.inputValue=production
&searchArgs.viewTypeHndlr.inputValue=monthlyTotals
&searchArgs.activeTabsFlagwordHndlr.inputValue=11
&searchArgs.orderByHndlr.inputValue=
&searchArgs.leaseTypeArgHndlr.inputValue=O
&searchArgs.districtCodeArgHndlr.inputValue=
&searchArgs.leaseNumberArgHndlr.inputValue=
&searchArgs.fieldNumbersArgHndlr.inputValue=
&searchArgs.fieldNamesHndlr.inputValue=
&searchArgs.operatorNumbersArgHndlr.inputValue=
&searchArgs.operatorNamesHndlr.inputValue=
&searchArgs.onShoreCountyCodeArgHndlr.inputValue=001
&searchArgs.offShoreCountyCodeArgHndlr.inputValue=
&searchArgs.leaseNameArgHndlr.inputValue=
&searchArgs.geoRegionOptionHndlr.inputValue=county
&searchArgs.statewideFlagHndlr.inputValue=
&methodToCall=generateProductionQueryReportCsv
&actionManager.recordCountHndlr.inputValue=2
&actionManager.currentIndexHndlr.inputValue=1
&actionManager.actionRcrd%5B0%5D.actionDisplayNmHndlr.inputValue=Search+Criteria
&actionManager.actionRcrd%5B0%5D.hostHndlr.inputValue=webapps2.rrc.state.tx.us%3A80
&actionManager.actionRcrd%5B0%5D.contextPathHndlr.inputValue=%2FEWA
&actionManager.actionRcrd%5B0%5D.actionHndlr.inputValue=%2FproductionQueryAction.do
&actionManager.actionRcrd%5B0%5D.actionParameterHndlr.inputValue=methodToCall
&actionManager.actionRcrd%5B0%5D.actionMethodHndlr.inputValue=unspecified
&actionManager.actionRcrd%5B0%5D.pagerParameterKeyHndlr.inputValue=
&actionManager.actionRcrd%5B0%5D.actionParametersHndlr.inputValue=
&actionManager.actionRcrd%5B0%5D.returnIndexHndlr.inputValue=0
&actionManager.actionRcrd%5B0%5D.argRcrdParameters%28searchArgs.paramValue%29=%7C1%3DCounty%7C2%3D01%7C3%3D1993%7C4%3D01%7C5%3D2013%7C8%3Dproduction%7C9%3DmonthlyTotals%7C10%3D11%7C101%3DO%7C201%3D001%7C204%3Dcounty
&actionManager.actionRcrd%5B1%5D.actionDisplayNmHndlr.inputValue=County%3A+ANDERSON
&actionManager.actionRcrd%5B1%5D.hostHndlr.inputValue=webapps2.rrc.state.tx.us%3A80
&actionManager.actionRcrd%5B1%5D.contextPathHndlr.inputValue=%2FEWA
&actionManager.actionRcrd%5B1%5D.actionHndlr.inputValue=%2FproductionQueryAction.do
&actionManager.actionRcrd%5B1%5D.actionParameterHndlr.inputValue=methodToCall
&actionManager.actionRcrd%5B1%5D.actionMethodHndlr.inputValue=search
&actionManager.actionRcrd%5B1%5D.pagerParameterKeyHndlr.inputValue=prodPager.paramValue
&actionManager.actionRcrd%5B1%5D.actionParametersHndlr.inputValue=
&actionManager.actionRcrd%5B1%5D.returnIndexHndlr.inputValue=0
&actionManager.actionRcrd%5B1%5D.argRcrdParameters%28prodPager.paramValue%29=%7C1%3D1%7C2%3D-1%7C3%3D241%7C4%3D0%7C5%3D0%7C6%3D10
&actionManager.actionRcrd%5B1%5D.argRcrdParameters%28searchArgs.paramValue%29=%7C1%3DCounty%7C2%3D01%7C3%3D1993%7C4%3D01%7C5%3D2013%7C8%3Dproduction%7C9%3DmonthlyTotals%7C10%3D11%7C101%3DO%7C201%3D001%7C204%3Dcounty
&searchArgs.startMonthArg=01
&searchArgs.startYearArg=1993
&searchArgs.endMonthArg=01
&searchArgs.endYearArg=2013
&pager.pageSize=-1

I hope for you that only the parameters prefixed with searchArgs are important (they contain the information of the form, like the month/year/....

The actionManager prefixed parameters contain some complicated data, you need to define them (else you get an error), but I don't know if they're actually used, so you might just leave them empty.

When executing this call, you will imediately receive the CSV file which you can probably parse using any language (look for a CSV parsing library) and insert it into a database or something else.


However, the best possibility is to look for an alternative to this source (or a public API you can access), because this is really crazy to work with, and it might change.


Example using PHP

The following code is an example in PHP retrieving the CSV (using cURL) and parsing it into an array.

<?php

/** The data */
$minMonth = '01';
$minYear = '1993';
$maxMonth = '01';
$maxYear = '2013';
$view = 'County';
$region = 'county';
$location = '001'; /** This is the county code of ANDERSON */

$ch = curl_init();
/** URL */
curl_setopt($ch, CURLOPT_URL, "http://webapps2.rrc.state.tx.us/EWA/productionQueryAction.do");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_POST, true);

/** All these form parameters */
$data = array(
    'searchArgs.orderByColumnName' => '',
    'searchArgs.initialViewArgHndlr.inputValue' => $view,
    'searchArgs.startMonthArgHndlr.inputValue' => $minMonth,
    'searchArgs.startYearArgHndlr.inputValue' => $minYear,
    'searchArgs.endMonthArgHndlr.inputValue' => $maxMonth,
    'searchArgs.endYearArgHndlr.inputValue' => $maxYear,
    'searchArgs.oilOrGasArgHndlr.inputValue' => '',
    'searchArgs.gasWellNoHndlr.inputValue' => '',
    'searchArgs.searchTypeHndlr.inputValue' => 'production',
    'searchArgs.viewTypeHndlr.inputValue' => 'monthlyTotals',
    'searchArgs.activeTabsFlagwordHndlr.inputValue' => '11',
    'searchArgs.orderByHndlr.inputValue' => '',
    'searchArgs.leaseTypeArgHndlr.inputValue' => 'O',
    'searchArgs.districtCodeArgHndlr.inputValue' => '',
    'searchArgs.leaseNumberArgHndlr.inputValue' => '',
    'searchArgs.fieldNumbersArgHndlr.inputValue' => '',
    'searchArgs.fieldNamesHndlr.inputValue' => '',
    'searchArgs.operatorNumbersArgHndlr.inputValue' => '',
    'searchArgs.operatorNamesHndlr.inputValue' => '',
    'searchArgs.onShoreCountyCodeArgHndlr.inputValue' => $location,
    'searchArgs.offShoreCountyCodeArgHndlr.inputValue' => '',
    'searchArgs.leaseNameArgHndlr.inputValue' => '',
    'searchArgs.geoRegionOptionHndlr.inputValue' => $region,
    'searchArgs.statewideFlagHndlr.inputValue' => '',
    'methodToCall' => 'generateProductionQueryReportCsv',
    'actionManager.recordCountHndlr.inputValue' => '2',
    'actionManager.currentIndexHndlr.inputValue' => '1',
    'actionManager.actionRcrd%5B0%5D.actionDisplayNmHndlr.inputValue' => 'Search+Criteria',
    'actionManager.actionRcrd%5B0%5D.hostHndlr.inputValue' => 'webapps2.rrc.state.tx.us%3A80',
    'actionManager.actionRcrd%5B0%5D.contextPathHndlr.inputValue' => '%2FEWA',
    'actionManager.actionRcrd%5B0%5D.actionHndlr.inputValue' => '%2FproductionQueryAction.do',
    'actionManager.actionRcrd%5B0%5D.actionParameterHndlr.inputValue' => 'methodToCall',
    'actionManager.actionRcrd%5B0%5D.actionMethodHndlr.inputValue' => 'unspecified',
    'actionManager.actionRcrd%5B0%5D.pagerParameterKeyHndlr.inputValue' => '',
    'actionManager.actionRcrd%5B0%5D.actionParametersHndlr.inputValue' => '',
    'actionManager.actionRcrd%5B0%5D.returnIndexHndlr.inputValue' => '0',
    'actionManager.actionRcrd%5B0%5D.argRcrdParameters%28searchArgs.paramValue%29' => '%7C1%3DCounty%7C2%3D01%7C3%3D1993%7C4%3D01%7C5%3D2013%7C8%3Dproduction%7C9%3DmonthlyTotals%7C10%3D11%7C101%3DO%7C201%3D001%7C204%3Dcounty',
    'actionManager.actionRcrd%5B1%5D.actionDisplayNmHndlr.inputValue' => 'County%3A+ANDERSON',
    'actionManager.actionRcrd%5B1%5D.hostHndlr.inputValue' => 'webapps2.rrc.state.tx.us%3A80',
    'actionManager.actionRcrd%5B1%5D.contextPathHndlr.inputValue' => '%2FEWA',
    'actionManager.actionRcrd%5B1%5D.actionHndlr.inputValue' => '%2FproductionQueryAction.do',
    'actionManager.actionRcrd%5B1%5D.actionParameterHndlr.inputValue' => 'methodToCall',
    'actionManager.actionRcrd%5B1%5D.actionMethodHndlr.inputValue' => 'search',
    'actionManager.actionRcrd%5B1%5D.pagerParameterKeyHndlr.inputValue' => 'prodPager.paramValue',
    'actionManager.actionRcrd%5B1%5D.actionParametersHndlr.inputValue' => '',
    'actionManager.actionRcrd%5B1%5D.returnIndexHndlr.inputValue' => '0',
    'actionManager.actionRcrd%5B1%5D.argRcrdParameters%28prodPager.paramValue%29' => '%7C1%3D1%7C2%3D-1%7C3%3D241%7C4%3D0%7C5%3D0%7C6%3D10',
    'actionManager.actionRcrd%5B1%5D.argRcrdParameters%28searchArgs.paramValue%29' => '%7C1%3DCounty%7C2%3D01%7C3%3D1993%7C4%3D01%7C5%3D2013%7C8%3Dproduction%7C9%3DmonthlyTotals%7C10%3D11%7C101%3DO%7C201%3D001%7C204%3Dcounty',
    'searchArgs.startMonthArg' => $minMonth,
    'searchArgs.startYearArg' => $minYear,
    'searchArgs.endMonthArg' => $maxMonth,
    'searchArgs.endYearArg' => $maxYear,
    'pager.pageSize' => '-1'
);

curl_setopt($ch, CURLOPT_POSTFIELDS, $data);
$output = curl_exec($ch);
$info = curl_getinfo($ch);
curl_close($ch);

/** Convert CSV string to array */
$data = str_getcsv($output);

/** Print array */
print_r($data);

?>
g00glen00b
  • 41,995
  • 13
  • 95
  • 133
  • This looks very promising, thank you for doing this. Sorry to be a pain here but could you possibly explain how to go about executing such a network call on the website, or suggest somewhere I can learn about it? – Born Dec 19 '13 at 23:53
  • Depends in which language you want to write it... . I mean, importing the stuff into an Excel sheet is not something you should do with Dojo I think (unless you're using Dojo in combination with Node.js maybe). – g00glen00b Dec 20 '13 at 07:17
  • I'd prefer to write it in php, I think. I just need to generate a csv file, as you said. I can parse it easily or I can feed it into vba and collate it that way. But I don't know how to perform a network call like you have done here – Born Dec 20 '13 at 09:22
  • You can use the cURL module of PHP for that. For example: http://www.electrictoolbox.com/php-curl-form-post/. The `$data` array in that example must contain the parameters I listed in my answer. – g00glen00b Dec 20 '13 at 09:25
  • Thats perfect, thank you very much, you have saved me many hours! – Born Dec 22 '13 at 20:44
  • Could you explain how you checked the network calls that were executed to get the csv? I want to replicate this solution for a different database I need to scrape – Born Feb 11 '14 at 14:56
  • By checking the source code and the network requests – g00glen00b Feb 11 '14 at 16:48
  • And you can check these network requests by using the developer tools of your browser, usually by pressing `F12`. – g00glen00b Feb 11 '14 at 19:53
  • I have been able to monitor the network calls and am trying to replicate them for different districts/counties but am unable to get the data I require. I am changing the URL's and other characters used in the calls to match the encoding you have used but am missing a few from the list you gave: `'searchArgs.startMonthArg' => $minMonth, 'searchArgs.startYearArg' => $minYear, 'searchArgs.endMonthArg' => $maxMonth, 'searchArgs.endYearArg' => $maxYear, 'pager.pageSize' => '-1'` I am able to find the rest in the network calls but replicating them does not get me the data. – Born Mar 10 '14 at 18:48
0

From what you are describing, this sounds like an issue with the Database taking the time to generate the results.

If this is the case, do you know if there are any indexes on the database being queried?

If you are querying data across multiple linked tables (which it looks & sounds like you are), without any indexes, then the queries will run like a dog.

This website offers a good introduction to indexes that may be worth taking a look at.

Other than this, it may be prevalent to setup a cron to automate the process for you like Anusha has described in the comments.

I would recommend looking into indexes as well as setting up the cron jobs, as rather than just cut down the amount of time it takes for you to generate the data each month, you could fully automate it, and have it running in a few minutes, rather than hours each month.

Community
  • 1
  • 1
guyver4mk
  • 609
  • 6
  • 11
  • But the website/database I am querying is not my own, I have no access to their servers. Could cron job query the website in the way I have to and get to the final page ([link](http://i.stack.imgur.com/SCCi6.png)), then download the table? – Born Dec 17 '13 at 12:56
  • To be honest, I am not sure. I've taken a look at the source code that is displayed in a view source request from a browser and downloaded the .do file to see if there was anything that can be done from a php perspective, however it appears that the site is written in a mix of javascript and dojo which are really not my speciality :( the only thing I could suggest is possibly get in contact with the support on the website, and enquire if they are able to produce automated reports? Or if they have an API that can be used to query the site. I'm sorry I couldn't be of more assistance with this. – guyver4mk Dec 17 '13 at 23:24
  • No problem at all, thanks very much for making the effort to check through it, I appreciate it! I'll update the tags so perhaps someone with expertise in php and dojo comes across this. It's Christmas time now so very unlikely to hear back from the people who run the website – Born Dec 19 '13 at 09:53