
There is a PHP form which queries a massive database. The URL for the form is https://db.slickbox.net/venues.php. It takes up to 10 minutes after the form is sent for results to be returned, and the results are returned inline on the same page. I've tried using Requests, URLLib2, LXML, and Selenium but I cannot come up with a solution using any of these libraries. Does anyone know of a way to retrieve the page source of the results after submitting this form?

If you know of a solution for this, for the sake of testing just fill out the name field ("vname") with the name of any store/gas station that comes to mind. Ultimately, I need to also set the checkboxes with the "checked" attribute but that's a subsequent goal after I get this working. Thank you!
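For reference, here is roughly what I've been attempting with Requests: a plain POST with a generous timeout so the client keeps the connection open while the server builds the results. The "Search" submit value is a guess on my part, and the server may require other fields:

```python
import requests

FORM_URL = "https://db.slickbox.net/venues.php"

def build_payload(name):
    """Form fields for the search; 'Search' as the submit value is a guess."""
    return {"vname": name, "submit": "Search"}

def search_venues(name, timeout=1200):
    """POST the form and block up to `timeout` seconds for the reply."""
    response = requests.post(FORM_URL, data=build_payload(name), timeout=timeout)
    response.raise_for_status()
    return response.text
```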

DaJoNel
  • Did you try to change the timeout of your request? – Dekel Aug 04 '16 at 02:21
  • In the case of the Requests library, anyway, that's not how timeout works: it sets the maximum time it'll wait before raising an exception. The problem is that running the code with any of these libraries returns a result immediately, which it should not. – DaJoNel Aug 04 '16 at 02:41

2 Answers


I usually rely on cURL for this kind of thing. Instead of sending the form with the button to retrieve the source, call the response page directly, giving it your request. As I work in PHP it's quite easy to do this; with Python you would need pycURL to achieve the same thing.

So the only thing to do is to call venues.php with the right argument values, sent via the POST method with cURL.

This way you still need to prepare your request (country code, cat name), but you won't need to check the checkboxes or load the website page in your browser.

<?php
ini_set('max_execution_time', 1200); // wait up to 20 minutes before quitting

$ch = curl_init();

// set URL and other appropriate options
curl_setopt($ch, CURLOPT_URL, "https://db.slickbox.net/venues.php");
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
curl_setopt($ch, CURLOPT_TIMEOUT, 1200);        // let curl itself wait as long as the script

// prepare arguments for the form
$data = array(
    'adlock' => 1,
    'age'    => 0,
    'country'=> 145,
    'imgcnt' => 0,
    'lock'   => 0,
    'regex'  => 1,
    'submit' => 'Search',
    'vname'  => 'test'
);

// add arguments to our request
curl_setopt($ch, CURLOPT_POST, true);
curl_setopt($ch, CURLOPT_POSTFIELDS, $data);

// launch request
if (!$result = curl_exec($ch)) {
    trigger_error(curl_error($ch));
}
curl_close($ch);

echo $result;
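On the Python side, the same POST can be done without pycURL at all: the standard library's urllib.request covers it, so there is nothing extra to install. This is a sketch, with the field values copied from the $data array above; the field names themselves are assumptions about the form:

```python
import urllib.parse
import urllib.request

# Field names/values mirrored from the PHP $data array (assumed correct).
FORM_DATA = {
    "adlock": 1, "age": 0, "country": 145, "imgcnt": 0,
    "lock": 0, "regex": 1, "submit": "Search", "vname": "test",
}

def post_form(url="https://db.slickbox.net/venues.php", timeout=1200):
    """POST the form fields and return the response body as text."""
    body = urllib.parse.urlencode(FORM_DATA).encode()
    request = urllib.request.Request(url, data=body)  # data= makes it a POST
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")
```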
technico
  • I'm fine just using PHP for this, as I could build it into the web interface I'm planning to use, anyway. I'm not terribly familiar with Curl, though, so can you provide a more tangible example of code I would use? If not, that's fine and tomorrow I can do more research on it. Will Curl not care that it may have to wait 10 minutes for a response? Thanks for the tip! – DaJoNel Aug 04 '16 at 02:37
  • Added untested sample code. About the time limit: curl may be limited by PHP's max execution time; we can change it just for our script with the ini_set function. – technico Aug 04 '16 at 02:55
  • That worked; thank you very much! I've always appreciated PHP and this further solidifies how awesome it is. I'll definitely look into PycURL as well (easier automation using Python for my purposes), but this is a great start! – DaJoNel Aug 04 '16 at 03:36
  • You could also run PHP files as local scripts, calling them with the "php" command; in that case there is no max execution time limit to care about :) – technico Aug 04 '16 at 04:03

How about ghost?

from ghost import Ghost
ghost = Ghost()

with ghost.start() as session:
    page, extra_resources = session.open("https://db.slickbox.net/venues.php", wait_onload_event=True)
    session.set_field_value("input[name=vname]", "....")
    # Set any other fields the same way
    session.fire_on('form', 'submit')
    page, resources = session.wait_for_page_loaded()

    content = session.content  # or page.content, I forget which

Afterwards you can use BeautifulSoup to parse the HTML, or Ghost may have some rudimentary utilities for that.
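If BeautifulSoup isn't available, the standard library's html.parser is enough for simple extraction. A minimal sketch that pulls the text out of every `<td>` cell (the assumption that the results come back as a table is mine):

```python
from html.parser import HTMLParser

class CellExtractor(HTMLParser):
    """Collects the text content of every <td> cell in the fed HTML."""
    def __init__(self):
        super().__init__()
        self.cells = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self._in_cell = True
            self.cells.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.cells[-1] += data

parser = CellExtractor()
parser.feed("<table><tr><td>Shell</td><td>145</td></tr></table>")
print(parser.cells)  # ['Shell', '145']
```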

bhuvy