
I wrote a PHP script that makes an HTTP POST request using cURL and does the following:

  • Prepare the POST variables
  • Initialize cURL
  • Set the client cookie to use in the request
  • Set the POST variables as the query string
  • Set other cURL options
  • Execute cURL

Here is the code:

    $ch = curl_init($url);

    curl_setopt($ch, CURLOPT_COOKIE, "cookie=cookie");
    curl_setopt($ch, CURLOPT_POST, 1);
    curl_setopt($ch, CURLOPT_POSTFIELDS, $post_string);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
    curl_setopt($ch, CURLOPT_HEADER, 0);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);

    $response = curl_exec($ch);
    // this point
    extr($response, $param_1, $param_2);

The problem is that the response is sometimes larger than 1 GB, so the script blocks until the full response has been received (marked in the code as // this point). And if malformed HTML is received, PHP throws an error and the whole job has to start over from the beginning.

Here are the rest of the functions:

    function extr($string = '', $a, $b)
    {
        // Parse the HTML and pull the rows out of #myTableId
        $doc = new DOMDocument;
        @$doc->loadHTML($string);
        $table = $doc->getElementById('myTableId');

        if (is_object($table)) {
            foreach ($table->getElementsByTagName('tr') as $record) {
                $rec = array();
                foreach ($record->getElementsByTagName('td') as $data) {
                    $rec[] = $data->nodeValue;
                }
                if ($rec) {
                    put_data($rec);
                }
            }
        } else {
            echo 'Skipped: Param1: ' . $a . ' -- Param2: ' . $b . '<br>';
        }
    }

    function put_data($one = array())
    {
        // Append each record to the output file as one JSON line
        $one = json_encode($one) . "\n";
        file_put_contents("data.json", $one, FILE_APPEND);
    }

    ini_set('max_execution_time', 3000000);
    ini_set('memory_limit', '-1');

The alternatives I can think of are to process the data as it is received, if that is possible with cURL, or to continue a previous cURL request from where it left off; for the latter, something like the sketch below.
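A minimal sketch of the resume idea using CURLOPT_RESUME_FROM (I am not sure the server supports resumed transfers, and the output file name is just a placeholder):

    // Resume a transfer from the byte offset we already have on disk.
    // Only works if the server honors resumed transfers.
    $have = file_exists('response.html') ? filesize('response.html') : 0;
    $fh = fopen('response.html', 'a');

    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RESUME_FROM, $have); // skip the bytes we already saved
    curl_setopt($ch, CURLOPT_FILE, $fh);          // write the body straight to the file
    curl_exec($ch);

    curl_close($ch);
    fclose($fh);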

Is there any possible workaround for this?

Do I need to switch to a language other than PHP for this?

– viral
  • You could use byte serving and request only smaller chunks of the whole file (see the sketch after these comments), but since you're then loading those chunks into the DOM, you'd need the entire file anyway. DOM will barf if you feed it an incomplete document. – Marc B May 19 '15 at 19:48
  • What are you requesting that might exceed 1 GB? (Are you sure HTTP is the right protocol for that?) – CBroe May 19 '15 at 19:55
  • I am looking for security flaws in a web service and found one that leaks analytics data via HTTP POST; I just need to collect the data in a MySQL table. – viral May 19 '15 at 20:01
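For reference, a rough sketch of the byte-serving idea from the first comment, using CURLOPT_RANGE (this assumes the server answers Range requests with 206 Partial Content, which is not a given for a POST endpoint):

    // Request only the first 1 MB of the resource via an HTTP Range request.
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RANGE, '0-1048575'); // bytes 0..1048575
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    $chunk  = curl_exec($ch);
    $status = curl_getinfo($ch, CURLINFO_HTTP_CODE); // expect 206 if ranges are honored
    curl_close($ch);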

2 Answers


You can process the data in chunks as it comes in, using the CURLOPT_WRITEFUNCTION option with a callback:

    curl_setopt($ch, CURLOPT_WRITEFUNCTION, function($ch, $data) {
        echo "\n\nchunk received:\n", $data; // process your chunk here
        return strlen($data); // returning anything other than the number of bytes received aborts the transfer
    });

As was already mentioned in the comments, though, if the response is HTML that you are loading into DOMDocument, you will need the full data first anyway.
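As a sketch of how this could slot into the setup from the question, streaming the body to disk instead of holding it in memory (the output file name is a placeholder):

    // Stream the POST response straight to a file instead of buffering it in RAM.
    $fh = fopen('response.html', 'w');

    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_COOKIE, "cookie=cookie");
    curl_setopt($ch, CURLOPT_POST, 1);
    curl_setopt($ch, CURLOPT_POSTFIELDS, $post_string);
    curl_setopt($ch, CURLOPT_WRITEFUNCTION, function ($ch, $data) use ($fh) {
        return fwrite($fh, $data); // returns the number of bytes written
    });
    curl_exec($ch);

    fclose($fh);

For the plain write-to-file case, setting CURLOPT_FILE to an open file handle does the same thing without a callback.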

– lafor

You can do two things:

a) Use a SAX parser. A SAX parser is like a DOM parser, but it can deal with streaming input, whereas a DOM parser has to have the whole document or it will throw errors. The SAX parser just feeds you events to process.

What is the difference between SAX and DOM?

b) When using the SAX parser, pass it data incrementally using CURLOPT_WRITEFUNCTION, as sketched below. (I just saw that lafor also posted this, so upvoting that.)
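A minimal sketch of that combination using PHP's expat-based xml_parser_* functions; the handler bodies are placeholders. One caveat: expat needs well-formed XML/XHTML, so the malformed HTML the question mentions would still trip it up unless it is tidied first.

    // SAX-style incremental parsing: xml_parse() accepts the document piece by piece.
    $parser = xml_parser_create();

    xml_set_element_handler(
        $parser,
        function ($p, $tag, $attrs) { /* e.g. start a row buffer when a TR opens */ },
        function ($p, $tag) { /* e.g. flush the row buffer when a TR closes */ }
    );
    xml_set_character_data_handler($parser, function ($p, $text) {
        // e.g. collect cell text here
    });

    $ch = curl_init($url); // $url and $post_string as in the question
    curl_setopt($ch, CURLOPT_POST, 1);
    curl_setopt($ch, CURLOPT_POSTFIELDS, $post_string);
    curl_setopt($ch, CURLOPT_WRITEFUNCTION, function ($ch, $data) use ($parser) {
        xml_parse($parser, $data, false); // false = more chunks will follow
        return strlen($data);
    });
    curl_exec($ch);

    xml_parse($parser, '', true); // signal end of input
    xml_parser_free($parser);
    curl_close($ch);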

– Zak