
I use curl all the time at the command line, to request a URL and parse its markup.

I do this easily for authenticated pages by loading the URL in Chrome, opening the inspector, finding the URL at the top of the Network history, right-clicking it, and choosing Copy | Copy as cURL.

I'd like to do the same with a single-page application, which of course runs lots of other things to render itself, like JavaScript.

Are there any tools out there that will let me easily change "curl" to something else and have it download the generated source of the page?

e.g. normally I'd run this to get the source of the authenticated page if it weren't a single-page application (copied from Chrome):

curl 'https://mywebsite.com/singlePageApplication' \
  -H 'Connection: keep-alive' \
  -H 'Pragma: no-cache' \
  -H 'Cache-Control: no-cache' \
  -H 'Accept-Language: en,en-US;q=0.9' \
  -H 'Cookie: session=XXX'

I'd like to be able to just switch that to something else that takes all the same headers, preferably with exactly the same syntax as curl, and gives me the generated source:

downloadGeneratedSource 'https://mywebsite.com/singlePageApplication' \
  -H 'Connection: keep-alive' \
  -H 'Pragma: no-cache' \
  -H 'Cache-Control: no-cache' \
  -H 'Accept-Language: en,en-US;q=0.9' \
  -H 'Cookie: session=XXX'

Does this exist anywhere?

Brad Parks
    Probably the most widely used tools for this are Puppeteer and Selenium, not curl. – root May 26 '20 at 03:05
  • Thanks for the feedback! I know that curl can't do it, but I'm basically looking to see if anyone has written a script against something like Puppeteer, PhantomJS or Selenium to do this - take curl syntax for headers and request parameters, and get the generated source of a page. I'll take a crack at this myself if no one else has already done it – Brad Parks May 26 '20 at 10:37
  • kinda like ```console.log(document.body.parentNode.outerHTML);``` ? – hanshenrik May 26 '20 at 18:58
  • yeah - in a browser it'd be something like that. Like `body.innerHTML` ? – Brad Parks May 27 '20 at 12:26
  • Not sure whether https://chrome.google.com/webstore/detail/curlwget/jmocjfidanebdlinpbcdkcmgdifblncg is of any help? – B--rian May 27 '20 at 16:08

2 Answers


As root and Brad Parks pointed out in the comments, Selenium, PhantomJS and Puppeteer are tools designed to emulate the behavior of a browsing user, and thus let you download the source code of a single-page app (SPA) in an easily configurable manner.

On the other hand, you are right that cURL can do similar things if used in a script. In the early 2000s I used wget in combination with grep, awk, sed and perl to automate the regular download of access-controlled pages with dynamic URLs created by CGI scripts. That is indeed a scenario very comparable to today's SPAs.

I chose wget over curl because piping its output onward was easier, but such a script had to be tailored to the specific use case. If you are fluent in regular expressions, it was a job of a couple of minutes, since the target URLs followed a syntax I could look for - maybe you could do the same?
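As a rough sketch of what such a pipeline looked like (the HTML snippet and the `/item?id=` URL pattern below are made up for illustration; against a live site the `printf` would be replaced by something like `wget -qO- "$URL"`):

```shell
# Illustrative only: extract link targets from HTML, old-school pipeline style.
# For a real page, replace the printf with: wget -qO- "$URL"
html='<a href="/item?id=42">first</a> <a href="/item?id=43">second</a>'
printf '%s\n' "$html" \
  | grep -o 'href="[^"]*"' \
  | sed 's/^href="//; s/"$//'
# → /item?id=42
#   /item?id=43
```

Of course, this only sees the HTML as served; it cannot evaluate any JavaScript, which is exactly the limitation the question is about.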


B--rian
  • @BradParks If you elaborate a bit on the specific problem I could help with RegEx, awk, etc. – B--rian May 30 '20 at 13:51
  • The problem is that curl itself won't work, as the page in question does some JavaScript onload work that waits for the DOM to be ready and then makes additional dynamic changes to the page, which I'll have to detect using JavaScript. So though I like curl's syntax, it will ultimately have to be a tool like Selenium, PhantomJS or Puppeteer that solves the problem, I expect. – Brad Parks Jun 01 '20 at 10:42
  • JavaScript inside the pages is indeed a problem for curl, wget, lynx etc. Other than the above-mentioned tools, I can also recommend https://www.crummy.com/software/BeautifulSoup/ - that is the Python approach... – B--rian Jun 01 '20 at 10:46

I googled around and found a similar implementation in PhantomJS, and tried to modify it to fit this use case, though it didn't seem to work. I can't seem to find the gist I based this on ;-( but am throwing it up here as at least a crack at the solution ;-)

Side note: while doing more googling, I just found this Python WebDriver approach that could also work

downloadGeneratedSource

var argIs, getArg, v, d;
var customHeaders = {};

// grab the "rendered" HTML of a JavaScript-requiring web page

// TBD:
// add '-' as magic filename to read from STDIN
// add more curl-like switches? --or just let curl do that and consume the output of curl?
// add a switch for page.render( URLdump); // "screenshot"

var system = require('system'); // var args = require('system').args;
var page = require('webpage').create();

if (system.args.length === 1) {
    console.log('Usage: curl-phantom.js <http://URL/path/file.ext>');
    console.log(system.args);
    // note: can also read "pages" from the local filesystem
    phantom.exit();
};

var URLarg=system.args[1];
var theStatusCode = null;
var theStatusPrev = null;
var thePrevURL    = ''  ;
var theCurrURL    = ''  ;
var timestamp     = Date.now();
var verbose       = false;
var debug         = true;
var full_page     = false;
var header_key    = 'X-Forwarded-For';
var header_val    = '3.1.20.13';
var requestTimeout= 5000;   // Default request timeout

// does arg i start with the given switch (or its optional long form)?
argIs = function(i, name, altName){
  if (system.args[i].indexOf(name) == 0) {
    return true;
  }
  if (altName !== undefined && system.args[i].indexOf(altName) == 0) {
    return true;
  }
  return false;
}

getArg = function(i) {
  return system.args[i].trim();
}

v = function(a,b) {
  verbose && console.log(a,b)
}

d = function(a,b) {
  debug && console.log(a,b)
}

for (var i=2; i<system.args.length; i++) { // args[1] is the URL, already consumed above
  if (argIs(i, '--debug')) {
    debug = true; 
    d('DEBUG: ' + getArg(i)); 
  }
  else if (argIs(i, '--full_page')) {
    full_page = true; 
    d('PAGE: ' + getArg(i)); 
  }
  else if (argIs(i, '-H', '--header')) {
    var arg = getArg(++i); 
    var arr = arg.trim().split(/\s*:\s*/);
    var key = arr[0];
    var value = (arr.length == 2) ? arr[1] : '';
    customHeaders[key] = value;

    d('HEADER:', [key, value]);
  }
  else if (argIs(i, '--verbose')) {
    verbose   = true; 
    v('VERBOSE: ' + getArg(i)); 
  }
  else if (argIs(i, '--timeout')) {
    requestTimeout = parseInt(getArg(++i), 10);
    d('REQUEST_TIMEOUT', requestTimeout);
  }
  else {
    console.log('unknown param: '+getArg(i)); 
  }
}
console.log('################');
console.log('headers and values');
console.log(JSON.stringify(customHeaders));

page.settings.resourceTimeout = requestTimeout;

page.customHeaders = customHeaders;
v('VERBOSE: custom headers: ' + JSON.stringify(customHeaders));

page.onConsoleMessage = function (msg) { // call-back function intercepts console.log messages
    d('DEBUG: console.log message="' + msg + '"');
};

page.onLoadFinished = function(status) {
  if ( debug ) {
    // console.log('Status: ' + status +' after onLoadFinished(' + status +')');
    system.stderr.write('OnLoadFinished.Status: ' + (theStatusCode ? theStatusCode : status) +' after onLoadFinished(' + status +')\n');
  }
};

page.onResourceReceived = function(resource) {
  // if (resource.url == URLarg || (theStatusCode >= 300 && theStatusCode < 400)) {
    theStatusPrev = theStatusCode  ;
    theStatusCode = resource.status;
    thePrevURL    = theCurrURL  ;
    theCurrURL    = resource.url;
  // }
    if ( resource.status === 200 ) {
        v('VERBOSE status ' + resource.status + ' for ' + resource.url ); // don't usually log standard success
    } else {
        v('Status Code was: ' + theStatusPrev   + ' for ' + thePrevURL );
        v('Status Code is : ' + theStatusCode   + ' for ' + theCurrURL );
    }
};

page.onUrlChanged = function (URLnew) { // call-back function intercepts console.log messages
    if ( URLnew === URLarg ) {
      d('DEBUG: old/new URL: ' + URLnew + ' --onUrlChanged()');
    } else {
      v('DEBUG: old URL: ' + URLarg);
      v('DEBUG: new URL: ' + URLnew);
    }
};

phantom.onError = function(msg, trace) {
    var msgStack = ['PHANTOM ERROR: ' + msg];
    if (trace) {
        msgStack.push('TRACE:');
        trace.forEach(function(t) {
            msgStack.push(' -> ' + (t.file || t.sourceURL) + ': ' + t.line + (t.function ? ' (in function ' + t.function + ')' : ''));
        });
    }
    console.error(msgStack.join('\n'));
};

page.onResourceTimeout = function(request) {
    console.error('Request timed out due to ' + request.errorCode + ' - ' + request.errorString);
    phantom.exit(1);
}

page.open( URLarg, function () {
    // onLoadFinished executes here
    var page_content = page.content;
    var body_innerHTML= page.evaluate( function() {
      return document.body.innerHTML ? document.body.innerHTML : '(empty)' ;
    });
    var title = page.evaluate(function() {return document.title; });

    // page.render( URLdump); // "screenshot"
    v('VERBOSE: Loading time '+ ( Date.now() - timestamp ) +' msec');
    d('DEBUG: Page title: ' + ((title==='') ? '(none)':title) );
    d('DEBUG: body_innerHTML.length='+ body_innerHTML.length);
    d(' ');

    if ( full_page  || ( ! body_innerHTML ) || body_innerHTML.length < 9 ) {
      console.log( page_content   ); // return all if body is empty
    } else {
      console.log( body_innerHTML );
    }
    // exit from inside the timeout callback, so the delayed logging actually runs
    setTimeout(function() {
        v('VERBOSE: status ' + theStatusPrev   + ' for ' + thePrevURL + ' (b)');
        v('VERBOSE: status ' + theStatusCode   + ' for ' + theCurrURL + ' (c)');
        phantom.exit( theStatusCode );
      }, 1333 ) ; // delay in milliseconds
  }) ;
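For what it's worth, the `-H`/`--header` handling in the script can be exercised on its own under plain Node. The `parseHeaderArgs` function below is a hypothetical standalone restatement of that parsing logic (not a PhantomJS API), handy for checking that curl-style arguments end up as the headers object you expect:

```javascript
// Hypothetical standalone version of the script's '-H'/'--header' parsing,
// runnable under plain Node for quick testing.
function parseHeaderArgs(args) {
  var headers = {};
  for (var i = 0; i < args.length; i++) {
    if (args[i] === '-H' || args[i] === '--header') {
      // split "Key: value" on the colon, trimming surrounding whitespace
      // (same regex as the script above)
      var parts = args[++i].trim().split(/\s*:\s*/);
      headers[parts[0]] = parts.length === 2 ? parts[1] : '';
    }
  }
  return headers;
}

var h = parseHeaderArgs(['-H', 'Cookie: session=XXX', '-H', 'Pragma: no-cache']);
console.log(JSON.stringify(h));
// → {"Cookie":"session=XXX","Pragma":"no-cache"}
```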

Brad Parks