
So I'm making a little scraper for learning purposes; in the end I should get a tree-like structure of the pages on the website.

I've been banging my head trying to get the requests right. This is more or less what I have:

var request = require('request');


function scanPage(url) {

  // request the page at given url:


  request.get(url, function(err, res, body) {

    var pageObject = {};

  /* ... jQuery mumbo-jumbo to:

      1. fill pageObject with information, and
      2. get the links on that page and store them in arrayOfLinks

  */

    var arrayOfLinks = ['url1', 'url2', 'url3'];

    for (var i = 0; i < arrayOfLinks.length; i++) {

      pageObject[arrayOfLinks[i]] = scanPage(arrayOfLinks[i]);

    }
  });

  return pageObject;
}

I know this code is wrong on many levels, but it should give you an idea of what I'm trying to do.

How should I modify it to make it work? (without the use of promises if possible)

(You can assume that the website has a tree-like structure, so every page only has links to pages further down the tree, hence the recursive approach.)

  • You would probably need an html parser. Try googling something like "javascript html parser"... – Daniel May 31 '16 at 13:09
  • Thank you, but it has nothing to do with my question. I parse the HTML with cheerio (a Node.js jQuery implementation); my problem is how to handle recursively building my object. – Gloomy May 31 '16 at 13:18
  • The biggest challenge here is to achieve recursive behavior due to the async nature of JavaScript. – AJS May 31 '16 at 13:23
  • I wanted to achieve something similar a while back; with the little time I had, I decided to go with https://www.npmjs.com/package/sync-request – AJS May 31 '16 at 13:26
  • @AJS: Hmm, I'll try that until a better solution arises – Gloomy May 31 '16 at 14:04
  • "*without the use of promises if possible*" - actually, that would simplify it a lot. – Bergi May 31 '16 at 14:31
  • _"Jquery mumbo-jumbo"_ - I actually was not aware that Jquery had [a port for node](https://www.npmjs.com/package/jQuery). That's interesting. – Patrick Roberts May 31 '16 at 15:53

1 Answer


I know that you'd rather not use promises for whatever reason (and I can't ask why in the comments because I'm new), but I believe that promises are the best way to achieve this.

Here's a solution using promises that answers your question, but might not be exactly what you need:

var request = require('request');
var Promise = require('bluebird');
var get = Promise.promisify(request.get);

var maxConnections = 1; // maximum number of concurrent connections

function scanPage(url) {

    // request the page at given url:

    return get(url).then((res) => {

        var body = res.body;

        /* ... jQuery mumbo-jumbo to:

            1. fill the page object with information, and
            2. get the links on that page and store them in arrayOfLinks

        */

        var arrayOfLinks = ['url1', 'url2', 'url3'];

        return Promise.map(arrayOfLinks, scanPage, { concurrency: maxConnections })
            .then((results) => {
                var page = {};
                for (var i = 0; i < results.length; i++)
                    page[arrayOfLinks[i]] = results[i];
                return page;
            });

    });

}

scanPage("http://example.com/").then((res) => {
    // do whatever with res
});

Edit: Thanks to Bergi's comment, rewrote the code to avoid the Promise constructor antipattern.
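
For reference, the antipattern Bergi pointed out looks roughly like this: the first sketch needlessly wraps an already-promise-returning call in a new Promise, while the second simply chains. The function names here are illustrative:

var request = require('request');
var Promise = require('bluebird');
var get = Promise.promisify(request.get);

// Antipattern: manually wrapping code that already returns a promise;
// the extra layer is redundant and makes it easy to swallow errors.
function scanPageWrapped(url) {
    return new Promise((resolve, reject) => {
        get(url).then((res) => {
            resolve(res.body);
        }).catch(reject);
    });
}

// Preferred: promisify the callback API once, then just chain .then().
function scanPageChained(url) {
    return get(url).then((res) => res.body);
}

scanPageChained('http://example.com/').then((body) => console.log(body.length));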

Edit: Rewrote the code in a much better way: by using Bluebird's concurrency option, you can easily limit the number of simultaneous connections.

Originato
  • Avoid the [`Promise` constructor antipattern](http://stackoverflow.com/q/23803743/1048572)! You should only promisify `request.get` using it, and then chain the rest of the code to it using `.then(…)`. – Bergi May 31 '16 at 14:58
  • Don't run this on something like wikipedia... you may just hog all the bandwidth on your local network, heat up your CPU and possibly be suspected of DDoSing the website or something. Also try to prevent cyclical links from doing something like `url1 -> url2 -> url1 -> ...`. – Patrick Roberts May 31 '16 at 15:57
  • I had come to a similar solution; the problem is that all requests fire at the same time and the server is *not happy* (cf. what Patrick Roberts says). I tried doing it sequentially with reduce() but it's a bit too advanced for me, so that's why I was asking for a "classical" solution. – Gloomy May 31 '16 at 16:07
  • `var promises = arrayOfLinks.map(scanPage);` – Thomas Jun 01 '16 at 01:39
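
Since the question asked for a solution without promises, and the comments above raise both sequencing and cyclical links, here is a minimal callback-based sketch along those lines. It visits links one at a time (so only one connection is open at once) and keeps a visited table to guard against cycles; the link extraction is still stubbed out:

var request = require('request');

function scanPage(url, visited, callback) {

    // guard against cyclical links like url1 -> url2 -> url1 -> ...
    if (visited[url]) return callback(null, null);
    visited[url] = true;

    request.get(url, (err, res, body) => {
        if (err) return callback(err);

        // placeholder: parse `body` (e.g. with cheerio) and collect links
        var arrayOfLinks = [];

        var pageObject = {};
        var i = 0;

        // visit the links sequentially; each recursive call starts only
        // after the previous subtree has been fully scanned
        (function next() {
            if (i >= arrayOfLinks.length) return callback(null, pageObject);
            var link = arrayOfLinks[i++];
            scanPage(link, visited, (err, subtree) => {
                if (err) return callback(err);
                pageObject[link] = subtree;
                next();
            });
        })();
    });
}

scanPage('http://example.com/', {}, (err, tree) => {
    if (err) throw err;
    // do whatever with tree
});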