
I am building an API that crawls data from many sources (millions of records), but I keep running into out-of-memory errors. I googled and found some resources, but none of them solved my problem.

I found a similar question, but it did not solve my problem.

Here is my sample code:

const q = require('q');
const cheerio = require('cheerio');
const request = require('request');
const async = require('async');

// `array`, `tagArray` and the MySQL connection `db` are set up elsewhere (not shown)

function getContent() {

    let d = q.defer();

    let urls = [];

    array.forEach(function (mang, index) {
        for (let i = 1; i <= 600000; i++) {
            urls.push(function (callback) {
                setTimeout(function () {
                    let link = 'http://something.com/' + i;
                    let x = link;

                    let options = {
                        url: link,
                        headers: {
                            'User-Agent': 'something'
                        }
                    };

                    function callback1(error, response, html) {
                        if (!error) {
                            let $ = cheerio.load(html);
                            let tag_name = $('h1').text().trim();
                            let tag_content = $('#content-holder').find('div').text();

                            let tagObject = {
                                tag_name: tag_name,
                                tag_content: tag_content,
                                tag_number: i
                            };

                            tagArray.push(tagObject);

                            for (let v = 0; v < tagArray.length; v++) {
                                db.query("INSERT INTO `tags` (tag_name, content) " +
                                    "SELECT * FROM (SELECT '" + tagArray[v].tag_name + "' AS TagName, '" + tagArray[v].tag_content + "' AS TagContent) AS tmp " +
                                    "WHERE NOT EXISTS (SELECT `tag_name` FROM `tags` WHERE `tag_name`='" + tagArray[v].tag_name + "') " +
                                    "LIMIT 1", (err) => {
                                        if (err) {
                                            console.log(err);
                                        }
                                    });
                            }
                            urls = null;
                            global.gc();

                            console.log("Program is using " + process.memoryUsage().heapUsed + " bytes of Heap.");
                        }
                    }

                    request(options, callback1);
                    callback(null, x);
                }, 15000);
            });
        }
    });

    d.resolve(urls);
    return d.promise;
}

getContent()
    .then(function (data) {
        let tasks = data;
        console.log("start data");
        async.parallelLimit(tasks, 10, () => {
            console.log("DONE ");
        });
    })

I tried calling global.gc() (which requires running Node with the --expose-gc flag), but it does not seem to be effective.

Keitaro Urashima

  • You *really* need to create a minimally viable question. Show your work, explain what you're looking for, etc. This question as is requires us to make far too many assumptions about your approach. – Paul Jun 15 '17 at 16:12
  • @Paul I edited it by giving some example code :( – Keitaro Urashima Jun 16 '17 at 11:05
  • Added an answer, but something I don't understand: what is in `array`? You're doing a forEach on it, but I can't see where you're using the items in it for anything. Also, you realize you're adding at least 1.2 million functions to the heap *per item* in `array`, right? I say "at least" because I can't recall whether callback1 will be created once or once per scope; pretty sure it's the latter. – Paul Jun 16 '17 at 12:39

1 Answer


Ah, I see your problem now. You're trying to do it all in memory in one loop. That way lies madness for any nontrivial amount of work as each of those anonymous functions you're creating is added to the heap. Plus, it's not very robust. What happens if you get a network outage on the 450,000th crawl? Do you lose it all and start over?

Look to having a job that runs in smaller batches. I've used task managers like Kue for this before, but frankly all you need to do is cap the batch of URLs you work on at some reasonable number, like 10 or 25. One way is to have a table with all the URLs in it and either a flag that marks them as successfully crawled, or a last-crawled date if you plan to crawl them again over time.
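For instance, a rough sketch of that table (the table and column names here are just illustrative, and I'm assuming the same mysql-style `db` connection you already have):

// illustrative schema for tracking crawl state; adjust types to taste
db.query(
    "CREATE TABLE IF NOT EXISTS `urls` (" +
    "  `id` INT AUTO_INCREMENT PRIMARY KEY," +
    "  `url` VARCHAR(2048) NOT NULL," +
    "  `crawled` TINYINT(1) NOT NULL DEFAULT 0," + // flag: has this URL been crawled yet?
    "  `last_crawled_at` DATETIME NULL" +          // or use this date if you re-crawl over time
    ")", (err) => {
        if (err) {
            console.log(err);
        }
    });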

Then query for all URLs that haven't been crawled (or whose last-crawled date is earlier than some cutoff, like a week ago) and limit the results to 10 or 25 or whatever. Crawl and store those first; I'd probably use something like async.js's map or Promise.all for that rather than the loop you're currently using.
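Something along these lines, as a sketch built on the hypothetical `urls` table above and the same `request`/`cheerio`/`db` setup from your code (the batch size and function names are placeholders; I'm also assuming `db` is the mysql module, which supports `?` placeholders):

const BATCH_SIZE = 25;

// fetch the next batch of URLs that haven't been crawled yet
function getNextBatch() {
    return new Promise((resolve, reject) => {
        db.query("SELECT `id`, `url` FROM `urls` WHERE `crawled` = 0 LIMIT " + BATCH_SIZE,
            (err, rows) => err ? reject(err) : resolve(rows));
    });
}

// crawl a single URL, store the result, then flag the URL as done
function crawlOne(row) {
    return new Promise((resolve, reject) => {
        request({ url: row.url, headers: { 'User-Agent': 'something' } }, (error, response, html) => {
            if (error) return reject(error);
            let $ = cheerio.load(html);
            let tag_name = $('h1').text().trim();
            let tag_content = $('#content-holder').find('div').text();
            // parameterized queries also avoid the quoting problems in your string-built SQL
            db.query("INSERT INTO `tags` (tag_name, content) VALUES (?, ?)", [tag_name, tag_content], (err) => {
                if (err) return reject(err);
                db.query("UPDATE `urls` SET `crawled` = 1, `last_crawled_at` = NOW() WHERE `id` = ?", [row.id],
                    (err2) => err2 ? reject(err2) : resolve());
            });
        });
    });
}

// crawl one whole batch in parallel
function crawlBatch(rows) {
    return Promise.all(rows.map(crawlOne));
}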

If all the URLs are hitting the same domain, you probably want a short delay between each request, just to be respectful of those resources.
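One simple way to do that (purely illustrative, building on the sketch above) is to crawl the batch sequentially with a small pause between requests instead of in parallel:

// wait a bit so we don't hammer the target server
function delay(ms) {
    return new Promise(resolve => setTimeout(resolve, ms));
}

// sequential alternative to crawlBatch: one request at a time, half a second apart
async function crawlBatchPolitely(rows) {
    for (const row of rows) {
        await crawlOne(row);
        await delay(500);
    }
}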

After a batch is done, query your DB for the next batch and repeat.
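In code, that outer loop is just: fetch a batch, crawl it, and stop once the query comes back empty (again a sketch, using the hypothetical helpers above):

async function run() {
    while (true) {
        const rows = await getNextBatch();
        if (rows.length === 0) break; // nothing left to crawl
        await crawlBatch(rows);       // or crawlBatchPolitely(rows) for the throttled version
        console.log("Finished a batch of " + rows.length + " URLs");
    }
    console.log("DONE");
}

run().catch(err => console.log(err));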

Depending on your architecture, it might be better to keep this program simpler, doing nothing but fetching one batch and crawling it. You can then run it from a cron job or as a Windows service every 5 or 15 minutes or whatever.
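If you go the cron route, the entry is a one-liner; for example (the script path here is a placeholder), to run it every 5 minutes:

*/5 * * * * /usr/bin/node /path/to/crawl-batch.js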

On mobile right now, but I'll try and get on a laptop later to give you a code example if you need it.

Paul