
I want to scrape some webpages and extract some data from them in Node.js. My code works, but it takes almost a minute to finish scraping and return all the data. I've used async functions for each website and promises to gather all the information. There are at most 100 links that I'm working with, so I think the running time is too long. Is there any issue in my code's structure (the usage of request-promise, promises, async, await, etc.) that causes the delay? All the functions can run in parallel/asynchronously, but my constraint is that I need to wait until all the results come back from every website. I've limited the timeout of each request to 10 seconds; if I decrease it further, the existing ETIMEDOUT, ECONNRESET and ESOCKETTIMEDOUT errors (which I still couldn't get rid of) increase.

Here is one of my scraping functions:

const rp = require('request-promise');
const cheerio = require('cheerio');
const fs = require("fs");
const Promise = require("bluebird");

async function ntv() {
    var posts = [];
    try {
        const baseUrl = 'http://www.ntv.com';
        const mainHtml = await rp({uri: baseUrl, timeout: 10000});
        const $ = cheerio.load(mainHtml);
        const links =
            $(".swiper-slide")
                .children("a")
                .map((i, el) => {
                    return baseUrl + $(el).attr("href");
                }).get();

        posts = await Promise.map(links, async (link) => {
            try {
                const newsHtml = await rp({uri: link, timeout: 10000});
                const $ = cheerio.load(newsHtml);
                return {
                    title: $("meta[property='og:title']").attr("content"),
                    image: $("meta[property='og:image']").attr("content"),
                    summary: $("meta[property='og:description']").attr("content")
                }
            } catch (err) {
                if (err.message == 'Error: ETIMEDOUT') console.log('TIMEOUT error ' + link);
                else if (err.message == 'Error: read ECONNRESET') console.log('CONNECTION RESET error ' + link);
                else if (err.message == 'Error: ESOCKETTIMEDOUT') console.log('SOCKET TIMEOUT error ' + link);
                else console.log(err);
            }
        })
    } catch (e) {
        console.log(e)
    }
    return posts;
}

My main function that runs all these scraping functions is this:

var Promise = require("bluebird")
var fs = require("fs")

async function getData() {
    const sourceFunc = [func1(), func2(), ... , func10()];
    var news = [];

    await Promise.map(sourceFunc, async (getNews) => {
        try {
            const currentNews = await getNews;
            news = news.concat(currentNews);
        } catch (err) {
            console.log(err);
        }
    },{concurrency:10});

    news.sort(function(a,b){
        return new Date(b.time) - new Date(a.time);
    });
    fs.writeFile('./news.json', JSON.stringify(news, null, 3), (err) => {
        if (err) throw err;
    });
    return news;
}
Huseyin Sahin
  • `sourceFunc` and `getNews` look wrong. What is your exact code? And how does this relate to the `ntv` function from the first snippet? – Bergi Nov 28 '18 at 22:18

1 Answer


I would start by adding some benchmarks to your script. Figure out which step in the `ntv()` function takes the most time and tweak it.
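As a rough sketch (the timer labels are arbitrary, and `fetchPost` stands in for the per-link try/catch from your `ntv()`), you could wrap the two phases with console.time/console.timeEnd to see whether the main page load or the per-article fetching dominates:

const rp = require('request-promise');
const cheerio = require('cheerio');
const Promise = require('bluebird');

async function ntv() {
    const baseUrl = 'http://www.ntv.com';

    console.time('ntv: main page');
    const mainHtml = await rp({uri: baseUrl, timeout: 10000});
    console.timeEnd('ntv: main page');

    const $ = cheerio.load(mainHtml);
    const links = $(".swiper-slide").children("a")
        .map((i, el) => baseUrl + $(el).attr("href")).get();

    console.time('ntv: ' + links.length + ' articles');
    // fetchPost is a placeholder for your existing per-link handler
    const posts = await Promise.map(links, fetchPost);
    console.timeEnd('ntv: ' + links.length + ' articles');

    return posts;
}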

My other guess is that parsing the entire HTML with cheerio is a bottleneck. It could be more performant to use `String.prototype.substring()` or a `RegExp` to extract the links and post information.
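For instance, here is a minimal sketch that pulls the og: meta tags with a regular expression instead of building a full cheerio document. It assumes the property attribute appears before content, which most pages follow but is not guaranteed, so keep the cheerio path as a fallback:

// Extract an Open Graph meta tag without parsing the whole document.
function ogTag(html, name) {
    const re = new RegExp(
        '<meta[^>]+property=["\']og:' + name + '["\'][^>]+content=["\']([^"\']*)["\']', 'i');
    const match = html.match(re);
    return match ? match[1] : undefined;
}

// Inside the per-link handler:
// const newsHtml = await rp({uri: link, timeout: 10000});
// return {
//     title: ogTag(newsHtml, 'title'),
//     image: ogTag(newsHtml, 'image'),
//     summary: ogTag(newsHtml, 'description')
// };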

UPDATE:

See whether the number of concurrent TCP connections is a bottleneck, and check/adjust the OS and HTTP agent limits if it is.
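One thing to experiment with (a sketch only; the numbers are starting points, not recommendations): request lets you reuse sockets via a keep-alive agent and cap the socket pool size. On Linux you can also check the per-process open-file limit with `ulimit -n`.

const rp = require('request-promise');

function fetchPage(uri) {
    return rp({
        uri,
        timeout: 10000,
        forever: true,            // keep-alive agent: avoids a new TCP handshake per request
        pool: {maxSockets: 20}    // limit simultaneous sockets per host; tune this value
    });
}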

If concurrency is the problem, it may make sense to split the job into several programs, e.g.:

  1. Process #1 generates a list of URLs to be fetched
  2. Process #2 takes a URL from the list, fetches its HTML and saves it locally
  3. Process #3 takes the saved HTML and parses it

If you split the job like this, you can parallelize it better. Node runs your JavaScript on a single core; with multiple processes you can, for example, run several fetchers at once, benefit from multiple cores, and sidestep any per-process limits on open connections.
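A minimal sketch of step 2 as its own program, using Node's cluster module. The file names (urls.json, the ./html directory, which must already exist) and the worker count are made up for illustration:

const cluster = require('cluster');
const fs = require('fs');
const rp = require('request-promise');

const WORKERS = 4;
const urls = JSON.parse(fs.readFileSync('./urls.json', 'utf8')); // produced by process #1

if (cluster.isMaster) {
    for (let i = 0; i < WORKERS; i++) {
        cluster.fork({WORKER_INDEX: i});
    }
} else {
    const index = Number(process.env.WORKER_INDEX);
    // Each worker takes every WORKERS-th URL so the list is split evenly.
    const mine = urls.filter((url, i) => i % WORKERS === index);

    Promise.all(mine.map(async (url, i) => {
        try {
            const html = await rp({uri: url, timeout: 10000});
            fs.writeFileSync('./html/' + index + '-' + i + '.html', html); // read later by process #3
        } catch (err) {
            console.log('fetch failed: ' + url);
        }
    })).then(() => process.exit(0));
}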

If the URLs and HTML are saved in a shared database, you can distribute the tasks across multiple machines and improve performance further.

Eriks Klotins
  • Spoiler alert: it's the network (probably), probably exacerbated by OS limits on simultaneous TCP connections. I'd rule that out first by manually scraping the largest site and accessing that locally. – danh Nov 28 '18 at 20:58
  • You are totally right that the request process determines the total running time of my code. Some websites give late responses. I've actually tried the code both on my laptop and on AWS Lambda, but the total running time is similar. I just want to be sure that there is no issue in my code related to the usage of promises, async functions, etc. In my code the URLs and HTML are not saved; I only save the final result to Amazon S3. – Huseyin Sahin Nov 28 '18 at 21:23
  • @HüseyinŞahin - If some requests are taking a long time to return, then you should probably increase the `concurrency` value you pass to `Promise.map()` to something a lot higher. You want to make sure you keep the single node.js completely busy so it's never just waiting for networking. It's very simple to experiment with that concurrency value. – jfriend00 Nov 28 '18 at 21:28
  • @jfriend00 I actually added the concurrency limit after I got too many timeout and connection reset errors. But thanks, I will increase it and try again. – Huseyin Sahin Nov 28 '18 at 21:33
  • @HüseyinŞahin - Yes, some level of concurrency control is required or you'll be attempting too many open sockets at once and perhaps too many requests of the same host at the same time, but you also need to keep your pipeline full to keep your processor busy. It is a tradeoff. An optimized approach might also separate out URLs by host so you limit the number of simultaneous requests to the same host to something smaller, but have lots more concurrent requests to separate hosts. – jfriend00 Nov 28 '18 at 23:05
  • It's worth pointing out that 100 sites/min (600 ms per site) isn't half bad. – danh Nov 29 '18 at 02:58