2

I'm experimenting with Node.js to scrape data from an e-commerce site. I use Request to retrieve the page's HTML and Cheerio to do server-side DOM selection.

const cheerio = require('cheerio');
const request = require('request');

// takes a URL, scrapes the page, and returns an object with the data
let scrapePage = (url) => {

    return new Promise((resolve, reject) => {

        request(url, (error, resp, body) => {

            if(error){
                reject(error);
            };

            let $ = cheerio.load(body); 
            let $url = url;
            let $price = $('#rt-mainbody > div > div.details > div.Data > div:nth-child(4) > div.description').text();

            let obj = {
                url: $url,
                price: $price
            }

            resolve(obj);

        });

    });

};

// Runs scrapePage in a loop
// There is a variable called arrayOfURLs defined elsewhere that contains 100s of URLs

for( let i = 0; i < arrayOfURLs.length; i++){
    scrapePage(arrayOfURLs[i])
        .then((obj) => {
            //write to a file
        })
        .catch((error) => {
        })
};

The problem is that the server I send requests to sometimes sends back blank data, I'm assuming because I'm sending too many requests without any kind of pause. Due to the async nature of JS I'm having a hard time figuring out how to add an effective delay between each iteration of the loop. It's not enough to just add a `setTimeout` in a synchronous fashion, because `setTimeout` itself is async, and I'm running this on the server so there's no `window` object.

EDIT

The code above is a simplified version of what I'm working on. The entire code is this:

app.js

const fs = require('fs');
const path = 'urls.txt';
const path2 = 'results.txt';
const scraper = require('./scraper');

let scrapePage = (url) => {
    scraper.scrapePage(url)
        .then((obj) => {
            // console.log('obj from the scraper with Promises was received');
            // console.log(obj);
            // console.log('writing obj to a file');
            fs.appendFile(path2, JSON.stringify(obj) + ', ', (error) => {
                if(error){
                    console.log(error);
                } else {
                    // console.log('Successfully wrote to ' + path2);
                }
            })
        })
        .catch((error) => {
            console.log('There was an error scraping obj: ');
            console.log(error);
        })  
}

fs.readFile(path, 'utf8', (err, data) => {

  if (err){
    throw err;
  };

  var urlArray = JSON.parse(data);

  // this returns an Unexpected Identifier error    
  // const results = await Promise.all(urlArray.map(scrapePage));

  // this returns an Unexpected Token Function error
  // async function scrapePages(){
  //    const results = await Promise.all(urlArray.map(scrapePage));
  // };

});

scraper.js

const request = require('request');
const cheerio = require('cheerio');

exports.scrapePage = (url) => {
    return new Promise((resolve, reject) => {
        request(url, (error, resp, body) => {
            if(error){
                reject(error);
            };

            let $ = cheerio.load(body); 
            let $url = url;

            let $price = $('#rt-mainbody > div > div.details > div.itemData > div:nth-child(4) > div.description').text();

            let obj = {
                url: $url,
                price: $price
            }

            resolve(obj);

        })
    })
}
fuzzybabybunny
  • Possible duplicate of [What is the JavaScript version of sleep()?](https://stackoverflow.com/questions/951021/what-is-the-javascript-version-of-sleep) – Obsidian Age Dec 18 '17 at 01:32
  • you cannot rely on assuming the data will be there after a given amount of time. Try an approach with a callback function instead. – Ctznkane525 Dec 18 '17 at 01:33
  • I would investigate this blank data problem; at least log the headers and response code so you can figure out where the error is. It seems like you are guessing, so why make changes before you know what the error is? – Ryan Doherty Dec 18 '17 at 01:55
  • @RyanDoherty, the blank data problem is inconsistent. Out of 100 URLs, maybe 25% will return blank. On repeated runs the URLs that previously returned blank will return data. When I manually go to the page in the browser the URLs all return correct data. When I console log the `body` of the pages that return blank I get `127.0.0.1 user_items` – fuzzybabybunny Dec 18 '17 at 02:01
  • I don't agree with @ryan doherty, the blank responses are likely either a server issue or a bug in request caused by making so many concurrent requests. Throttling will likely solve it. – pguardiario Dec 19 '17 at 09:45
  • @pguardiario, how do I throttle or, back to my initial question, add a delay and a timeout between requests? – fuzzybabybunny Dec 19 '17 at 23:46
  • How do we know the problem is too many requests and throttling will solve the problem? A blank page could mean anything, which is why we need to log the headers and response code from the blank pages. Otherwise you are just making assumptions about what the problem is. – Ryan Doherty Dec 21 '17 at 01:16
  • @fuzzybabybunny - just slice the array up and await some small number at a time. – pguardiario Dec 22 '17 at 00:31
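
A minimal sketch of that last suggestion (slice the array up and await a small batch at a time) might look like the following; the batch size of 5 is an arbitrary assumption, `scrapePage` and `arrayOfURLs` are taken from the question, and it needs a Node version with async/await support:

// Sketch: process the URLs in small batches so only a few requests are in
// flight at once. Assumes scrapePage returns a Promise, as in the question.
async function scrapeInBatches(urls, batchSize = 5) {
    const results = [];
    for (let i = 0; i < urls.length; i += batchSize) {
        const batch = urls.slice(i, i + batchSize);
        // wait for the whole batch to finish before starting the next one
        results.push(...await Promise.all(batch.map(scrapePage)));
    }
    return results;
}

scrapeInBatches(arrayOfURLs).then((results) => {
    // write results to a file here
});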

3 Answers

3

Looks to me like you aren't waiting for your promises to resolve before you send the server response. You could eliminate the for loop entirely by using async / await, e.g.

const results = await Promise.all(arrayOfURLs.map(scrapePage));
James
  • I made a small edit to the code. I'm already waiting on the promise to resolve with `.then`. The `scrapePage` method returns a Promise, and I wait for it to resolve with `.then`, after which I write the results to a file. – fuzzybabybunny Dec 18 '17 at 02:27
  • @fuzzybabybunny yeah so that's fine for post processing each scrape, however my concern is you aren't waiting for *all* the promises to resolve before you return the HTTP response - this is the most likely cause of the server returning "blank" responses because it's returning *before* the scraping has finished. – James Dec 18 '17 at 02:34
  • I've edited my original post to include my actual code. I added your bit with `await Promise.all` but I'm getting errors - they're described in my updated code. I'm running the `scrapePage` function in the callback function of a method I run to read the URLs from a text file. – fuzzybabybunny Dec 18 '17 at 02:50
  • @fuzzybabybunny what version of node are you using? – James Dec 18 '17 at 09:18
  • I'm on v6.12.1. Is this behavior not normal? – fuzzybabybunny Dec 18 '17 at 09:53
  • At the end of the day, there still *needs* to be a way to space out requests so that I don't hammer the site I'm scraping and cause my IP address to get banned. Using Promise is fine for avoiding asynchronous hell on my end but does nothing to space out requests, right? – fuzzybabybunny Dec 18 '17 at 10:04
  • @fuzzybabybunny nope, `async` / `await` was only introduced in v7, I'd recommend upgrading your Node version regardless, v6 is _way_ out of date. The idea of spacing out the requests is a completely different problem, the problem you discuss in your question is related to the server returning a blank response... – James Dec 18 '17 at 11:06
  • I still have the same error. I'm on v8.9.3 now. I even created an entirely new project after upgrading node but the same `Unexpected Identifier` error occurs on `const results = await Promise.all(urlArray.map(scrapePage));` inside of `fs.readFile` – fuzzybabybunny Dec 18 '17 at 17:52
  • @fuzzybabybunny the `await` keyword is only valid inside an `async` function, so if you're calling this inside a callback then you need to declare it as an `async` function, e.g. `fs.readFile('...', 'utf-8', async (err, data) => { /* you can now use await in here */ });`. Alternatively, `await` the `readFile` call too, then you wouldn't need the callback function at all. – James Dec 18 '17 at 19:02
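
For reference, a minimal sketch of that last suggestion, assuming Node 8+ and the `scraper` module from the question, could look like:

const fs = require('fs');
const scraper = require('./scraper');

// the callback is declared async, so await is valid inside it
fs.readFile('urls.txt', 'utf8', async (err, data) => {
    if (err) throw err;

    const urlArray = JSON.parse(data);
    const results = await Promise.all(urlArray.map(scraper.scrapePage));
    console.log(results.length + ' pages scraped');
});
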
1

If you want to have no more than x active connections at a time you could use throttle. Or if you want no more than x requests per second you could use throttlePeriod.

Promise.all will never call your resolve handler if even one request fails, so you could catch any errors and return a Fail object instead:

// throttle and throttlePeriod are assumed to come from a promise-throttling
// helper (not shown in this answer); Fail marks requests that were rejected.
const Fail = function(details){this.details = details;};
const max10 = throttle(10)(scrapePage); // max 10 active connections
//const twoPerSecond = throttlePeriod(2, 1000)(scrapePage); // start no more than 2 per second
Promise.all(
  arrayOfURLs.map(
    url =>
      max10(url)
      .catch(err => new Fail([err, url]))
  )
)
.then(
  results => {
    const successes =
      results.filter(
        result => (result && result.constructor) !== Fail
      );
    const failed =
      results.filter(
        result => (result && result.constructor) === Fail
      );
  }
);
HMR
1
const cheerio = require('cheerio');
const request = require('request');

let scrapePage = (url) => {

    return new Promise((resolve, reject) => {

        request(url, (error, resp, body) => {

            if(error){
                reject(error);
                return;
            }

            if(!body) {
                reject('Empty Body');
                return;
            }

            let $ = cheerio.load(body);

            let $url = url;
            let $price = $('#rt-mainbody > div > div.details > div.Data > div:nth-child(4) > div.description').text();

            let obj = {
                url: $url,
                price: $price
            }

            resolve(obj);

        });

    });
};

function processUrl(url){
    scrapePage(url)
        .then((obj) => {
            //write to a file
            if(arrayOfURLs.length > 0)
                processUrl(arrayOfURLs.pop());
        })
        .catch((error) => {
            arrayOfURLs.unshift(url); // put the failed URL back in the queue
            if(arrayOfURLs.length > 0)  // this continuation could go in a finally block
                processUrl(arrayOfURLs.pop());
        });
}
processUrl(arrayOfURLs.pop());

Here we use the arrayOfURLs array as a queue: if we receive an error or a blank page, we put that URL back into the array. That way every URL is processed one at a time, in a synchronous fashion.
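
Since the original question also asked for a pause between requests, one possible extension of this approach (just a sketch, with a hypothetical 1-second delay and the same `scrapePage` and `arrayOfURLs` as above) is to chain a promisified `setTimeout` before each recursive call:

// Sketch: delay resolves after ms milliseconds, so a pause can be chained
// in between requests.
const delay = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

function processUrl(url){
    scrapePage(url)
        .then((obj) => {
            // write to a file, then pause before the next request
            return delay(1000);
        })
        .catch((error) => {
            arrayOfURLs.unshift(url); // retry the failed URL later
            return delay(1000);
        })
        .then(() => {
            if(arrayOfURLs.length > 0)
                processUrl(arrayOfURLs.pop());
        });
}
processUrl(arrayOfURLs.pop());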

Ankit Rana