
I am writing a content scraper that scrapes information about shirts on a particular website. I have everything set up with NPM packages in Node to scrape and create a CSV file. The problem I am running into is that, as many know, Node is asynchronous in nature. The CSV file I am trying to write gets written before the JSON object I am creating (by iterating with an each loop) is finished being built, so json2csv (npm package) receives my 'fields' parameter but an empty data object. Can anyone tell me how to make Node wait until my JSON object is built before trying to use fs.writeFile to create the CSV file? Thank you.

'use strict';

//require NPM packages

var request = require('request');
var cheerio = require('cheerio');
var fs = require('fs');
var json2csv = require('json2csv');

//Array of shirt objects for json2csv to write.
var ShirtProps = [];

var homeURL = "http://www.shirts4mike.com/";

//start the scraper
scraper(); 

//Initial scrape of the shirts link from the home page
function scraper () {
  //use the datafolderexists function to check if data is a directory
  if (!DataFolderExists('data')) {
    fs.mkdir('data');
  }
  //initial request of the home url + the shirts.php link
  request(homeURL + "shirts.php", function (error, response, html) {
    if (!error && response.statusCode == 200) {
      var $ = cheerio.load(html);

      //scrape each of the links for its html data
      $('ul.products li').each(function(i, element){
        var ShirtURL = $(this).find('a').attr('href');
        console.log(ShirtURL);
        //pass each ShirtURL in to be scraped and added to the ShirtProps array
        ShirtHTMLScraper(ShirtURL);
      }); 
      FileWrite();
      // end first request
    } else {
      console.error(error);
    }
  });
}

//create function to write the CSV file.
function FileWrite() {
  var fields = ['Title', 'Price', 'ImageURL', 'URL', 'Time'];
  var csv = json2csv({data: ShirtProps, fields: fields}); 
  console.log(csv);
  var d = new Date();
  var month = d.getMonth()+1;
  var day = d.getDate();
  var output = d.getFullYear() + '-' +
  ((''+month).length<2 ? '0' : '') + month + '-' +
  ((''+day).length<2 ? '0' : '') + day;

  fs.writeFile('./data/' + output + '.csv', csv, function (error) {
    if (error) throw error;      
  });    
}

//function to scrape each of the shirt links and create a shirtdata object for each.
function ShirtHTMLScraper(ShirtURL) {
  request(homeURL + ShirtURL, function (error, response, html) {
    if (!error && response.statusCode == 200) {
      var $ = cheerio.load(html);
      var time = new Date().toJSON().substring(0,19).replace('T',' ');
      //json array for json2csv
      var ShirtData = {
        title: $('title').html(),
        price: $(".price").html(),
        imgURL: $('img').attr('src'),
        url: homeURL + ShirtURL,
        time: time.toString() 
      };
      //push the shirt data scraped into the shirtprops array
      ShirtProps.push(ShirtData);
      console.log(ShirtProps);

      // //set the fields in order for the CSV file
      // var fields = ['Title', 'Price', 'ImageURL', 'URL', 'Time'];

      // //use json2csv to write the file -

      // var csv = json2csv({data: ShirtProps, fields: fields}); 
      // console.log(csv);

      // //date for the filesystem to save the scrape with today's date.
      // var d = new Date();
      // var month = d.getMonth()+1;
      // var day = d.getDate();
      // var output = d.getFullYear() + '-' +
      // ((''+month).length<2 ? '0' : '') + month + '-' +
      // ((''+day).length<2 ? '0' : '') + day;

      //   //use the filesystem to write the file, or overwrite it if it exists.
      //     fs.writeFile('./data/' + output + '.csv', csv, function (error) {
      //       if (error) throw error;

      //     }); //end writeFile
    } else {
      console.error(error);
    }
  });
}

//Check if data folder exists, source: http://stackoverflow.com/questions/4482686/check-synchronously-if-file-directory-exists-in-node-js
function DataFolderExists(folder) {
  try {
    // Query the entry
    var DataFolder = fs.lstatSync(folder);

    // Is it a directory?
    if (DataFolder.isDirectory()) {
      return true;
    } else {
      return false;
    }
  } //end try
  catch (error) {
    console.error(error);
  }
}

1 Answer

It's not so much that Node is asynchronous in nature as that certain functions are asynchronous. In this case, it's the calls using request that are asynchronous. You're calling FileWrite directly after the second request call (the one inside ShirtHTMLScraper) begins, not after it finishes. Place the call to FileWrite in the callback of ShirtHTMLScraper, after populating ShirtProps.

Edit: After looking closer, that won't work either. The problem is that you are calling an asynchronous function inside a synchronous loop. You can get it to work by creating a counter that increments in each asynchronous callback and checking whether it has reached the length of the collection you're iterating over; once it has, run FileWrite.
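
For example, something along these lines (a rough, untested sketch of the counter idea; shirtsScraped and totalShirts are new names I'm introducing here, and totalShirts would need to be passed in from the first request, e.g. $('ul.products li').length, since it isn't in your current code):

var shirtsScraped = 0;

function ShirtHTMLScraper(ShirtURL, totalShirts) {
  request(homeURL + ShirtURL, function (error, response, html) {
    if (!error && response.statusCode == 200) {
      var $ = cheerio.load(html);
      // ...build ShirtData and push it onto ShirtProps exactly as you do now...
      shirtsScraped++;
      // Only write the CSV once every asynchronous request has called back.
      if (shirtsScraped === totalShirts) {
        FileWrite();
      }
    } else {
      console.error(error);
    }
  });
}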

A better way to go might be to check out the Async library. You can give its .each() two functions: one to run for each item, and one to run once they've all finished.
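
A rough sketch with async (again untested; it assumes you collect the hrefs into a new array, ShirtURLs, first, instead of firing the requests inside the cheerio each loop):

var async = require('async');

request(homeURL + "shirts.php", function (error, response, html) {
  if (!error && response.statusCode == 200) {
    var $ = cheerio.load(html);

    // Collect the links first, then let async drive the second round of requests.
    var ShirtURLs = [];
    $('ul.products li').each(function () {
      ShirtURLs.push($(this).find('a').attr('href'));
    });

    async.each(ShirtURLs, function (ShirtURL, done) {
      request(homeURL + ShirtURL, function (error, response, html) {
        if (error) return done(error);
        // ...build ShirtData and push it onto ShirtProps as before...
        done();
      });
    }, function (error) {
      if (error) return console.error(error);
      FileWrite(); // runs exactly once, after every shirt has been scraped
    });
  } else {
    console.error(error);
  }
});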

  • Matt, I tried this and it still doesn't work: it shows blank data each time FileWrite is called, underneath the correct fields I have set up. Also, if I place it in the callback function for ShirtHTMLScraper, it then writes the file every time it iterates over one of the shirt links. I want it to write the file once, after the ShirtProps object containing the key-value pairs has been populated. – BrokenWings Jul 25 '16 at 01:40
  • Is the console.log of ShirtProps after the push empty? Because if it is correct, and FileWrite fails when being called right after it, there is a problem with FileWrite. I see what you mean about the repeated writes, and I'll look at that too – Matt Broatch Jul 25 '16 at 01:50
  • No, the console.log of ShirtProps is fine, and it actually shows the object being built after it tries to FileWrite, since console.log gets called each time it iterates over a shirt link and grabs the data. I did that so I could see the object being built one shirt at a time, but again, the object finishes building after it tries to write the file :( – BrokenWings Jul 25 '16 at 01:52
  • I'm saying that if you were to put the call to FileWrite directly after the console log of ShirtProps, you know for a fact that the data is in ShirtProps before calling FileWrite (console.log is certainly synchronous, after all!). You WILL have the problem of too many writes, which I changed my answer to reflect. Since fs.writeFile is also asynchronous, I wonder if calling multiple writes is messing with its ability to write. Seems like a long shot though, since it should have at least one piece of data when it tries to write the first time, and you should see at least that one piece. – Matt Broatch Jul 25 '16 at 02:03
  • I figured out why I wasn't getting any data to write: my key-value pairs were not exactly the same as the fields I was trying to use for the json2csv node package, DOH! Now I am trying to figure out how to pass the number of product links scraped in the first scraper into the second scraper, so I can use a counter on it and have FileWrite run only once the counter matches the number of product links on the page. It shows as undefined when I define it globally and assign it the value of ul.products li in the first scraper. – BrokenWings Jul 25 '16 at 02:58
  • Edit: I fixed it. I just created a variable NumberOfShirts in my original request and passed it into the ShirtHTMLScraper function along with the ShirtURL. In ShirtHTMLScraper I add to a global counter variable each time it's called, and if it matches NumberOfShirts it runs the fixed FileWrite function. Thanks for all your help! – BrokenWings Jul 25 '16 at 03:24