
I'm writing a web scraper that uses regex to extract information in a paragraph and store it in an object. Then I add the object to an array. Here's my full code:

function scrapeCourseData(htmlString) {
  // scrapes a specific department's course list 

  var tempArr = [];

  console.log(tempArr); // outputs '[]' 

  $ = cheerio.load(htmlString);

  // #coursestextcontainer contains the actual information for every single course listed in a department                           
  $('#coursestextcontainer').find('.courseblock').each(function(i, elem) {
    // finds all divs of type courseblock, iterates through each of them,
    // extracting course information from children. 

    console.log('courseblock ' + (i + 1));

    var courseText = $('strong', '.courseblocktitle', elem).text(); // Gets the text that will be parsed 

    var regex = /([A-Z]{4}\s[A-Z]{1,2}\d{4})\s(.*?)(?:\.*)(\d{1,2}(?:\.?|-?)\d{0,2}\spoints?)/g;
    var regexGroups = Object.freeze({
      NUMBER: 1,
      NAME: 2,
      CREDITS: 3
    });

    var match, course;

    while ((match = regex.exec(courseText)) !== null) { // when regex.exec returns null, no more matches, and loop stops.
      course = {
        number: match[regexGroups.NUMBER],
        name: match[regexGroups.NAME],
        credits: match[regexGroups.CREDITS]
      };

      tempArr.push(course); // doesn't work-- result is array full of 'null'
      console.log(course); // but this outputs as a valid object, e.g. { number: 'AFAS W3030'... }
    }

  });

  console.log("Complete tempArr: " + tempArr); // outputs [object Object],[object Object],[object Object], etc. 

  for (var j of tempArr) {
    dataJSONObject.push(tempArr[j]);
    console.log('\ntempArray at ' + j + ': ' + tempArr[j]); // outputs [object Object]: undefined
  }

  console.log('\n');
}

When I first define tempArr as [] and output it to the console, I get the expected result [].

The objects I form from regex matches are also valid as expected at runtime.

However, when I try to push those objects to tempArr, and then print tempArr, it outputs as undefined.

I've been poking around other Stack Overflow questions, and I'm pretty sure my problem is that I'm pushing to tempArr from outside its scope. I've tried moving the declaration of tempArr around (e.g. putting it outside the function to make it global), but I still get undefined after pushing. What am I missing?

rcoppy
  • tempArray is `JSON`; instead of `console.log("Complete tempArr: " + tempArr);`, try `console.log("Complete tempArr: " + JSON.stringify(tempArr));` – Nishanth Matha Jan 06 '17 at 05:20
  • I think you need to manually set `regex.lastIndex = 0;`. Look at [this](http://stackoverflow.com/questions/11477415/why-does-javascripts-regex-exec-not-always-return-the-same-value) and [this](http://stackoverflow.com/questions/1520800/why-regexp-with-global-flag-in-javascript-give-wrong-results) – Sandeep Nayak Jan 06 '17 at 05:21

1 Answer


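The objects you're pushing into the array are there; you're just reading them out of the array incorrectly. A for...of loop puts each element's value in the variable you supply, not its index. Because j is itself one of the course objects, tempArr[j] looks up a property named after that object's string form ("[object Object]"), which doesn't exist, so the result is undefined.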
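To make the difference concrete, here is a minimal sketch of what the loop variable holds. The array contents are hypothetical stand-ins for the scraped courses (only 'AFAS W3030' appears in the question), and `entries()` is shown as one option for when the index is also wanted:

```javascript
// Hypothetical sample data standing in for the scraped course objects.
const sampleCourses = [{ number: 'AFAS W3030' }, { number: 'AFAS W4080' }];

const seen = [];
for (const j of sampleCourses) {
  // j is the element itself (an object), not a numeric index.
  seen.push(j);
}

// Using an object as a property key coerces it to the string
// '[object Object]'; the array has no property with that name.
const lookedUp = sampleCourses[sampleCourses[0]];

// If the index is needed as well, entries() yields [index, value] pairs.
const indexed = [];
for (const [i, course] of sampleCourses.entries()) {
  indexed.push(i + ': ' + course.number);
}

console.log(seen.length, lookedUp, indexed);
```

A classic `for (var j = 0; j < tempArr.length; j++)` loop would also work with `tempArr[j]`, since there `j` really is an index.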

Change your for...of loop to this:

for (var j of tempArr) {
    dataJSONObject.push(j);
    console.log('\ntempArray: ' + j);
}

Another way to put all the elements of one array into another is to use the spread syntax:

dataJSONObject.push(...tempArr);
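
As a rough illustration of why the spread matters here, the sketch below compares `push(...arr)` with a plain `push(arr)`. The names `target`, `nested`, and the sample objects are made up for the example and are not from the question:

```javascript
// Hypothetical data: 'target' plays the role of dataJSONObject,
// 'scraped' plays the role of tempArr.
const target = [{ number: 'EXIST X0001' }];
const scraped = [{ number: 'AFAS W3030' }, { number: 'AFAS W4080' }];

// Spread passes each element as its own argument, so the objects
// are appended individually.
target.push(...scraped);

// Without the spread, the whole array is appended as a single element.
const nested = [];
nested.push(scraped);

console.log(target.length, nested.length);
```

For very large arrays, spreading into `push` can hit the engine's argument-count limit, so a plain loop may be the safer choice in that case.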
4castle
  • That worked, thanks! The only other thing I had to do was stringify the `j` value inside `console.log()`. (I think `console.log({object})` implicitly calls `stringify()` on the passed object, but `console.log('string' + {object})` doesn't, because after concatenation the input is `string` instead of `JSON`.) – rcoppy Jan 06 '17 at 16:48
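
A small sketch of the logging difference mentioned in that comment, using a made-up `course` object. One caveat on the explanation: as far as I know, Node's `console.log` formats object arguments with its own inspection routine rather than `JSON.stringify`, but the practical upshot is the same, since concatenating with `+` converts the object to a string first:

```javascript
// Hypothetical course object for illustration.
const course = { number: 'AFAS W3030' };

// '+' calls the object's toString(), which gives '[object Object]'.
const concatenated = 'course: ' + course;

// JSON.stringify produces a readable serialization instead.
const serialized = 'course: ' + JSON.stringify(course);

// Passing the object as a separate argument lets console.log format it.
console.log('course:', course);
console.log(concatenated);
console.log(serialized);
```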