
I am processing a folder of 400+ XML files, converting/reducing each one to a subset of its data as JSON, and then trying to insert them into MongoDB one JSON file at a time. (The files are too big to combine into one large JSON file and simply run mongoimport.)

The following code works OK for a path/folder containing a single XML file, apart from the output filename, which I think I can fix.

The problem is that it can only handle one file, which defeats the object. I'm not sure whether the issue is my inexperience with Node.js-style coding, or something MongoDB does that allows the file loop to continue inserting before the first insert has completed.

var fs = require('fs'),
    xml2js = require('xml2js');
var parser = new xml2js.Parser();

fs.readdir('/Users/urfx/data', function(err, files) {
    files.filter(function(file) { return file.substr(-4) == '.xml'; })
        .forEach(function(file) {
            fs.readFile(file, function(err, data) {
                // parse some xml files and return reduced set of JSON data (works)
                parser.parseString(data, function (err, result) {
                    var stuff = [inspectFile(result)];
                    var json = JSON.stringify(stuff); // returns a string containing the JSON structure by default
                    // make a file copy of the transformed data
                    fs.writeFile(file + '_establishments.json', json, function (err) {
                        if (err) throw err;
                        console.log('file saved!');
                        // write to mongoDB collection
                        fs.readFile(file + '_establishments.json', function(err, data) {
                            mongoInsert(data);
                        });
                    });
                });
            });
        });
});

Help! I'm going loopy on this one... it bombs with more than one file. Perhaps the problem is that MongoDB is still processing the first JSON array when the second one kicks off.
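
For context, mongoInsert isn't shown above. Purely as a hypothetical sketch (not my actual helper), here is roughly what such a function might look like with the callback-style MongoDB Node.js driver, using the same placeholder database and collection names as the shell script further down. The point is that the insert finishes asynchronously, so several inserts can be in flight at once inside a forEach loop.

// Hypothetical sketch only -- the question never shows mongoInsert.
// Uses the callback-style MongoClient API; the insert completes asynchronously,
// which is why several inserts can overlap inside a forEach loop.
var MongoClient = require('mongodb').MongoClient;

function mongoInsert(docs) {
    // docs: an array of documents (the code above passes the re-read JSON file,
    // so a real version would probably JSON.parse it first)
    MongoClient.connect('mongodb://localhost:27017/db_name_here', function (err, db) {
        if (err) throw err;
        db.collection('collection_name_here').insert(docs, function (err, result) {
            if (err) throw err;
            console.log('inserted into MongoDB');
            db.close();
        });
    });
}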

Following the pointers from tandrewnichols, I made the improvements below. I then ran into data errors (perhaps I always had them). It does look like a Mongo issue, because all of the JSON files import OK one by one. I'm out of time and can't get to the bottom of it, since the individual .json files are too large to compare visually and too different to diff ;)

So I repurposed this routine just to spit out the .json files (commented out the line that writes to Mongo), then ran a simple shell script that uses mongoimport; I'll append that below as well. That got me where I needed to be.

All things (.json files) being equal, the changes below now work, so thanks again tandrewnichols.

My solution uses a serial loop over the fs operations as opposed to a parallel one (see my comments).

fs.readdir(path, function(err, files) {
    files = files.filter(function(file) { return file.substr(-4) == '.xml'; });
    var i = 0;
    (function next() {
        var file = files[i++];
        if (!file) return console.log(null, "end of dir");
        file = path + file;
        fs.readFile(file, function(err, data) {
            // parse some xml files and return reduced set of JSON data (works)
            parser.parseString(data, function (err, result) {
                console.log("3. result = " + result);
                var stuff = xmlToJSON(result);
                var json = JSON.stringify(stuff); // returns a string containing the JSON structure by default
                // make a file copy of the transformed data
                var fileName = file.replace('.xml', '_establishments.json');
                fs.writeFile(fileName, json, function (err) {
                    if (err) throw err;
                    console.log(fileName + ' saved!'); // thanks to tandrewnichols
                });
                mongoInsert(stuff); // turns out I have some voodoo in json file output
                next();
            });
        });
    })();
});
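
One caveat with the loop above: next() is called right after mongoInsert(stuff), so the Mongo insert itself can still overlap the next file. Purely as a sketch, if mongoInsert were changed to accept a completion callback (its signature isn't shown here), the tail of the loop could wait for the insert as well:

// Hypothetical variant: mongoInsertCb is an assumed helper that calls back once
// the driver confirms the insert, so next() only runs after the write *and* the insert.
fs.writeFile(fileName, json, function (err) {
    if (err) throw err;
    console.log(fileName + ' saved!');
    mongoInsertCb(stuff, function (err) { // assumed signature: (docs, callback)
        if (err) throw err;
        next();                           // advance to the next XML file
    });
});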

Here is the shell script.

for i in *.json; do
    mongoimport -d db_name_here -c collection_name_here --type json --file "$i" --jsonArray
done
urfx
  • Getting late here, so I will sleep on it... this thread might help me: http://stackoverflow.com/questions/5827612/node-js-fs-readdir-recursive-directory-search – urfx Dec 05 '13 at 23:47
  • I mean, perhaps I am doing a parallel loop here with the readdir rather than a serial one? – urfx Dec 05 '13 at 23:50
  • Not clear what your question is / what the specific problem is. Is there an error? – WiredPrairie Dec 06 '13 at 00:46
  • @WiredPrairie it was one of those late-at-night things... in summary, I think the problem turned out to be that MongoDB's Node.js driver doesn't like being on the end of a parallel loop... but don't quote me on it. I've shared the 'fix' in my original question. – urfx Dec 06 '13 at 14:58

1 Answer


I'm surprised that works even once. fs.readFile requires an actual path. You would need to do something like:

fs.readFile('/Users/urfx/data/' + file, function(err, data) {
    // . . .
});

That may or may not be the answer to your problem, but given the code example it seemed better to put this in an answer than in a comment.

EDIT: If you're really worried that Mongo may still be processing the first file when later files hit, you could try using the async module (its "eachSeries" method, instead of the Array forEach, ensures that each file waits for the previous ones to finish), as sketched below.
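
A rough sketch of that approach (untested against your data; it assumes the same parser, inspectFile, and mongoInsert helpers from your question):

var fs = require('fs'),
    async = require('async'),   // npm install async
    xml2js = require('xml2js');
var parser = new xml2js.Parser();
var dir = '/Users/urfx/data/';

fs.readdir(dir, function(err, files) {
    if (err) throw err;
    var xmlFiles = files.filter(function(file) { return file.substr(-4) == '.xml'; });

    async.eachSeries(xmlFiles, function(file, done) {
        fs.readFile(dir + file, function(err, data) {
            if (err) return done(err);
            parser.parseString(data, function(err, result) {
                if (err) return done(err);
                var json = JSON.stringify([inspectFile(result)]);
                var outName = dir + file.replace('.xml', '_establishments.json');
                fs.writeFile(outName, json, function(err) {
                    if (err) return done(err);
                    mongoInsert(json);  // if mongoInsert took a callback, you could pass done to it instead
                    done();             // eachSeries only starts the next file after this is called
                });
            });
        });
    }, function(err) {
        if (err) throw err;
        console.log('all files processed');
    });
});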

Also note that file + '_establishments.json' is going to end up looking like "somefile.xml_establishments.json", and with writeFile (like readFile) you need a path. Maybe:

'/Users/urfx/data/' + file.replace('.xml', '_establishments.json');
tandrewnichols
  • Yes, good point, and it helps me avoid another bug further along. However (and as you said) that's not my problem; the reason it worked as-is was that the script was being executed in the data folder. So thanks for spotting that error, much appreciated. The async module is a good call. I had tried it but it didn't work; I didn't try eachSeries though, so let me try that and report back ASAP. Thanks again! – urfx Dec 06 '13 at 08:36