I'm trying to download a report from an API, but the data comes back as a zip archive containing a folder, which in turn contains a dozen or so gzipped JSON files. For clarity, it looks like this:

report.zip/
├── reportID/ <- this is a regular folder
│   ├── reportID_123.json.gz
│   ├── reportID_456.json.gz
│   ├── reportID_789.json.gz
│   └── reportID_159.json.gz

I'm trying to extract the outer archive, then decompress each individual file, and finally loop through, read the contents of each JSON file, and add them to a single object. But I'm having two issues.

The first is that while the first part of this code works (it extracts the outer archive and collects the name of each JSON file), and the second part appears to decompress them, every decompressed file ends up identical to the one before it (which isn't the case when I do the whole thing manually).

var zip = new AdmZip(tempFilePath);
var zipEntries = zip.getEntries(); // an array of ZipEntry records
const allZips = [];
const tempOutputPath = path.join(os.tmpdir(), 'output'); 

zip.extractAllTo(/*target path*/tempOutputPath, /*overwrite*/true);

zipEntries.forEach(function(zipEntry) {
    allZips.push(zipEntry.entryName);
});

console.log (allZips);
const allData = [];

for (var i = 0; i <= allZips.length; i++) {
    const zippedFileName = path.join(tempOutputPath, allZips[i]); 
    const finalOutputName = path.join(tempOutputPath, allZips[i].replace('.gz', ''));
    console.log(zippedFileName);
    const inp = fs.createReadStream(zippedFileName);
    const out = fs.createWriteStream(finalOutputName);
    inp.pipe(unzip).pipe(out);
    console.log('File piped successfully');
    console.log(finalOutputName);
    let rawData = fs.readFileSync(finalOutputName);
    let data = rawData.toString();
    console.log(data);
    allData.push(data);
}

The second problem is that even while looping through, it only manages to actually extract the data from some of the files, seemingly at random, even though the code and the files are identical apart from the name. It might have something to do with the fact that as soon as the loop finishes I get the following errors, despite the end of the loop also being the end of my code:

(node:15772) MaxListenersExceededWarning: Possible EventEmitter memory leak detected. 11 end listeners added. Use emitter.setMaxListeners() to increase limit
(node:15772) MaxListenersExceededWarning: Possible EventEmitter memory leak detected. 11 unpipe listeners added. Use emitter.setMaxListeners() to increase limit
(node:15772) MaxListenersExceededWarning: Possible EventEmitter memory leak detected. 11 drain listeners added. Use emitter.setMaxListeners() to increase limit
(node:15772) MaxListenersExceededWarning: Possible EventEmitter memory leak detected. 11 error listeners added. Use emitter.setMaxListeners() to increase limit
(node:15772) MaxListenersExceededWarning: Possible EventEmitter memory leak detected. 11 close listeners added. Use emitter.setMaxListeners() to increase limit
(node:15772) MaxListenersExceededWarning: Possible EventEmitter memory leak detected. 11 finish listeners added. Use emitter.setMaxListeners() to increase limit
(node:15772) MaxListenersExceededWarning: Possible EventEmitter memory leak detected. 11 data listeners added. Use emitter.setMaxListeners() to increase limit
(node:15772) UnhandledPromiseRejectionWarning: TypeError [ERR_INVALID_ARG_TYPE]: The "path" argument must be of type string. Received type undefined

One last thing, in case it's relevant: this all seems to work better (although still not perfectly) when hardcoding the file names instead of using temporary file paths. Unfortunately, I have to use temp file paths because this runs in a Cloud Function that doesn't allow regular paths.

Jul
  • `inp.pipe(unzip).pipe(out)`; what's the `unzip`? – Dan Starns Dec 13 '19 at 15:44
  • What is the point of everything inside the `for` loop. It looks like all you're doing is copying the file to a new name without the `.gz` extension? Is that all you're trying to do? If so, why not just rename the file? And, have you verified that `finalOutputName` is indeed what you want it to be and doesn't conflict with anything? – jfriend00 Dec 13 '19 at 16:59
  • Do you realize that the first thing this code does when you do this `var zip = new AdmZip(tempFilePath);` is read the entire file into memory with [`inBuffer = fs.readFileSync(filename);`](https://github.com/cthackers/adm-zip/blob/cb300c97f4b3b717bc7ba978e72d07b20968722e/zipFile.js#L17). First, that's synchronous I/O which is really bad for a server environment. Second, it's reading the entire file into memory when all you want to do is extract it to disk - unless you can guarantee that these are always small files, that's another bad thing to do in a server environment. – jfriend00 Dec 13 '19 at 17:18
  • @jfriend00 how else would I do this? I've googled for ages and can't seem to find any way of extracting the contents of a zip file without first reading the file into memory – Jul Dec 17 '19 at 09:22
  • There are a lot of zip packages, but both of these use streams and can extract a zip file without reading it into memory. It seems that all your code should be doing is extracting the zip file directly to your final location, not extracting it to a temporary location and then copying it somewhere. Here are two packages that seem to fit the bill: https://www.npmjs.com/package/extract-zip and https://www.npmjs.com/package/unzipper. – jfriend00 Dec 17 '19 at 15:51
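As a minimal sketch of the streaming extraction suggested in that last comment, using the unzipper package (this is only an illustration, not code from the thread, and it reuses tempFilePath and tempOutputPath from the question):

const fs = require('fs');
const unzipper = require('unzipper');

// Stream the downloaded archive straight to disk instead of reading it all
// into memory first; 'close' fires once everything has been written out.
fs.createReadStream(tempFilePath)
    .pipe(unzipper.Extract({ path: tempOutputPath }))
    .on('error', err => console.error('Extraction failed:', err))
    .on('close', () => console.log('Archive extracted to', tempOutputPath));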

1 Answer

.pipe() is asynchronous. So, this line of code:

inp.pipe(unzip).pipe(out);

finishes at some unknown time in the future. So, you're trying to read the output file with this:

fs.readFileSync(finalOutputName);

before you know that the output has finished. If you're going to use streams for this, then you need to watch for the close event on the write stream so you can THEN know that the .pipe() is completely done. You should also watch for the error event on your streams to implement proper error handling.
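For example, here is a minimal sketch of wrapping one decompress-and-write step in a promise. It assumes the missing unzip was meant to be a fresh zlib.createGunzip() stream per file, and gunzipFile is just a hypothetical helper name:

const fs = require('fs');
const zlib = require('zlib');

// Hypothetical helper: decompress one .gz file and resolve only once the
// write stream has emitted 'close' (so the output is safe to read),
// rejecting if any of the three streams reports an error.
function gunzipFile(srcPath, destPath) {
    return new Promise((resolve, reject) => {
        const inp = fs.createReadStream(srcPath);
        const gunzip = zlib.createGunzip(); // a fresh gunzip stream for each file
        const out = fs.createWriteStream(destPath);
        inp.on('error', reject);
        gunzip.on('error', reject);
        out.on('error', reject);
        out.on('close', () => resolve(destPath));
        inp.pipe(gunzip).pipe(out);
    });
}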

And even after implementing a close listener so you can read each output file, the way your code is written you will have to wait for all of the streams to fire their close event before you can use allData, because only then will it contain all the data.
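With a helper like the gunzipFile() sketched above, that waiting could look roughly like this (again only a sketch, reusing fs, path, allZips, and tempOutputPath from the question):

const jobs = allZips
    .filter(name => name.endsWith('.json.gz')) // skip the directory entry
    .map(name => {
        const src = path.join(tempOutputPath, name);
        const dest = path.join(tempOutputPath, name.replace('.gz', ''));
        return gunzipFile(src, dest);
    });

Promise.all(jobs).then(outputPaths => {
    // Every write stream has closed, so the files can be read safely now.
    const allData = outputPaths.map(p => fs.readFileSync(p, 'utf8'));
    console.log(allData.length, 'files read');
}).catch(err => {
    console.error('One of the files failed to decompress:', err);
});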

In trying to understand the flow of your code to perhaps suggest an alternative, I see this line of code:

inp.pipe(unzip).pipe(out);

But, there is no definition of the variable unzip.

Beyond the issue of creating all these temporary files, it would also be helpful to know what the end goal of this code is so we could perhaps suggest a better approach.


As a minor simplification, you can replace this:

const allZips = [];
zipEntries.forEach(function(zipEntry) {
    allZips.push(zipEntry.entryName);
});

with this:

const allZips = zipEntries.map(zipEntry => zipEntry.entryName);

As I try to understand this code better, it appears that all you're trying to do with your two streams and your .pipe() is copy the extracted files to a new location. You could do that a lot more simply with fs.copyFile() or fs.copyFileSync().
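For example (a sketch, if a plain copy really is all that's needed inside the loop):

// Copies the already-extracted file to the name without the .gz extension
// in one call, with no streams or event listeners to coordinate.
fs.copyFileSync(zippedFileName, finalOutputName);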

jfriend00