
The goal: Upload large files to AWS Glacier without holding the whole file in memory.

I'm currently uploading to Glacier using fs.readFileSync() and things are working. But I need to handle files larger than 4GB, and I'd like to upload multiple chunks in parallel. This means moving to multipart uploads. I can choose the chunk size, but Glacier requires every chunk to be the same size (except the last).

This thread suggests that I can set a chunk size on a read stream but that I'm not actually guaranteed to get it.

Any info on how I can get consistent parts without reading the whole file into memory and splitting it up manually?

Assuming I can get to that point, I was just going to use cluster with a few processes pulling off the stream as fast as they can upload to AWS. If that seems like the wrong way to parallelize the work, I'd love suggestions there. A rough sketch of what I'm picturing is below.
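For context, here's roughly the shape of what I have in mind: read the file in exact fixed-size pieces and keep a few uploads in flight at once. readChunkAt and uploadPart below are just placeholders (not real SDK calls), so this is only a sketch of the control flow:

var CHUNK_SIZE = 8 * 1024 * 1024; // every part the same size except the last
var PARALLEL = 4;                 // how many parts to keep in flight

// readChunkAt(path, position, size, cb) and uploadPart(index, data, cb) are
// placeholders for "read one fixed-size chunk" and "upload one part".
function uploadFile(filePath, fileSize, done) {
  var totalParts = Math.ceil(fileSize / CHUNK_SIZE);
  var nextPart = 0, inFlight = 0, finished = false;

  function fail(err) {
    if (!finished) { finished = true; done(err); }
  }

  function pump() {
    while (inFlight < PARALLEL && nextPart < totalParts && !finished) {
      (function (part) {
        inFlight++;
        readChunkAt(filePath, part * CHUNK_SIZE, CHUNK_SIZE, function (err, data) {
          if (err) return fail(err);
          uploadPart(part, data, function (err) {
            if (err) return fail(err);
            inFlight--;
            if (nextPart === totalParts && inFlight === 0 && !finished) {
              finished = true;
              return done(null);
            }
            pump();
          });
        });
      })(nextPart++);
    }
  }

  pump();
}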

kjs3

4 Answers


If nothing else, you can just use fs.open(), fs.read(), and fs.close() manually. Example:

var fs = require('fs');

var CHUNK_SIZE = 10 * 1024 * 1024, // 10MB
    buffer = Buffer.alloc(CHUNK_SIZE),
    filePath = '/tmp/foo';

fs.open(filePath, 'r', function(err, fd) {
  if (err) throw err;
  function readNextChunk() {
    fs.read(fd, buffer, 0, CHUNK_SIZE, null, function(err, nread) {
      if (err) throw err;

      if (nread === 0) {
        // done reading file, do any necessary finalization steps

        fs.close(fd, function(err) {
          if (err) throw err;
        });
        return;
      }

      var data;
      if (nread < CHUNK_SIZE)
        data = buffer.slice(0, nread);
      else
        data = buffer;

      // do something with `data`, then call `readNextChunk();`
    });
  }
  readNextChunk();
});
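For newer Node versions, the same pattern can also be written against the promise-based fs API (fs.promises). A minimal sketch, assuming an onChunk callback that handles each chunk before the next read:

const fsp = require('fs').promises;

const CHUNK_SIZE = 10 * 1024 * 1024; // 10MB

async function readInChunks(filePath, onChunk) {
  const fileHandle = await fsp.open(filePath, 'r');
  const buffer = Buffer.alloc(CHUNK_SIZE);
  try {
    while (true) {
      // read up to CHUNK_SIZE bytes from the current file position
      const { bytesRead } = await fileHandle.read(buffer, 0, CHUNK_SIZE, null);
      if (bytesRead === 0) break;                   // end of file
      await onChunk(buffer.subarray(0, bytesRead)); // process this chunk before the next read
    }
  } finally {
    await fileHandle.close();
  }
}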
mscdex
  • While this answer is technically correct, using a regular file descriptor forgoes all of the stream event goodness, which is really useful when you are trying to read/write a file in an async code base. – jlyonsmith Jul 13 '17 at 16:18
  • Running this code today causes the following warning: "(node:25440) [DEP0005] DeprecationWarning: Buffer() is deprecated due to security and usability issues. Please use the Buffer.alloc(), Buffer.allocUnsafe(), or Buffer.from() methods instead." – theycallmemorty Sep 04 '19 at 19:12
  • @theycallmemorty This answer was from well before it was deprecated. I've updated it now for modern versions of node. – mscdex Sep 04 '19 at 19:48
  • You may prefer the createReadStream solution, which is more readable. – Poyoman Sep 14 '20 at 13:36
  • @Poyoman This solution is the only way to _guarantee_ chunk sizes, which is what the OP asked for. If you don't need specific chunk sizes, then yes, streaming is much easier. – mscdex Sep 14 '20 at 18:05
  • I would suggest replacing buffer.slice with buffer.subarray (https://nodejs.org/api/buffer.html#bufsubarraystart-end). I had trouble reading some pdf files with the above approach. – vijay shiyani Nov 22 '22 at 06:00

You may consider using the snippet below, where we read the file in chunks of 1024 bytes:

var fs = require('fs');

var data = '';

var readStream = fs.createReadStream('/tmp/foo.txt', { highWaterMark: 1 * 1024, encoding: 'utf8' });

readStream.on('data', function(chunk) {
    data += chunk;
    console.log('chunk Data : ');
    console.log(chunk); // your processing chunk logic will go here
}).on('end', function() {
    console.log('###################');
    console.log(data);
    // here you see all data processed at end of file
});

Please note: highWaterMark is the parameter used for the chunk size. Hope this helps!

Web references: https://stackabuse.com/read-files-with-node-js/ and "Changing readstream chunksize".
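One caveat: highWaterMark is only a hint, and a stream chunk is not guaranteed to be exactly that size. If every chunk must be exactly the same size (as the question requires), one option is to re-buffer the stream yourself. A rough sketch (no encoding option here, so we work with raw Buffers):

var fs = require('fs');

var CHUNK_SIZE = 1 * 1024; // 1KB, matching the snippet above

function readFixedChunks(filePath, onChunk, onEnd) {
    var leftover = Buffer.alloc(0);

    fs.createReadStream(filePath).on('data', function (chunk) {
        // accumulate whatever the stream gives us...
        leftover = Buffer.concat([leftover, chunk]);
        // ...and hand out exact CHUNK_SIZE pieces
        while (leftover.length >= CHUNK_SIZE) {
            onChunk(leftover.subarray(0, CHUNK_SIZE));
            leftover = leftover.subarray(CHUNK_SIZE);
        }
    }).on('end', function () {
        if (leftover.length > 0) onChunk(leftover); // final short chunk
        onEnd();
    });
}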

  • Finally a good example of Node streams that doesn't look like way too abstract nonsense and doesn't bore me explaining what binary numbers are. – Alexandre May 03 '20 at 10:52

Based on mscdex's answer, here's a module using the sync alternative, with a StringDecoder to correctly parse UTF-8.

The problem with a readable stream is that, in order to use it, you have to convert the entire project to async emitters & callbacks. If you're coding something simple, like a small CLI in Node.js, that doesn't make sense.

//usage
let file = new UTF8FileReader()
file.open('./myfile.txt', 1024) 
while ( file.isOpen ) {
    let stringData=file.readChunk()
    console.log(stringData)
}


//--------------------
// UTF8FileReader.ts
//--------------------
import * as fs from 'fs';
import { StringDecoder, NodeStringDecoder } from "string_decoder";

export class UTF8FileReader {

    filename: string;
    isOpen: boolean = false;
    private chunkSize: number;
    private fd: number; //file handle from fs.OpenFileSync
    private readFilePos: number;
    private readBuffer: Buffer;

    private utf8decoder: NodeStringDecoder

    /**
     * open the file | throw
     * @param filename
     */
    open(filename, chunkSize: number = 16 * 1024) {

        this.chunkSize = chunkSize;

        try {
            this.fd = fs.openSync(filename, 'r');
        }
        catch (e) {
            throw new Error("opening " + filename + ", error:" + e.toString());
        }

        this.filename = filename;
        this.isOpen = true;

        this.readBuffer = Buffer.alloc(this.chunkSize);
        this.readFilePos = 0;

        //a StringDecoder is a buffered object that ensures complete UTF-8 multibyte decoding from a byte buffer
        this.utf8decoder = new StringDecoder('utf8')

    }

    /**
     * read another chunk from the file 
     * return the decoded UTF8 into a string
     * (or throw)
     * */
    readChunk(): string {

        let decodedString = '' //return '' by default

        if (!this.isOpen) {
            return decodedString;
        }

        let readByteCount: number;
        try {
            readByteCount = fs.readSync(this.fd, this.readBuffer, 0, this.chunkSize, this.readFilePos);
        }
        catch (e) {
            throw new Error("reading " + this.filename + ", error:" + e.toString());
        }

        if (readByteCount) {
            //some data read, advance readFilePos 
            this.readFilePos += readByteCount;
            //get only the read bytes (if we reached the end of the file)
            const onlyReadBytesBuf = this.readBuffer.slice(0, readByteCount);
            //correctly decode as utf8, and store in decodedString
            //yes, the api is called "write", but it decodes a string - it's a write-decode-and-return the string kind-of-thing :)
            decodedString = this.utf8decoder.write(onlyReadBytesBuf); 
        }
        else {
            //read returns 0 => all bytes read
            this.close();
        }
        return decodedString 
    }

    close() {
        if (!this.isOpen) {
            return;
        }
        fs.closeSync(this.fd);
        this.isOpen = false;
        this.utf8decoder.end();
    }

}

And here is the transpiled .js code, if you don't have TypeScript yet:

// UTF8FileReader.js
"use strict";
Object.defineProperty(exports, "__esModule", { value: true });
exports.UTF8FileReader = void 0;
//--------------------
// UTF8FileReader
//--------------------
const fs = require("fs");
const string_decoder_1 = require("string_decoder");
class UTF8FileReader {
    constructor() {
        this.isOpen = false;
    }
    /**
     * open the file | throw
     * @param filename
     */
    open(filename, chunkSize = 16 * 1024) {
        this.chunkSize = chunkSize;
        try {
            this.fd = fs.openSync(filename, 'r');
        }
        catch (e) {
            throw new Error("opening " + filename + ", error:" + e.toString());
        }
        this.filename = filename;
        this.isOpen = true;
        this.readBuffer = Buffer.alloc(this.chunkSize);
        this.readFilePos = 0;
        //a StringDecoder is a buffered object that ensures complete UTF-8 multibyte decoding from a byte buffer
        this.utf8decoder = new string_decoder_1.StringDecoder('utf8');
    }
    /**
     * read another chunk from the file
     * return the decoded UTF8 into a string
     * (or throw)
     * */
    readChunk() {
        let decodedString = ''; //return '' by default
        if (!this.isOpen) {
            return decodedString;
        }
        let readByteCount;
        try {
            readByteCount = fs.readSync(this.fd, this.readBuffer, 0, this.chunkSize, this.readFilePos);
        }
        catch (e) {
            throw new Error("reading " + this.filename + ", error:" + e.toString());
        }
        if (readByteCount) {
            //some data read, advance readFilePos 
            this.readFilePos += readByteCount;
            //get only the read bytes (if we reached the end of the file)
            const onlyReadBytesBuf = this.readBuffer.slice(0, readByteCount);
            //correctly decode as utf8, and store in decodedString
            //yes, the api is called "write", but it decodes a string - it's a write-decode-and-return the string kind-of-thing :)
            decodedString = this.utf8decoder.write(onlyReadBytesBuf);
        }
        else {
            //read returns 0 => all bytes read
            this.close();
        }
        return decodedString;
    }
    close() {
        if (!this.isOpen) {
            return;
        }
        fs.closeSync(this.fd);
        this.isOpen = false;
        this.utf8decoder.end();
    }
}
exports.UTF8FileReader = UTF8FileReader;
Lucio M. Tato

I would suggest something like this, as buffer.slice is deprecated, and for me it had issues while reading big PDF files.

import {promisify} from 'node:util';
import fs from 'node:fs';
import {Buffer} from 'node:buffer';
import pify from 'pify';

const fsReadP = pify(fs.read, {multiArgs: true});
const fsOpenP = promisify(fs.open);
const fsCloseP = promisify(fs.close);

export async function readChunk(filePath, {length, startPosition}) {
    const fileDescriptor = await fsOpenP(filePath, 'r');

    try {
        let [bytesRead, buffer] = await fsReadP(fileDescriptor, {
            buffer: Buffer.alloc(length),
            length,
            position: startPosition,
        });

        if (bytesRead < length) {
            buffer = buffer.subarray(0, bytesRead);
        }

        return buffer;
    } finally {
        await fsCloseP(fileDescriptor);
    }
}

export function readChunkSync(filePath, {length, startPosition}) {
    let buffer = Buffer.alloc(length);
    const fileDescriptor = fs.openSync(filePath, 'r');

    try {
        const bytesRead = fs.readSync(fileDescriptor, buffer, {
            length,
            position: startPosition,
        });

        if (bytesRead < length) {
            buffer = buffer.subarray(0, bytesRead);
        }

        return buffer;
    } finally {
        fs.closeSync(fileDescriptor);
    }
}
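A possible way to drive this over a whole file (a usage sketch; the module path is assumed): keep advancing startPosition by the chunk length until a short or empty chunk comes back.

import {readChunkSync} from './read-chunk.js'; // assumed path to the module above

const CHUNK_SIZE = 10 * 1024 * 1024; // 10MB
let position = 0;

while (true) {
    const chunk = readChunkSync('/tmp/foo', {length: CHUNK_SIZE, startPosition: position});
    if (chunk.length === 0) break;        // nothing left to read
    // ...process `chunk` here...
    position += chunk.length;
    if (chunk.length < CHUNK_SIZE) break; // a short read means we hit the end of the file
}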
vijay shiyani