
I have to open a very large file (~15GB), read the whole thing with fs.readFileSync, and then put it into a hashmap keyed on a UUID to dedup the records. But I soon hit the issue that I can't read the whole file into memory because of a V8 limit!

I tried to allow a larger heap using --max-old-space-size, but it still doesn't work.

Why is that?

Is this a limitation in Node.js, or am I missing something?

I have 64GB RAM in my machine.

For example, there is a large file data.txt with the following format, and I have to dedup based on the uuid:

new record
field_separator
1fd265da-e5a6-11ea-adc1-0242ac120002 <----uuid
field_separator
Bob
field_separator
32
field_separator
Software Engineer
field_separator
Workday
point_separator
new record
field_separator
5396553e-e5a6-11ea-adc1-0242ac120002
field_separator
Tom
field_separator
27
field_separator
QA Engineer
field_separator
Synopsis
point_separator
........

There is another small file (200 MB) which contains the same UUIDs with different values. I have to do lookups in it using the UUIDs from the above-mentioned file.

The script is just a one-time processing job.
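For reference, this is roughly what I attempted (simplified; the script name, file name and heap size below are just placeholders):

// dedupe.js -- launched with something like: node --max-old-space-size=32768 dedupe.js
const fs = require('fs');

// This is where it fails: readFileSync cannot return ~15GB as a single string.
const text = fs.readFileSync('data.txt', 'utf8');

// Dedup by keeping the first record seen for each UUID.
const byUuid = new Map();
for (const record of text.split('point_separator')) {
  const fields = record.split('field_separator').map(f => f.trim());
  if (fields.length < 2) continue; // skip empty/trailing chunks
  const uuid = fields[1];          // the field right after "new record"
  if (!byUuid.has(uuid)) byUuid.set(uuid, record);
}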

Exploring
  • How much memory do you have in your system? What exact output are you trying to achieve? We can only really help you with alternative methods if we can see the actual data and the actual operation you're trying to achieve. It is unlikely that the best way to achieve this is reading the entire file into memory at once. – jfriend00 Aug 24 '20 at 00:52
  • @jfriend00 I have 64GB RAM in my system. So if there is a way to put the complete file into memory with node, memory should not be the issue. – Exploring Aug 24 '20 at 00:58
  • What type of file is this? If you can't stream it, can you use a memory-mapped file? – Brad Aug 24 '20 at 01:04
  • it's a text file – Exploring Aug 24 '20 at 01:06
  • What kind of text file? – Brad Aug 24 '20 at 01:06
  • Well, there should be no need to read the entire file into memory at once. As I said above, if you show us the actual data and what you're trying to accomplish, then we can help you with better strategies for processing the large file. I've written algorithms to dedup 100,000,000,000 items before and I had to use a partitioning and disk-based structure to do it efficiently. We can't know what to recommend for you without seeing your actual problem. – jfriend00 Aug 24 '20 at 01:09
  • I would say that no garbage-collected system is very optimized for dealing with gazillions of objects efficiently, because peak memory usage can be a lot higher in a garbage-collected system than in a manually managed one (such as C++). – jfriend00 Aug 24 '20 at 01:10
  • @jfriend00 added an example and explained it further. – Exploring Aug 24 '20 at 01:50
  • What output are you trying to achieve? – jfriend00 Aug 24 '20 at 02:09
  • after deduping the first file I need to merge the result with the second small file. – Exploring Aug 24 '20 at 02:18
  • So, are you just trying to append records to the smaller file whose UUID is unique (not already present in the smaller file)? – jfriend00 Aug 24 '20 at 02:20
  • @jfriend00 yes - but please note in the big file some points are repeated and I need to merge the records based on UUID from the big file. – Exploring Aug 24 '20 at 02:26
  • Merge what? Merge records from the big file with the same UUID into one record which is then added to the small file? Or merge data from the big file with a common UUID with data that is already in the small file? What are the merge rules? Does big file data overwrite fields that already existed? What about when records in the large file have the same UUID and common fields? Which fields win? – jfriend00 Aug 24 '20 at 02:34

2 Answers


Node documentation states the maximum buffer size is ~1GB on 32-bit systems and ~2GB on 64-bit systems.

You can also search Stack Overflow for questions about the maximum size of objects or heap memory used by V8, the JavaScript engine used in Node.js.

I suspect the chance of reading a 15GB file into memory and creating objects from its entire content is about zero, and that you will need to look at alternatives to fs.readFileSync (such as reading a stream, using a database, or using a different server).

It may be worth verifying that the "available" memory values in the heap statistics reflect the size set using the CLI option --max-old-space-size. Heap statistics can be generated by running

const v8 = require("v8");
// Per-space and overall heap statistics; heap_size_limit shows the limit actually in effect.
console.log(v8.getHeapSpaceStatistics());
console.log(v8.getHeapStatistics());

in Node.

A question answered in 2017 asked about increasing the fixed limit on string size. It may have been raised since then, but comment 9 in (closed) issue 6148 said it was unlikely to ever go beyond the limit of 32-bit addressing (4GB).

Without changes to buffer and string size limits, fs.readFileSync cannot read and return the contents of a 16GB file as a string or buffer.
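To verify the buffer and string limits on your own build rather than relying on documentation, the built-in buffer module exposes them as constants (exact values vary by Node version):

const { constants } = require('buffer');

// Largest Buffer this Node build can allocate.
console.log('max buffer length:', constants.MAX_LENGTH);
// Longest string V8 will create, in UTF-16 code units.
console.log('max string length:', constants.MAX_STRING_LENGTH);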

traktor
  • so what does the `--max-old-space-size` flag do then? I thought I could pass a larger memory limit with this flag. – Exploring Aug 24 '20 at 01:05
  • @Exploring an issue of interest - please see the updated reply. – traktor Aug 24 '20 at 03:16
  • @Exploring As they say, trust but verify. Writing test code to find the maximum buffer/typed array size that can be used is good - documentation on the web may not always be applicable or the latest. Even if making file processing asynchronous may be a better option, you should be able to _synchronously_ read the file a buffer at a time by opening it and positioning the read with [fs.readvSync(fd, buffers[, position])](https://nodejs.org/api/fs.html#fs_fs_readvsync_fd_buffers_position). In theory at least - it's a long time since I've used file descriptors. – traktor Aug 24 '20 at 03:46

If what you're trying to do is this:

Append records to the smaller file whose UUID is unique (not already present in the smaller file)

Then, I would suggest the following process (a rough code sketch follows the steps).

  1. Design a scheme for reading the next record from a file and parsing the data into a JavaScript object.
  2. Use that scheme to read through all the records in the smaller file (one record at a time), adding each UUID in that file to a Set object (for keeping track of uniqueness).
  3. After you're done with the small file, you now have a Set object containing all the already-known UUIDs.
  4. Now, use that same reading scheme to read each record (one at a time) from the larger file. If the record's UUID is not in the Set, add the UUID to the Set and append that record to the smaller file. If the UUID is already in the Set, skip the record.
  5. Continue reading records from the large file until you've checked them all.
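Here's a rough sketch of that process, assuming the line-oriented separator format shown in the question and placeholder file names (big.txt, small.txt); treat it as a starting point, not production code:

const fs = require('fs');
const readline = require('readline');

// Stream a file record-by-record: fields are separated by "field_separator" lines
// and each record ends with a "point_separator" line (the format from the question).
async function* readRecords(path) {
  const rl = readline.createInterface({
    input: fs.createReadStream(path),
    crlfDelay: Infinity,
  });
  let lines = [];
  for await (const line of rl) {
    if (line.trim() === 'point_separator') {
      if (lines.some(l => l.trim())) yield lines;
      lines = [];
    } else {
      lines.push(line);
    }
  }
  if (lines.some(l => l.trim())) yield lines; // last record may lack a trailing separator
}

// The UUID is the field right after the "new record" marker.
function getUuid(recordLines) {
  const fields = recordLines
    .map(l => l.trim())
    .filter(l => l && l !== 'field_separator');
  return fields[1];
}

async function run() {
  const seen = new Set();

  // Steps 1-3: collect every UUID already present in the small file.
  for await (const rec of readRecords('small.txt')) {
    seen.add(getUuid(rec));
  }

  // Steps 4-5: stream the big file and append only records with unseen UUIDs.
  // NOTE: write backpressure is ignored here for brevity.
  const out = fs.createWriteStream('small.txt', { flags: 'a' });
  for await (const rec of readRecords('big.txt')) {
    const uuid = getUuid(rec);
    if (seen.has(uuid)) continue;
    seen.add(uuid);
    out.write(rec.join('\n') + '\npoint_separator\n');
  }
  out.end();
}

run().catch(console.error);

Only the UUIDs are kept in memory, so memory use scales with the number of distinct records rather than with the 15GB of raw data; if even that grows too large, this is where a bigger --max-old-space-size (or the partitioning approach mentioned in the comments) comes in.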
jfriend00