
I'm aware that I'm doing something computationally infeasible: essentially Get all substrings of a string in JavaScript, but with the bytes of a 1 MB exe instead of the characters of a string.

But I wanted to see how many bytes all the segments would add up to, at least until my program crashed. Well, it does crash, but I think my byte count is wrong.

const fs = require("fs");

const bytesPerKB = 1000;
const bytesPerMB = bytesPerKB * 1000;
const bytesPerGB = bytesPerMB * 1000;
function getAllSegments(buffer, skip = 1) {
  let i, j, result = [], bytes = 0;

  for (i = 0; i < buffer.length; i += skip) {
    if (i % 1000 === 0) console.log('getting ranges for byte', i, 'with a total of', bytes / bytesPerGB, 'GB stored')
    for (j = i + 1; j < buffer.length + 1; j++) {
      const entry = buffer.slice(i, j)
      bytes += entry.length
      result.push(entry);
    }
  }
  return result;
}

console.log('ready')

fs.promises.readFile('../data/scraped/test-1MB.exe').then(data => {
  console.log('read file', data)
  let segments = getAllSegments(data, 10000)
  console.log('segments', segments);
})

output:

[screenshot of the console output, with the logged running total reaching roughly 8 TB before the crash]

I'm pretty sure I don't have 8 TB of storage on my PC, much less 8 TB of swap space allocated. What'd I do wrong with the byte-counting math?

J.Todd
  • Surely there's a straightforward closed-form solution to the problem. Your code as it stands is horribly inefficient. How many "substrings" are there of length 1? Of length 2? Of length 3? If you think about the problem that way you don't have to write a program that crashes to get the answer. – Pointy Jul 01 '21 at 02:05
  • @Pointy that's normally true, but on SO we're encouraged to create a minimal reproducible example of our problem, which sometimes means excluding the more complicated intended uses of the example. In this case: actually iterating over the maximum feasible byte range and doing some analysis against each segment selected from that range. 1MB is a bit much, but this is a lot more feasible at 0.5MB. – J.Todd Jul 01 '21 at 14:09

1 Answer


For every single start position in the buffer, you accumulate the lengths of all possible substrings from that start position to the end of the buffer. Then you repeat that processing starting skip bytes further into the buffer. There are gazillions of duplicated, overlapping ranges being counted, so of course it all adds up to way more than the size of the file or the size of your memory.
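
To put a number on it, here's a minimal sketch (my illustration, not part of the original answer; countSegmentBytes is a made-up name) that computes the same total arithmetically instead of materializing any slices. For each start index i, the slice lengths are 1, 2, ..., (n - i), which sum to (n - i)(n - i + 1) / 2:

const bytesPerTB = 1e12;

// Hypothetical helper: tallies the bytes getAllSegments() would count,
// without allocating anything.
function countSegmentBytes(n, skip = 1) {
  let total = 0;
  for (let i = 0; i < n; i += skip) {
    const remaining = n - i; // slice lengths at this start: 1..remaining
    total += (remaining * (remaining + 1)) / 2;
  }
  return total;
}

console.log(countSegmentBytes(1_000_000, 10_000) / bytesPerTB, 'TB');
// ≈ 16.92 TB for a 1 MB buffer sampled every 10000 bytes

So a count in the terabytes is exactly what the math predicts; the 8 TB on screen is presumably just the running total at the moment the process died.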

As for memory usage, buffer.slice() returns a new Buffer object that references the original memory. That's why memory usage doesn't blow up anywhere near your byte count: each of your sub-buffers is not a separate copy of the data, just a small Buffer object that "points" into the existing buffer with an offset and a length.

From the doc for buffer.slice():

Returns a new Buffer that references the same memory as the original, but offset and cropped by the start and end indices.
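
A quick way to see the sharing in action (a standalone sketch, not from the original answer): writing through a slice is visible in the parent buffer, because both reference the same bytes.

const parent = Buffer.from('hello');
const view = parent.slice(1, 3); // no copy: a window onto parent's memory

view[0] = 0x45; // write 'E' through the view...
console.log(parent.toString()); // ...and the parent shows it: 'hEllo'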

jfriend00