
Edit: sorry, I will update with the latest code later; assume tmpRaw has already been generated. Either way it runs out of memory in the same way. I guess this is unfeasible and there is nothing to be done about it, since any compression would have to happen inside the tf-idf algorithm, which references the whole dataset anyway.

By "a lot of small arrays" I mean not storing one array of arrays, but many independent references, because I'm feeding them to an ML algorithm (tf-idf).

My data is of this sort:

[[1115, 585], [69], [5584, 59, 99], ...]

In practice I don't have the full data at once, only a generator that yields one sub-array at a time.
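
For example, a minimal sketch of what such a generator could look like (readDocs and the literal values are hypothetical):

function* readDocs() {
    // each yield produces one small sub-array of word ids,
    // never the whole dataset at once
    yield [1115, 585];
    yield [69];
    yield [5584, 59, 99];
    // ...in reality this streams from the source data
}

for (const sub of readDocs()) {
    console.log(sub); // each sub-array is handled one at a time
}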

I haven't done the complexity and memory calculations, but I'm not far from the goal on a local machine (I can process about 1/3 of the data before running out of memory).

I tried the following:

[Int32Array.from([1115, 585]), ...] but it didn't help. Maybe typed arrays are not the best fit when you have many small arrays.

Processing time doesn't matter much, as the algorithm runs fairly fast for now.

There may be a lot of repeated values, so I could build a dictionary, but unfortunately there would be collisions with real values. For example:

1115 -> 1, 585 -> 2

But 1 and 2 might already be taken by real values fed to the tf-idf algorithm.
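
For illustration with made-up numbers (the Map and the two documents below are purely an example), this is the kind of collision:

// remap only some frequent values into small ids
const remap = new Map([[1115, 1], [585, 2]]);

const docA = [1115, 585].map(v => remap.has(v) ? remap.get(v) : v); // becomes [1, 2]
const docB = [1, 2]; // real, untouched values elsewhere in the data

// docA and docB are now indistinguishable to the tf-idf algorithm
console.log(docA, docB); // [ 1, 2 ] [ 1, 2 ]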

Finally I'm running on Node.

Edit

// co counts documents; bm is the BM25 tf-idf implementation (modified a little)
// from https://gist.github.com/alixaxel/4858240e0ca20802af43
let co = 0;

for (let index = 0; index < tmpRaw.length; index++) {
    const nex = tmpRaw[index];
    // nex is a title split into words, e.g. ["hello", "world"]
    try {
        // indexStrGetter turns each word into its id using a prior dictionary,
        // e.g. ["hello", "world"] -> [848, 8993]; unknown words map to a falsy value
        const doc = { id: co++, body: nex.map(indexStrGetter).filter(Boolean) };
        // the following is nearly the same (memory gets exhausted either way):
        // const doc = { id: co++, body: Int32Array.from(nex.map(indexStrGetter)).filter(num => num > 0) };
        if (!doc.body.length) {
            co--;
            continue;
        }
        bm.addDocument(doc);
    } catch (error) {
        console.log(error);
        console.log(nex);
    }
}
Curcuma_
  • one thought is to build a dictionary using the biggest Int32Array values, starting from the end, but I feel I'm going too far with this – Curcuma_ Sep 02 '22 at 09:07
  • Sharing your current code can be helpful – Kemal Cengiz Sep 02 '22 at 09:25
  • I guess this is unfeasible and there is nothing to be done about it, as any compression would have to happen inside the tf-idf algorithm – Curcuma_ Sep 02 '22 at 09:52
  • Do you really need to store the whole thing in memory simultaneously? – Ouroborus Sep 02 '22 at 09:56
  • Where does the data come from? *I didn't do the [...] memory calculations* - when you are running out of memory, it could make sense to have an estimate... – tevemadar Sep 02 '22 at 09:58
  • By mentioning the 1/3 I thought the order of magnitude was enough to look for a solution. The data is Wikipedia titles, around 300 MB before cleansing; the tf-idf model becomes a lot bigger. The reason these are arrays of numbers instead of strings is that I already built a dictionary of unique words. But see my answer, as I think there's not much to be done about it – Curcuma_ Sep 02 '22 at 10:06
  • How did you arrive at the conclusion that the array is the memory culprit? See https://stackoverflow.com/questions/20018588/how-to-monitor-the-memory-usage-of-node-js for guidance on narrowing memory leaks / exhaustion... – Trentium Sep 02 '22 at 17:01
  • I came to the conclusion that it is simply the data allocation and there is no way to compress it, given that I use tf-idf for my goal. As for the cause, it is clearly the arrays being fed to tf-idf: not the loop itself, they just accumulate inside tf-idf – Curcuma_ Sep 02 '22 at 18:29

3 Answers


If you need to iterate over the outer array at any given moment, I think the best way to improve efficiency is to generate each sub-array inside the loop, so that when an iteration ends the variable can be garbage-collected and a lot of memory is saved.
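
A rough sketch of that pattern, reusing tmpRaw, indexStrGetter and bm from the question (co is assumed to be declared outside the loop):

for (const nex of tmpRaw) {
    // body is created inside the loop, so it becomes unreachable
    // as soon as this iteration ends...
    const body = nex.map(indexStrGetter).filter(Boolean);
    if (!body.length) continue;
    // ...unless something else, like addDocument here, keeps a reference to it
    bm.addDocument({ id: co++, body });
}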

  • The sub-arrays are yielded and consumed instantly (in the top-level code inside the loop). I think generated values (sub-arrays) are automatically un-referenced, but only if they are not used elsewhere; and I do use them, since I feed the tf-idf algorithm with the sub-arrays – Curcuma_ Sep 02 '22 at 09:18
  • This is exactly the problem: the generated values themselves are fine because they are produced lazily, but I have to feed them to the tf-idf algorithm, which then stores them – Curcuma_ Sep 02 '22 at 09:20

You can use a 1-dimensional array with a separator (for example null) after each sub-array. Then your array becomes:

[1115, 585, null, 69, null, 5584, 59, 99, null, ...]

This way you get rid of a lot of sub-array pointers.
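
A minimal sketch of this idea; docs stands in for the question's generator, and with an Int32Array you would need a numeric sentinel (for example -1) instead of null, since typed arrays cannot hold null:

const flat = [];
for (const sub of docs) {        // docs yields [[1115, 585], [69], ...]
    flat.push(...sub, null);     // null marks the end of each sub-array
}

// read the sub-arrays back without keeping per-document array objects around
let start = 0;
for (let i = 0; i < flat.length; i++) {
    if (flat[i] === null) {
        const body = flat.slice(start, i);  // one document's ids
        start = i + 1;
        // feed body to the tf-idf model here
    }
}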

Kemal Cengiz

As I have thought about it, there is not much that can be done to save memory; the data simply needs that minimum.

All I can do is build sub-models (tf-idf) from chunks of the data, at the cost of a bit of learning accuracy (see the sketch at the end of this answer).

If Node does efficiently store identical strings within a single process, as shown here: https://levelup.gitconnected.com/bytefish-vs-new-string-bytefish-what-is-the-difference-a795f6a7a08b

then the last optimization would be to not convert words to numbers, as I've been doing.
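
A rough sketch of the chunking idea (CHUNK_SIZE and newModel are placeholders, the latter a hypothetical factory around the linked gist's tf-idf class; tmpRaw and indexStrGetter are from the question):

const CHUNK_SIZE = 100000;   // placeholder chunk size
const models = [];
let bm = newModel();         // hypothetical factory for the tf-idf class
let count = 0;

for (const nex of tmpRaw) {
    const body = nex.map(indexStrGetter).filter(Boolean);
    if (!body.length) continue;
    bm.addDocument({ id: count++, body });
    if (count % CHUNK_SIZE === 0) {
        models.push(bm);     // close the current sub-model...
        bm = newModel();     // ...and open a fresh one for the next chunk
    }
}
models.push(bm);             // keep the last, possibly partial, sub-model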

Curcuma_