Edit: sorry, I will update with the latest code later; assume tmpRaw is already generated. Either way it runs out of memory in much the same way. I'm starting to think this is infeasible and there is nothing to be done about it, since any compression would have to happen inside the tf-idf algorithm itself: the whole dataset ends up referenced there anyway.
By "a lot of small arrays" I mean not storing one array of arrays but genuinely independent references, because I'm feeding them to an ML algorithm (tf-idf).
My data is of this sort:
[[1115, 585], [69], [5584, 59, 99], ...]
In practice I don't have the full data at once, only a generator that yields one sub-array at a time.
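To make that concrete, the producer side looks roughly like this (a minimal sketch with made-up names and values, not my actual code):

const dictionary = new Map([['hello', 848], ['world', 8993]]);

function* subArrays(lines) {
  for (const line of lines) {
    // e.g. 'hello world' -> [848, 8993]; unknown words drop out via filter(Boolean)
    yield line.split(' ').map(w => dictionary.get(w)).filter(Boolean);
  }
}

// consumed lazily, one sub-array at a time:
for (const arr of subArrays(['hello world', 'hello foo'])) {
  console.log(arr); // [848, 8993], then [848]
}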
I haven't done the complexity and memory calculations, but I'm not far from the goal on a local machine (I get through about 1/3 of the data before running out of memory).
I tried the following:
[Int32Array.from([1115, 585]), ...]
but it didn't help. Maybe typed arrays are not the best choice when you have lots of small arrays.
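The reason I suspect they don't help: each tiny Int32Array carries its own object and backing-buffer overhead, which dominates when the payload is only two or three ints. A rough way to compare, if anyone wants to reproduce it (a sketch, not my real benchmark; run with node --expose-gc, and note that typed-array backing stores may be counted under external/arrayBuffers rather than heapUsed):

function measure(label, build) {
  global.gc?.();
  const before = process.memoryUsage().heapUsed;
  const keep = build(); // keep a reference so the GC can't reclaim it
  global.gc?.();
  const after = process.memoryUsage().heapUsed;
  console.log(label, ((after - before) / 1048576).toFixed(1), 'MB');
  return keep;
}

const N = 1_000_000;
const a = measure('plain arrays    ', () => Array.from({ length: N }, () => [1115, 585]));
const b = measure('tiny Int32Arrays', () => Array.from({ length: N }, () => Int32Array.from([1115, 585])));
console.log(a.length + b.length, 'arrays kept alive');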
Processing time doesn't matter a lot, as the algorithm runs pretty fast for now.
There might be a lot of repeated values, so I could build a dictionary, but unfortunately the remapped values would collide with real ones. Say, for example:
1115 -> 1, 585 -> 2
but 1 and 2 might already be taken as real values fed to the tf-idf algorithm.
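Concretely, the clash I mean looks like this (toy values; remap is just for illustration):

const remap = new Map([[1115, 1], [585, 2]]); // compress the frequent ids
const body = [1115, 585, 1, 2];               // ...but 1 and 2 also exist as real ids
const compressed = body.map(v => remap.get(v) ?? v);
console.log(compressed); // [1, 2, 1, 2] -> remapped 1115/585 and real 1/2 are now indistinguishable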
Finally I'm running on Node.
Edit:
let co = 0;
let doc;
for (let index = 0; index < tmpRaw.length; index++) {
  // nex is one line of the raw text, already tokenized, e.g. ['hello', 'world'] for "hello world"
  const nex = tmpRaw[index];
  try {
    // indexStrGetter maps each token to its id using a prior dictionary,
    // so "hello world" becomes [848, 8993]; unknown tokens come back falsy
    doc = { id: co++, body: nex.map(indexStrGetter).filter(Boolean) };
    // the typed-array variant behaves nearly the same (memory still gets exhausted):
    // doc = { id: co++, body: Int32Array.from(nex.map(indexStrGetter)).filter(num => num > 0) };
    if (!doc.body.length) {
      co--;
      continue;
    }
    // bm is the BM tf-idf implementation (modified a little):
    // https://gist.github.com/alixaxel/4858240e0ca20802af43
    bm.addDocument(doc);
  } catch (error) {
    console.log(error);
    console.log(nex);
  }
}
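For reference, a cheap way to watch where the memory goes while documents are added would be to log process.memoryUsage() every few thousand documents (a sketch of instrumentation for the loop above, not something already in my code):

// inside the loop above, right after bm.addDocument(doc):
if (co % 10000 === 0) {
  const { heapUsed, external } = process.memoryUsage();
  console.log(`${co} docs, heap ${(heapUsed / 1048576).toFixed(0)} MB, external ${(external / 1048576).toFixed(0)} MB`);
}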