
This must be a very common problem, and it should have a standard solution:

What is the correct way to incrementally save feature vectors extracted from data, rather than accumulating the vectors for the entire dataset and then saving them all at once?

In more detail:

I have written a script for extracting custom text features (e.g. next_token, prefix-3, is_number) from text documents. After extraction is done, I end up with one big list of scipy sparse vectors. Finally, I pickle that list so it is space-efficient to store and fast to load when I want to train a model. The problem is that I am limited by my RAM: I can make that list of vectors only so big before it, or the pickling process, exceeds my RAM.

Of course, incrementally appending string representations of these vectors would be possible: accumulate k vectors, append them to a text file, and clear the list for the next k vectors. But storing the vectors as strings would be space-inefficient and would require parsing the representations when loading, so that does not sound like a good solution. I could also pickle sets of k vectors and end up with a whole bunch of pickle files of k vectors each, but that sounds messy.
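For what it's worth, the chunked-pickle idea does not actually require many files: pickle allows multiple consecutive dumps into a single stream, and you can read the records back one at a time until the file is exhausted. A minimal sketch (the batch size `k` and the helper names `dump_batches`/`load_batches` are made up for illustration; the vectors can be any picklable objects, including scipy sparse vectors):

```python
import pickle

def dump_batches(vectors, path, k=1000):
    """Write vectors to one file as consecutive pickle records of k vectors each."""
    with open(path, "wb") as f:
        batch = []
        for v in vectors:
            batch.append(v)
            if len(batch) == k:
                pickle.dump(batch, f)  # one pickle record per batch
                batch = []
        if batch:  # flush the final, possibly shorter batch
            pickle.dump(batch, f)

def load_batches(path):
    """Yield vectors one batch at a time, so only k vectors are in RAM at once."""
    with open(path, "rb") as f:
        while True:
            try:
                yield from pickle.load(f)
            except EOFError:  # no more records in the stream
                return
```

During extraction you would call `pickle.dump` on each batch as it fills up (instead of collecting everything first), and during training iterate over `load_batches(path)` as a generator, which keeps peak memory bounded by one batch. This works the same way in Python 2 and Python 3.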

So this must be a standard problem with a more elegant solution. What is the right way to solve it? Is there perhaps some existing functionality in scikit-learn for this kind of thing that I have overlooked?

I found this: How to load one line at a time from a pickle file?

But it does not work with Python 3.

lo tolmencre
  • If your vectors do indeed contain text, writing the string representation is actually efficient. That way you aren't writing any unnecessary metadata, and you have the added bonus that your dump is human-readable – Mad Physicist Jul 12 '18 at 13:49
  • No, the vectors are converted to float vectors – lo tolmencre Jul 12 '18 at 13:58
  • Even better. If you have a consistent known type, you can omit type metadata. – Mad Physicist Jul 12 '18 at 15:00
  • I don't know too much about it, but I believe you can use hdf5 for this. See [this post](https://stackoverflow.com/questions/25655588/incremental-writes-to-hdf5-with-h5py) for a starting point – piman314 Jul 12 '18 at 15:17
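Building on the h5py suggestion in the last comment: HDF5 datasets can be created with an unlimited `maxshape` and grown with `resize()`, so feature rows can be appended chunk by chunk without the full matrix ever being in RAM. A sketch under the assumption that the vectors are dense rows of fixed dimensionality `dim` (a sparse matrix would instead be stored via its `data`/`indices`/`indptr` arrays); the function name `append_rows` and the dataset name `"features"` are illustrative:

```python
import h5py
import numpy as np

def append_rows(path, rows, dim):
    """Append a 2-D block of feature rows to a growable HDF5 dataset."""
    with h5py.File(path, "a") as f:  # create the file on first call, reopen after
        if "features" not in f:
            f.create_dataset("features", shape=(0, dim),
                             maxshape=(None, dim), chunks=True)
        ds = f["features"]
        n = ds.shape[0]
        ds.resize(n + len(rows), axis=0)  # grow along the row axis
        ds[n:] = rows                     # write only the new block
```

Loading is equally incremental, since slicing an h5py dataset (e.g. `f["features"][i:i + k]`) reads only the requested rows from disk.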

0 Answers