6

I have a dictionary with many entries and a huge vector as each value. These vectors can be 60,000 dimensions large, and I have about 60,000 entries in the dictionary. To save time, I want to store this data after computation. However, using a pickle led to a huge file. I have also tried storing to JSON, but the file remains extremely large (around 10.5 MB for a sample of 50 entries with fewer dimensions). I have also read about sparse matrices. As most entries will be 0, this is a possibility. Will this reduce the file size? Is there any other way to store this information? Or am I just unlucky?

Update:

Thank you all for the replies. I want to store this data because these are word counts. For example, given a set of sentences, I store the number of times word 0 (at location 0 in the array) appears in each sentence. There are obviously more words across all sentences than appear in any one sentence, hence the many zeros. I then want to use this array to train at least three, maybe six, classifiers. It seemed easier to create the arrays with word counts first and then run the classifiers overnight to train and test. I use sklearn for this. This format was chosen to be consistent with other feature vector formats, which is why I am approaching the problem this way. If this is not the way to go in this case, please let me know. I am very much aware that I have much to learn about coding efficiently!
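To make this concrete, here is a minimal sketch of the kind of pipeline I mean. The sentences and file name are just placeholders, and it uses scikit-learn's CountVectorizer rather than my hand-built arrays; CountVectorizer produces the same kind of word-count matrix, but as a SciPy sparse matrix:

```python
# Minimal sketch: word-count features via scikit-learn.
# CountVectorizer returns a SciPy sparse matrix, so zero counts
# are never stored explicitly.
import pickle
from sklearn.feature_extraction.text import CountVectorizer

sentences = ["the cat sat on the mat", "the dog barked"]  # placeholder data

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(sentences)  # shape: (n_sentences, n_distinct_words)

# Pickling the sparse matrix serializes only the non-zero entries.
with open("word_counts.pkl", "wb") as f:
    pickle.dump((vectorizer, counts), f)
```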

I also started implementing sparse matrices. The file is even bigger now (testing with a sample set of 300 sentences).

Update 2: Thank you all for the tips. John Mee was right by not needing to store the data. Both he and Mike McKerns told me to use sparse matrices, which sped up calculation significantly! So thank you for your input. Now I have a new tool in my arsenal!

Cassie
  • How many non-zero entries are there and how big is each non-zero entry? – Simeon Visser Jan 21 '15 at 00:03
  • 1
    A meg, a gig, size is relative... how does the problem manifest itself? I.e. what's the actual problem? Or are you just thinking ahead? – John Mee Jan 21 '15 at 00:18
  • The number of non-zero entries varies from 1 to about 35. The entry values are integers; most are 1-5, some may be about 10, but that is an extreme case, if it ever appears. – Cassie Jan 21 '15 at 00:19
  • @JohnMee I tried this with a pickle. It ran on a server and the file is 22 GB. (Which I hoped was me misreading the file size, but alas.) – Cassie Jan 21 '15 at 00:22
  • Are you saying the data you need to store is at most 35 entries with each entry having at most 10 integers? In that case, yes, I'd discard all the zero entries and store only the non-zero data. – Simeon Visser Jan 21 '15 at 00:35
  • 1
    If all your data points are positive integers less than 10, you could theoretically store each in 4 bits. It's just a matter of how much time you want to spend processing the data to shrink it down. Formats like pickle and JSON are for convenience and portability, not for minimal file size. Maybe if you gave us more details about what you're doing it would help us figure out the best solution. Why are you storing this data? How often will you access it or modify it? – Fred S Jan 21 '15 at 00:39
  • Well, if there is much redundant data, gzipping the file before or after writing it can save a lot of space. It's not the best compression, but it is blazingly fast; see the sketch after these comments. – ThorSummoner Jan 21 '15 at 00:44
  • Try storing them in an HDF5 file: http://deepdish.io/2014/11/11/python-dictionary-to-hdf5/ – user3378649 Jan 21 '15 at 00:53
  • `to save time I want to...` Are you sure that stashing all this data is really going to save time? I'm thinking that this dataset is not your end result but a mere stepping stone to a greater goal, i.e. a snapshot. If so, ask yourself whether this is the best point in your pipeline to create a snapshot. Is there a less burdensome point just a little further up or down the pipe? – John Mee Jan 21 '15 at 06:18
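A minimal sketch of the gzip suggestion from the comments (the file name and data are placeholders); Python's gzip module can wrap the pickle stream directly:

```python
# Sketch of gzip-compressed pickling, as suggested in the comments above.
# File name and data are placeholders.
import gzip
import pickle

data = {0: [0] * 60000, 1: [0] * 60000}  # stand-in for the real dictionary

# Write the pickle through a gzip stream; mostly-zero data compresses very well.
with gzip.open("word_counts.pkl.gz", "wb") as f:
    pickle.dump(data, f)

# Read it back the same way.
with gzip.open("word_counts.pkl.gz", "rb") as f:
    data = pickle.load(f)
```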

2 Answers

3

If you are OK with pickling to several files instead of a single file, see my answer to a very closely related question: https://stackoverflow.com/a/25244747/2379433.

Also see: https://stackoverflow.com/a/21948720/2379433 for other potential improvements, and here too: https://stackoverflow.com/a/24471659/2379433.

If you are using numpy arrays, this can be very efficient, as both klepto and joblib understand how to use a minimal state representation for an array. If most elements of your arrays are indeed zeros, then by all means convert to sparse matrices... you will see huge savings in the storage size of the array.
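For example, here is a rough sketch of that conversion (the shapes and counts are made up); pickling the sparse matrix serializes only the non-zero entries:

```python
# Rough sketch: convert a mostly-zero dense array to a sparse matrix
# before serializing, so only the non-zero entries are stored.
# Shapes and counts below are made up.
import pickle
import numpy as np
from scipy import sparse

n_docs, n_words = 300, 60000
dense = np.zeros((n_docs, n_words), dtype=np.uint8)
dense[0, 42] = 3   # a few example word counts
dense[1, 7] = 1

sparse_counts = sparse.csr_matrix(dense)   # keeps only the non-zero entries

with open("counts_sparse.pkl", "wb") as f:
    pickle.dump(sparse_counts, f)
```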

As the links above discuss, you could use klepto, which gives you the ability to easily store dictionaries to disk or to a database, using a common API. klepto also lets you pick a storage format (pickle, json, etc.), with HDF5 coming soon. It can use both specialized pickle formats (like numpy's) and compression (if you care about size rather than speed).

klepto gives you the option to store the dictionary as a single "all-in-one" file or as one file per entry, and it can also leverage multiprocessing or multithreading, meaning you can save and load dictionary items to/from the backend in parallel.
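A minimal sketch of what that looks like with klepto's dir_archive (the archive name and entries are placeholders, and keyword arguments may vary slightly between klepto versions):

```python
# Minimal sketch of klepto's dir_archive: one file per dictionary entry,
# with an in-memory cache in front of the on-disk archive.
# Archive name and entries are placeholders.
import numpy as np
from klepto.archives import dir_archive

db = dir_archive('word_counts', cached=True, serialized=True)

db['sentence_0'] = np.zeros(60000, dtype=np.uint8)  # example entries
db['sentence_1'] = np.zeros(60000, dtype=np.uint8)

db.dump()   # push the cached entries to disk

# Later, or in another process: reload the archive contents.
db2 = dir_archive('word_counts', cached=True, serialized=True)
db2.load()
```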

Mike McKerns
0

By 60,000 dimensions, do you mean 60,000 elements? If that is the case and the numbers are in the range 1..10, then a reasonably compact but still efficient approach is to use a dictionary of Python array.array objects with 1 byte per element (type 'B').

The size in memory should be about 60,000 entries × 60,000 bytes, i.e. roughly 3.6 GB (about 3.35 GiB) of data.

That data structure also pickles to disk at about the same size.
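A quick sketch of that structure (keys, sizes, and counts are illustrative):

```python
# Sketch of a dictionary of fixed-size byte arrays: one unsigned byte
# ('B') per word count. Keys, sizes, and counts are illustrative.
import pickle
from array import array

n_words = 60000

counts = {}
counts[0] = array('B', [0] * n_words)   # one vector per dictionary entry
counts[0][42] = 3                       # word 42 appears 3 times

with open("counts_bytes.pkl", "wb") as f:
    pickle.dump(counts, f, protocol=pickle.HIGHEST_PROTOCOL)
```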

6502
  • The OP says 22GB with pickle. – Mike McKerns Jan 21 '15 at 02:48
  • @MikeMcKerns: Then s/he's using a different data structure. A dict with 60k entries, integer keys, and 60k-element unsigned byte arrays as values takes exactly 3,602,039,163 bytes (3.4 GB) and loads into memory in a few seconds on a reasonably modern PC. – 6502 Jan 21 '15 at 07:39
  • from the OP's update it sounds like that may indeed be the case. I also see the OP calls the structures "vectors" not "arrays", so they may be support `vector` objects from `scikit.learn`… – Mike McKerns Jan 21 '15 at 12:56