
The dataset looks like this:

39861    // number of documents
28102    // number of words in the vocabulary (listed in another file)
3710420  // number of nonzero counts in the bag-of-words
1 118 1  // document_id index_in_vocabulary count
1 285 3
...
2 46 1
...
39861 27196 5

We are advised not to store that in a matrix (which would be of size 39861 x 28102, documents by vocabulary words), since it won't fit in memory*. From here I can assume that every integer needs 24 bytes to be stored, so a dense matrix would take 39861 * 28102 * 24 bytes ≈ 27 GB. So, which data structure should I use to store the dataset?
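
For concreteness, here is a minimal sketch of one option I am considering (assuming Python with NumPy/SciPy; the file name docword.txt is hypothetical): read the file once and keep only the 3710420 nonzeros in a sparse CSR matrix.

```python
import numpy as np
from scipy.sparse import csr_matrix

def load_bow(path):
    """Stream the bag-of-words file once into a CSR matrix.

    Memory is proportional to the number of nonzeros,
    not to n_docs * n_words.
    """
    with open(path) as f:
        n_docs = int(f.readline())
        n_words = int(f.readline())
        nnz = int(f.readline())
        rows = np.empty(nnz, dtype=np.int32)
        cols = np.empty(nnz, dtype=np.int32)
        vals = np.empty(nnz, dtype=np.int32)
        for i, line in enumerate(f):
            d, w, c = map(int, line.split())
            rows[i], cols[i], vals[i] = d - 1, w - 1, c  # file is 1-indexed
    return csr_matrix((vals, (rows, cols)), shape=(n_docs, n_words))

X = load_bow('docword.txt')
```

With int32 entries the three arrays take about 3 * 3710420 * 4 bytes ≈ 45 MB, which fits even in my laptop's 3.8 GiB.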


An array of lists?

  • If so (every list will have nodes with two data members: the index_in_vocabulary and the count), just post a positive answer; see the sketch after this list. If I assume that every document has on average 200 words, then the space would be:

no_of_documents * words_per_doc * no_of_datamembers * 24 bytes = 39861 * 200 * 2 * 24 ≈ 0.4 GB

  • If not, which one would you propose (one that would require less space)?
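
Here is what that structure would look like in the same setting (a sketch; note that the 24-bytes-per-integer figure is optimistic for CPython, where tuples and ints carry extra per-object overhead, so the real footprint would be somewhat above 0.4 GB):

```python
def load_as_array_of_lists(path):
    """Each document becomes a list of (index_in_vocabulary, count)
    pairs -- the array-of-lists structure proposed above.
    """
    with open(path) as f:
        n_docs = int(f.readline())
        f.readline()  # vocabulary size, unused here
        f.readline()  # number of nonzeros, unused here
        docs = [[] for _ in range(n_docs)]
        for line in f:
            d, w, c = map(int, line.split())
            docs[d - 1].append((w, c))
    return docs
```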

After storing the dataset, we are required to find the k-nearest neighbors (the k most similar documents), both with brute force and with LSH.
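
For the brute-force part, this is roughly what I have in mind, assuming the CSR matrix from the first sketch and cosine similarity (the similarity measure is my own assumption, since the assignment does not fix one); the LSH part is omitted:

```python
import numpy as np

def knn_brute_force(X, query_idx, k):
    """Return the indices of the k documents most similar to
    X[query_idx] under cosine similarity, by comparing the query
    against every document (brute force).
    """
    sims = X.dot(X[query_idx].T).toarray().ravel().astype(np.float64)
    norms = np.sqrt(X.multiply(X).sum(axis=1)).A.ravel()
    sims /= norms * norms[query_idx] + 1e-12  # guard against zero norms
    sims[query_idx] = -np.inf                 # exclude the query itself
    return np.argsort(sims)[::-1][:k]
```

This compares the query against all 39861 documents, which is exactly the cost that LSH is supposed to cut down.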


*I have 3.8 GiB of RAM on my personal laptop, but I have access to a desktop with ~8 GB of RAM.

  • Who advised you not to store the data in a matrix? Maybe we should just ask them. – Robᵩ Apr 29 '16 at 17:02
  • The professor, @Robᵩ; I updated. The reason is that it won't fit in memory. The course however doesn't have any TA (!?!?), so we are on our own, literally. – gsamaras Apr 29 '16 at 17:04
  • After storing the data, what next? Will you have to retrieve the data? Search in it? Perform graph operations? – Robᵩ Apr 29 '16 at 17:06
  • @Robᵩ, I updated my question. We are required to find k similar documents. What do you say about the array of lists I am proposing? – gsamaras Apr 30 '16 at 16:28
  • Consider using the HDF5 format. It should significantly reduce the size of your file. See [my answer to a similar question](https://stackoverflow.com/a/36899678/346478). – Jan Vlcinsky Apr 29 '16 at 17:13
  • Jan, good idea bringing the HDF5 format into play, +1. However, I recall using that format only for storing to files; when loading the dataset into memory, we would use our own data structure, in C++ back then. – gsamaras Apr 29 '16 at 17:16
  • I am not sure I understand your comment. If you store the data in an HDF5 file, you can take advantage of the fact that the format is well defined and that there are APIs for various programming languages (Python, C++, Java, Fortran and many others). – Jan Vlcinsky Apr 29 '16 at 17:20
  • Hmm, I see, Jan. However, your answer doesn't say why I shouldn't use the data structure I propose. – gsamaras Apr 30 '16 at 16:29
  • @gsamaras Your question states your worry about the total size of the data being stored. If you can manage the data in the structure you propose, go for it. I proposed HDF5 because I know it is very space efficient and it even allows various queries over the file. – Jan Vlcinsky Apr 30 '16 at 20:17
  • I just asked, Jan, because I sensed that you saw a fault with my structure. Thank you. – gsamaras Apr 30 '16 at 22:44

0 Answers