I have 40.000 documents, 93.08 words per doc. on avg., where every word is a number (which can index a dictionary) and every word has a count (frequency). Read more here.
I am between two data structures to store the data and was wondering which one I should choose, which one the Python people would choose!
Triple-list:
A list, where every node:
__ is a list, where every node:
__.... is a list of two values; word_id
and count
.
Double-dictionary:
A dictionary, with keys the doc_id
and values dictionaries.
That value dictionary would have a word_id
as a key and the count
as a value.
I feel that the first will require less space (since it doesn't store the doc_id
), while the second will be more easy to handle and access. I mean, accessing the i-element in the list is O(n), while it is constant in the dictionary, I think. Which one should I choose?