0

I have 40.000 documents, 93.08 words per doc. on avg., where every word is a number (which can index a dictionary) and every word has a count (frequency). Read more here.

I am between two data structures to store the data and was wondering which one I should choose, which one the Python people would choose!

Triple-list:

A list, where every node:

__ is a list, where every node:

__.... is a list of two values; word_id and count.

Double-dictionary:

A dictionary, with keys the doc_id and values dictionaries.

That value dictionary would have a word_id as a key and the count as a value.


I feel that the first will require less space (since it doesn't store the doc_id), while the second will be more easy to handle and access. I mean, accessing the i-element in the list is O(n), while it is constant in the dictionary, I think. Which one should I choose?

Community
  • 1
  • 1
gsamaras
  • 71,951
  • 46
  • 188
  • 305
  • 2
    Accessing the ith element of a list is O(1). – Alex Hall May 01 '16 at 15:40
  • 1
    @AlexHall really? Can you point me to some "evidence" about it? In [tag:C] it's O(n). – gsamaras May 01 '16 at 15:42
  • @gsamaras Accessing the ith element in a list in nearly every programming language, C included, is O(1). Think about it this way: an array is stored as a contiguous block of memory. I'm not too sure how the internal Python implementation works (because Python lists are inherently generic), but the argument still holds - if you have an array of size N, then the i-th element is located at `(address for beginning of array in memory) + (i * size of element)`. – Rushy Panchal May 01 '16 at 15:43
  • 2
    A simple linked list is O(n) @RushyPanchal. – gsamaras May 01 '16 at 15:44
  • 2
    Accessing an index is indeed O(1) but FINDING an element in a list with an unknown index is O(n) – Mate Hegedus May 01 '16 at 15:44
  • 2
    Python lists are not linked lists, they are like Java ArrayLists or C++ vectors. They're dynamically growing arrays which you can randomly access by calculating an offset. But yes, in many languages the word list by default means linked list. – Alex Hall May 01 '16 at 15:45
  • 1
    Oh yeah, I was meaning what @MateHegedus said! Alex Hall, that clears things up! ;) – gsamaras May 01 '16 at 15:45
  • @gsamaras A list is vastly different from a linked list. Searching a list, of course, is naturally O(n) unless you know something about the contents and ordering. – Rushy Panchal May 01 '16 at 15:46
  • 2
    Here's a good reference for time complexity of Python lists: https://wiki.python.org/moin/TimeComplexity – ChrisGPT was on strike May 01 '16 at 15:46
  • @Chris that's what I was searching for, thank you! – gsamaras May 01 '16 at 15:47
  • Also, accessing the ith element of a C array is O(1) because `a[i]` is functionally equivalent to `*(a + i)`, where `a` is a pointer to the start of an array (which is how arrays are represented) and `a + i` is just pointer arithmetic. – Rushy Panchal May 01 '16 at 15:47

3 Answers3

3

You should use a dictionary. It will make handling your code easier to understand and to program and it will have a lower complexity as well.

The only reason you would use a list, is if you cared about the order of the documents.

Mate Hegedus
  • 2,887
  • 1
  • 19
  • 30
  • But it would require more space, right? I do not care about the order. ;) – gsamaras May 01 '16 at 15:43
  • @gsamaras As a factor of N, it would not. – Mate Hegedus May 01 '16 at 15:48
  • @gsamaras The only additional space is for the keys, but Python strings are only stored "once" in memory per string – that is, if you have the same set of keys for each element, then it's a constant additional space (which is negligible). – Rushy Panchal May 01 '16 at 15:49
  • @RushyPanchal which strings? `doc_id`, `word_id` and `count` are numbers, integers to be exact. Mate, in big o notation of course, but in practice? – gsamaras May 01 '16 at 15:50
  • @gsamaras the keys for a dictionary, unless you are storing them as something else. But if you're using a dictionary for clarity, then it seems intuitive to also use strings for keys. – Rushy Panchal May 01 '16 at 15:57
2

If you don't care about the order of the items you should definitely use a dictionary because dictionaries are used to group associated data while lists are generally used to group more generic items.

Moreover lookups in dictionaries are faster than that of a list.

Lookups in lists are O(n) while lookups in dictionaries are O(1). though lists are considerably larger in Memory than lists

danidee
  • 9,298
  • 2
  • 35
  • 55
1

Essentially you just want to store a large amount of numbers, for which the most space efficient choice is an array. These are one-dimensional so you could write a class which takes in three indices (the last being 0 for word_id and 1 for count) and does some basic addition and multiplication to find the correct 1D index.

Alex Hall
  • 34,833
  • 5
  • 57
  • 89
  • Why this got a downvote? I will upvote, since the idea doesn't seem bad and the downvoter didn't justify his/her action. – gsamaras May 01 '16 at 16:15