I want to count the number of distinct values. My naive solution is to keep a set, updating it as I iterate, and then take the len of the set as my answer.

However, this is not feasible when the dataset is large. I also have to count more than one kind of distinct value per iteration, which means keeping several sets at once.

I am wondering: is there a better way to do this? Maybe some other built-in data structure can help? Thanks!
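A minimal sketch of the set-based approach described above (`count_distinct` and `stream` are illustrative names):

def count_distinct(stream):
    # Remember every value seen so far; the set's length is the
    # number of distinct values once the iteration finishes.
    seen = set()
    for value in stream:
        seen.add(value)
    return len(seen)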

Tom Leung
  • If you're OK with O(N log N) time (i.e., sorting the list first), you can do this with constant memory (see the sketch after these comments). – cs95 Jun 10 '19 at 04:26
  • @cs95 timsort is O(n) space complexity though, so it does not quite solve the issue. We would need to use Quicksort. – Olivier Melançon Jun 10 '19 at 04:31
  • @OlivierMelançon aaaah... fair point, some quick research says the memory used by timsort could be as much as N // 2 here (which is still linear, and a challenge for OP). – cs95 Jun 10 '19 at 04:34
  • @cs95 Sorry, I don't quite understand your solution. My situation is: I have to iterate over a document set to calculate some properties of the words in the current document. These properties include distinct value counts (e.g. the number of distinct documents citing the current document), so I maintain a `set` for each such property and take the length of the set at the end. How can I apply your solution to this problem? Thanks! – Tom Leung Jun 10 '19 at 04:57
  • As long as your set fits into memory, this will probably be your best solution. Most specialized solutions deal with the this-dataset-is-larger-than-memory problem. – Radosław Cybulski Jun 10 '19 at 06:41
  • Where is the dataset exactly, do you load it in a list/an array? Is it in a database? – Olivier Melançon Jun 10 '19 at 11:43
  • Finally I decided to buy more memory for my server XD, and thank all of you for your precious suggestions!! – Tom Leung Jun 11 '19 at 07:04
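A minimal sketch of the sort-first idea from cs95's comment, assuming the data is in a list that may be sorted in place (function and variable names are illustrative):

def count_distinct_sorted(values):
    # Sort in place (O(N log N)); equal values become adjacent,
    # so one pass with a single sentinel counts the distinct ones
    # in O(1) extra memory.
    values.sort()
    count = 0
    previous = object()  # sentinel that compares unequal to any value
    for value in values:
        if value != previous:
            count += 1
            previous = value
    return count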

1 Answer


Use a Trie. There are several Python libraries, such as Marisa-trie, or see this Stack Overflow answer to roll your own: How to create a TRIE in Python. Increment a counter each time a new word is added to the Trie.
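For instance, a brief sketch with the marisa-trie package (assuming it is installed, e.g. via pip install marisa-trie): marisa_trie.Trie builds a static trie from an iterable of strings and collapses duplicates, so its length is the distinct count.

import marisa_trie

words = 'one two one three four'.split()
trie = marisa_trie.Trie(words)

len(trie)
>>> 4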

Here's a simple nested dictionary implementation. It keeps track of the number of distinct words and the occurrence count of each word.

END = 'end'  # sentinel key marking where a word ends inside a node

class Trie:

    def __init__(self, words_iterable):
        self.root = {}  # nested dicts: one level per letter
        self.size = 0   # number of distinct words inserted

        for word in words_iterable:
            self.insert(word)

    def insert(self, word):
        # Walk down the trie, creating child nodes as needed.
        current_dict = self.root
        for letter in word:
            current_dict = current_dict.setdefault(letter, {})

        # First arrival at this word's end node means a new distinct word.
        if END not in current_dict:
            current_dict[END] = 0
            self.size += 1
        current_dict[END] += 1

    def count(self, word):
        # Plain lookups (no setdefault) so that querying a missing
        # word does not insert empty nodes into the trie.
        current_dict = self.root
        for letter in word:
            if letter not in current_dict:
                return 0
            current_dict = current_dict[letter]
        return current_dict.get(END, 0)

    def __len__(self):
        return self.size

    def __str__(self):
        return str(self.root)

Examples:

trie = Trie('one two one three four'.split())
trie.insert('four')

print(trie)

>>> {'o': {'n': {'e': {'end': 2}}}, 't': {'w': {'o': {'end': 1}}, 'h': {'r':
    {'e': {'e': {'end': 1}}}}}, 'f': {'o': {'u': {'r': {'end': 2}}}}}

len(trie)
>>> 4

trie.count('four')
>>> 2
RootTwo
  • This is most likely going to be a lot *more* expensive than the straightforward set solution, due to all the overhead of the extra dicts and bytecode interpretation. – user2357112 Jun 10 '19 at 06:36
  • How is a dictionary of dictionaries better than a single set? This seems to solve a different problem. – Radosław Cybulski Jun 10 '19 at 06:40
  • This is definitely worth a try, but it would depend on the OP's dataset. There is a big overhead, but due to collisions between words it is possible that the size grows more and more slowly over time. – Olivier Melançon Jun 10 '19 at 11:41
  • I agree a simple Python nested dictionary implementation may not be ideal. If OP needs better space/time efficiency, the Marisa-trie link above is to a Cython implementation, and there is a C++ implementation with Python bindings. The Marisa-trie link says it could save 50x-100x space over a regular dictionary (set), but it depends on the data set. – RootTwo Jun 10 '19 at 13:38
  • @RadosławCybulski just like a set, the Trie stores only 1 copy of each string. It may save space over a set, because each prefix of the strings is only stored once. In the example, the 't' in 'two' and 'three' is only stored once. – RootTwo Jun 10 '19 at 14:23
  • Ahh, ok. You're reducing the memory footprint for storing keys. – Radosław Cybulski Jun 10 '19 at 15:20
  • @RootTwo Thank you! If I only want the length of the Trie and am not interested in the occurrence counts of specific elements, can I save more memory? – Tom Leung Jun 11 '19 at 07:07
  • @TomLeung that would save a little memory (a sketch follows these comments). Here is a post that talks about using a Marisa-trie library to reduce storage from 82MB to less than 1MB for a corpus of news documents (https://blog.scrapinghub.com/2014/03/26/optimizing-memory-usage-of-scikit-learn-models-using-succinct-tries). – RootTwo Jun 12 '19 at 18:19
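A sketch of the variant from the last two comments, as a possible replacement for `Trie.insert` above: mark word ends with a bare sentinel instead of a counter, saving one integer per distinct word (`count()` would no longer apply).

    def insert(self, word):
        current_dict = self.root
        for letter in word:
            current_dict = current_dict.setdefault(letter, {})
        # Presence of END alone records the word; no per-word counter.
        if END not in current_dict:
            current_dict[END] = None
            self.size += 1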