1

Suppose I have the following class in Python:

class Word:
    def __init__(self, _lemma, _frequency):
        self.lemma = str(_lemma)
        self.frequency = int(_frequency)

Now I want to create a collection of Word objects that applies the following logic when a Word object word1 is added to the collection:

  • if the collection contains a Word object word where word.lemma == word1.lemma, then word.frequency = word.frequency + word1.frequency
  • else add word1 to the collection

How can I do it?


Previously I used a list, checking whether it already contains a Word object with the same lemma as word1.lemma. But each lookup scans the whole list, so adding n words to the collection has O(n^2) complexity.

from Word import Word

class Corpus:

    def __init__(self, _name, _total_count):
        self.name = str(_name)
        self.total_count = int(_total_count)
        self.words = []

    def add(self, _word):

        find_word = [index for index, word in enumerate(self.words) if word.lemma == _word.lemma]  # O(n)
        if len(find_word) == 0:
            self.words.append(Word(_word.lemma, _word.frequency))
        else:
            self.words[find_word[0]].frequency = self.words[find_word[0]].frequency + _word.frequency
Shamsul Arefin

2 Answers

3

You could do it easily by using a dictionary instead of a list (that is, self.words = {} in __init__), with word.lemma as the key:

def add(self, _word):
    if _word.lemma not in self.words:
        self.words[_word.lemma] = _word
    else:
        self.words[_word.lemma].frequency += _word.frequency

One drawback is that it duplicates the lemma information: once as the dictionary key and once inside the Word object.
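Here is a self-contained sketch of the dictionary-based add, under a few assumptions: Corpus starts with an empty dict in self.words (the snippet above does not show this), the constructor is reduced to a name, and the sample lemmas are made up for illustration.

```python
class Word:
    def __init__(self, lemma, frequency):
        self.lemma = str(lemma)
        self.frequency = int(frequency)


class Corpus:
    def __init__(self, name):
        self.name = str(name)
        self.words = {}  # assumption: maps lemma -> Word

    def add(self, word):
        existing = self.words.get(word.lemma)
        if existing is None:
            # Store a copy so later updates do not mutate the caller's object
            self.words[word.lemma] = Word(word.lemma, word.frequency)
        else:
            # Average O(1) lookup instead of scanning a list
            existing.frequency += word.frequency


corpus = Corpus("demo")
for lemma, freq in [("run", 3), ("walk", 1), ("run", 2)]:
    corpus.add(Word(lemma, freq))

print(corpus.words["run"].frequency)  # 5
```

Each add is a single hash lookup on average, so adding n words is O(n) overall rather than O(n^2).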


If using a Word class is not mandatory, you could use a defaultdict (with a default value of 0) that simply maps each lemma (key) to its frequency (value):

class Corpus:
    def __init__(...):
        ...
        self.words = defaultdict(lambda: 0)

    def add(self, lemma, frequency):
        self.words[lemma] += frequency
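A runnable version of that idea might look like the following. The one-argument constructor and the sample data are assumptions for illustration, and defaultdict(int) is used as an equivalent of defaultdict(lambda: 0), since int() returns 0.

```python
from collections import defaultdict


class Corpus:
    def __init__(self, name):
        self.name = str(name)
        # Missing lemmas start at int() == 0
        self.words = defaultdict(int)

    def add(self, lemma, frequency):
        self.words[lemma] += int(frequency)


corpus = Corpus("demo")
corpus.add("run", 3)
corpus.add("walk", 1)
corpus.add("run", 2)

print(dict(corpus.words))  # {'run': 5, 'walk': 1}
```

If the collection only ever counts things, collections.Counter behaves similarly and is another reasonable choice.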
Tryph
2

Your wording may confuse community members who are familiar with Python. I think you're using the term "dictionary" as part of your domain model and not as the Python data structure.

If you really need both the Word and Corpus classes, you should go with code like this:

from collections import defaultdict


class Word:

    def __init__(self, lemma: str, frequency: int):
        self.lemma = lemma
        self.frequency = frequency

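    def __eq__(self, other):
        # Two words are equal when their lemmas match
        if not isinstance(other, Word):
            return NotImplemented
        return self.lemma == other.lemma

    def __hash__(self):
        # Must agree with __eq__, so hash the lemma only
        return hash(self.lemma)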


class Corpus:

    def __init__(self):
        # defaultdict needs a callable factory; int() returns 0
        self.words = defaultdict(int)

    def add(self, word: Word):
        self.words[word] += word.frequency

Key points are:

  1. Usage of type hints
  2. How dict lookup (e.g. 'b' in {'a': 23, 'b': 24}) works; see "When does __eq__ get called using hash()?"
  3. defaultdict usage
  4. __eq__ and __hash__ usage
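To make points 2 and 4 concrete, here is a small sketch of how two distinct Word objects with the same lemma land on the same dict key. The sample values are made up; the Word class mirrors the one above, with an isinstance guard added to __eq__.

```python
class Word:
    def __init__(self, lemma, frequency):
        self.lemma = lemma
        self.frequency = frequency

    def __eq__(self, other):
        if not isinstance(other, Word):
            return NotImplemented
        return self.lemma == other.lemma

    def __hash__(self):
        return hash(self.lemma)


a = Word("run", 3)
b = Word("run", 2)

counts = {a: a.frequency}
# b has the same hash as a and compares equal to it, so it hits a's slot
counts[b] = counts[b] + b.frequency

print(len(counts))               # 1
print(counts[a])                 # 5
print(next(iter(counts)) is a)   # True: the first key object is kept
print(a.frequency)               # 3: the key's own field is not updated
```

Note the last two lines: the dict keeps the first key object, and the running total lives in the dict value, so the frequency attribute on the stored Word goes stale.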

I also highly recommend considering whether you really want to store Word instances in Corpus.

Dmitry Belaventsev
  • Yes, I can omit storing `Word` instances in `Corpus`. I can use the approach shown by Tryph: a `defaultdict` (with a default value of 0) that simply maps each lemma (key) to its frequency (value). But can you please explain why you are telling me to reconsider storing `Word` instances in `Corpus`? – Shamsul Arefin Mar 13 '19 at 19:07
  • Storing `Word` instances lets you encapsulate more complex logic behind frequency accumulation (e.g. treating lowercase and uppercase words as equal) and keeps cohesion high. But if you're OK with just summing up word frequencies (case-sensitively), why not store only the string representations of the words? – Dmitry Belaventsev Mar 14 '19 at 05:21
  • tl;dr - store `Word` instances if you plan to make your algorithm more complex; store just the string representation of lemmas if it will 100% remain the same – Dmitry Belaventsev Mar 14 '19 at 05:22
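As an illustration of the "more complex logic" case from the comments, here is a sketch in which Word treats lemmas as equal regardless of case. The _key helper and the use of casefold() are hypothetical choices, not something from the answer.

```python
from collections import defaultdict


class Word:
    def __init__(self, lemma, frequency):
        self.lemma = lemma
        self.frequency = frequency

    def _key(self):
        # Hypothetical normalisation rule: compare lemmas case-insensitively
        return self.lemma.casefold()

    def __eq__(self, other):
        if not isinstance(other, Word):
            return NotImplemented
        return self._key() == other._key()

    def __hash__(self):
        # Hash the same normalised value that __eq__ compares
        return hash(self._key())


class Corpus:
    def __init__(self):
        self.words = defaultdict(int)

    def add(self, word):
        self.words[word] += word.frequency


corpus = Corpus()
corpus.add(Word("Run", 3))
corpus.add(Word("run", 2))

print(corpus.words[Word("RUN", 0)])  # 5
print(len(corpus.words))             # 1
```

Corpus stays unchanged here; only Word knows what "the same word" means, which is the cohesion argument made above.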