-2

I am trying to figure out what the best data structure to use in my code, I have considered dictionaries, list of dictionaries, classes etc but unsure what would be most efficient and fastest to use.

The program I wrote opens multiple text files and selects words based on a certain criteria, I then need to keep a track of unique words selected, the sentences they appear in, the files they appear in and a count of how many times they appear in total throughout the process.

I need to check if each selected word has already been added to the data structure as I iterate through the selected words (it will contain thousands of words).

If it has already been added then add the file it came from to a list as well as the sentence the word sits in and increment the count.

If not already there then add the word to the data structure, file and sentence and initialize count to 1.

I am not really constrained by memory but speed is an important factor so I am thinking that something like a C style trie could work, but not sure what would be the best way to implement that in python.

How would you do it?

Egill
  • 7
  • If speed is important then you should try benchmarking an implementation with each data structure – KyleL Mar 14 '20 at 14:40
  • 1
    I agree with @KyleL, we can make an educated guess, but there’s no substitute for benchmarking. – AMC Mar 14 '20 at 14:57

2 Answers2

1

I think it is better to use lists so that you can add the number of words you have according to your criteria, You can assign each of the word selected to each of the element of the array.Hope this is helpful to you :)

HoldOffHunger
  • 18,769
  • 10
  • 104
  • 133
  • We should avoid opinion based questions [1] and also answers. [1] https://stackoverflow.com/help/closed-questions – Mar Cial R Mar 24 '20 at 07:56
0

A dictionary would be a suitable data structure, using the words as the keys and the word count as the values. Under the hood this is a hash map and gives fast lookup time by taking a hash of the key. To increment the value of an existing word or add a new word you can use the get method with a default value of 0 if the word has not yet appeared.

word_count = {}
for w in ["red", "blue", "red", "green"]:
    word_count[w] = word_count.get(w, 0) + 1
print(word_count)

>>> {"red": 2, "blue": 1, "green": 1}

I think this is the simplest approach and should be fast enough, but if you are interested in other methods this question goes over the basics of a trie implementation.

As mentioned in the comments, it is not clear which method will actually be quicker, the only way to find out is to benchmark. Try running both implementations many times for lots of data values and see which one runs quicker!

KyleL
  • 855
  • 6
  • 24