0

This question is language independent and is more about understanding how to implement the trie or whether tries would be appropriate for what my program is suppose to do. Say I have a string of text like this.

string= "a tale about an ant and an android";

The corresponding trie for "a" looks like this

      a(7)      
     /    \     
    b(1)  n(4)
    /     /   \
  o(1)  t(1)  d(2)
  /              \
 u(1)            r(1)
 /                 \
t(1)               o(1)
                     \
                     i(1)
                       \
                        d(1)

and I want to find the number of occurrences for each word. Although "a" appears 6 times in the text there is only one instance where it is used as a word. The same rule applies for "an" & "and".

I want my final frequency counter to look like this:

a: occurs 1 time not 7 an: 2 and: 1 and so on..

How is it possible for me to record the counts of full words?

I'm working in php trying to process a load of text and have visited this question and it is not what I am looking for. Performance is important but memory efficient is more preferable since I am parsing say a trillion words. Thanks and I appreciate your input.

Community
  • 1
  • 1
Joe Fokker
  • 43
  • 5

2 Answers2

0

I would recommend a ternary trie and then in the third edge you store the word. Then you can implement a word counter in it.

Micromega
  • 12,486
  • 7
  • 35
  • 72
0

You can do it in two ways:

  1. Instead of incrementing a node every time a word passes through, increment only when it ends there

  2. Have a pseudo letter at the end of the word (say blank), which will be incremented only when a word ends there.

ElKamina
  • 7,747
  • 28
  • 43