
I created a unigram language model for a sentence-completion implementation. I have all the words with their occurrence counts.

I'm confused about how to compare them from here. I would think that I have to calculate the probability of each case and take the biggest one.

So if I have 3 candidate words, do I compare the number of occurrences of each word and take the highest? Is this the proper implementation?

Or do I divide the number of occurrences of each word by the number of all (distinct?) words in the training set?

Thank you.

user3450862

1 Answer


If you don't want to use any smoothing (Good-Turing, Kneser-Ney, etc.), take the raw count of each word form and divide it by the total word count of your corpus. This gives you the probability of each word. However, you shouldn't always pick the one with the highest probability, because your generated text would look like:

'the the the the the the the ...'

Instead, you have to pick words at random according to their probabilities, i.e. sample from the distribution.
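
For example, here is a minimal sketch in Python; the counts and variable names are made up for illustration, not taken from the question:

    import random

    # Hypothetical unigram counts collected from a training corpus.
    counts = {"the": 120, "cat": 7, "sat": 3}
    total = sum(counts.values())  # total token count of the corpus, NOT distinct words

    # Maximum-likelihood estimate: P(w) = count(w) / total tokens.
    probs = {word: n / total for word, n in counts.items()}

    # Always taking the argmax degenerates into repeating the most frequent word:
    print(max(probs, key=probs.get))  # -> 'the'

    # Sampling in proportion to the probabilities gives varied output instead.
    words = list(probs)
    weights = [probs[w] for w in words]
    print(random.choices(words, weights=weights, k=10))

Note that the denominator is the total token count of the corpus, not the number of distinct words.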

Btw, you gotta post code if you want suggestions to improve it.

user2390182
  • Thanks. Here's what I did to calculate the probability of each word: float(nbocurrences_mot) / float(word_count). After that, I choose the word with the highest probability to use in my sentence. Since it's a unigram model, I don't care about the other words in the phrase; I just have to choose the right one among the words I have. – user3450862 May 01 '16 at 14:21
  • I have a question about the bigram (and 3-gram) model: do I calculate it the same way, float(nbocurrences_bigram) / float(nbtotal_bigrams)? Or I read somewhere that it's P(a b) = P(a)*P(b) = nb(a)/word_count * nb(b)/word_count? Or also P(wi | w(i-1)) = c(w(i-1), wi) / c(w(i-1)), so in my case P(a|b) = nbocurrences_bigram_ba / nbocurrences(b)? – user3450862 May 01 '16 at 14:21
  • Well, you can just guess the 'right' word (badly, I may add, with a unigram model, as you do not use any context information). But again, you will always choose the same word ('the' for a large enough English corpus) if you pick the one with the highest probability. – user2390182 May 01 '16 at 14:26
  • The second of your suggestions for bigrams is the correct one (see the sketch after this comment thread). The first one is part of a measure for collocation detection: the expected number of bigram occurrences based on the unigram counts. Don't use that for language models! – user2390182 May 01 '16 at 14:28
  • For bigrams, I don't see how P(a b) = P(a)*P(b) = nb(a)/word_count * nb(b)/word_count takes into account the fact that a is followed by b. I have 4 cases: x a, x b, x c, and x d, and I want to choose the most probable word among a, b, c, or d to follow the word x. – user3450862 May 01 '16 at 14:34
  • Exactly! That's why you don't use it. – user2390182 May 01 '16 at 14:36
  • So for bigrams it can be done like this: P(b|a) = nbocurrences_bigram_ab / nbocurrences(a); for trigrams: P(c | a, b) = nbocurrences_trigram_abc / nbocurrences_bigram(ab); and for 4-grams: P(d | a, b, c) = nbocurrences_fourgram_abcd / nbocurrences_trigram(abc)? Or maybe for trigrams it would be P(b, c | a) = P(b|a) * P(c|a, b) ≈ P(b|a) * P(c|b) = (nbocurrences_bigram_ab / nbocurrences(a)) * (nbocurrences_bigram_bc / nbocurrences(b)), and for 4-grams P(b, c, d | a) = P(b|a) * P(c|a, b) * P(d|a, b, c) ≈ P(b|a) * P(c|b) * P(d|c)? – user3450862 May 01 '16 at 16:17
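
To make the bigram formula confirmed in the comments concrete, here is a minimal sketch in Python; the toy corpus and the function name are hypothetical, chosen only for illustration:

    from collections import Counter, defaultdict

    # Hypothetical tokenized training corpus.
    tokens = "the cat sat on the mat and the cat ran".split()

    unigram_counts = Counter(tokens)
    bigram_counts = defaultdict(Counter)
    for a, b in zip(tokens, tokens[1:]):
        bigram_counts[a][b] += 1

    def most_probable_next(a):
        # argmax over b of P(b | a) = count(a, b) / count(a)
        followers = bigram_counts[a]
        return max(followers, key=lambda b: followers[b] / unigram_counts[a])

    print(most_probable_next("the"))  # -> 'cat' in this toy corpus

The same pattern extends to higher orders: P(c | a, b) = count(a, b, c) / count(a, b), i.e. each n-gram count divided by the count of its (n-1)-gram prefix.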