How to calculate bigram estimation without using nltk library?

Question

I am new to Python and have a project to calculate bigram estimates without using any Python packages. I have to use Python 2.7. This is what I have so far. It reads the file hello.txt and produces output like {'Hello','How'} 5. For the bigram estimate I then have to divide 5 by the count of 'Hello' (how many times 'Hello' appears in the whole text file). I am stuck, any help please!

f = open("hello.txt", 'r')
dictionary={}
for line in f:
    for word in line.split():
        items = line.split()
        bigrams = []
        for i in range(len(items) - 1):
            bigrams.append((items[i], items[i+1]))
            my_dict = {i:bigrams.count(i) for i in bigrams}
            # print(my_dict)
            with open('bigram.txt', 'wt') as out:
                out.write(str(my_dict))
f.close()

See https://stackoverflow.com/questions/7591258/fast-n-gram-calculation and https://stackoverflow.com/questions/21883108/fast-optimize-n-gram-implementations-in-python and https://stackoverflow.com/questions/40373414/counting-bigrams-real-fast-with-or-without-multiprocessing-python — alvas, Oct 09 '17 at 22:46
I need bigram estimation... all the other answers just give the bigrams. I need the probability. Example: Count(Hello How) / Count(Hello). Do you know how to do it? — Ash, Oct 10 '17 at 02:17
Ok ! and so that I can get some help I posted this question which you marked duplicate and the links that you provided do not help. N-gram is not helping me right now. — Ash, Oct 10 '17 at 04:34
I'll retract the close but it'll be marked as closed for asking for tutorial though =( — alvas, Oct 10 '17 at 08:41
Your problem can be solved using first order Markov model. Unfortunately, I cannot post the answer because it is marked as duplicate, while in fact it does not seem duplicate. — Mohammed, Oct 12 '17 at 22:10
@alvas the OP is trying to do the task without using any NLP packages. I hope you remove the lock. — Mohammed, Oct 12 '17 at 22:12
Thanks @Mohammed, that is what I was trying to tell him. All the given solutions only count how many times the bigrams appear but don't do the estimation. It is not like I didn't try. But I am new to Python and was getting the wrong answer. Some people just don't get it! — Ash, Oct 19 '17 at 03:32
@alvas That is not a solution! If someone provides merge sort code and asks how to improve it further, you don't say "you need a sorting model". You tell him how it can be solved: if not in code then in pseudocode, if not in pseudocode then in plain English. But you explain; you don't just drop a name. Please don't mark a question as a duplicate and then act like a dumb person after that. You have no right to do that. Now please stop and do not respond to this anymore. I am sick of you giving me model names and not explaining anything. — Ash, Oct 20 '17 at 05:44
Yes, it's the solution, or at least close to what you're looking for... You need a language model. That's what you're referring to as "ngram estimation". Please read up before posting on StackOverflow: https://stackoverflow.com/help/how-to-ask. Most answerers would require you to put some effort into researching a possible answer first, e.g. https://en.wikipedia.org/wiki/Language_model and http://kheafield.com/code/kenlm/ — alvas, Oct 20 '17 at 05:52
(1) Explain why you are stuck and people can better help you. (2) be less aggressive, it's an open platform, ask nicely and clearly; most probably you'll get someone nice enough to help you. (3) read good code, e.g. https://github.com/BigFav/n-grams/blob/master/ngrams.py or `kenlm` and it'll help a lot. Most of us started out from learning from example code ;P — alvas, Oct 20 '17 at 05:57
I have answered your question @Ash. If it is what you require, please accept the answer by hitting the ✔ check mark. If you need further help, please report here. — Mohammed, Oct 21 '17 at 08:17

score 1 · Answer 1 · answered Oct 21 '17 at 00:34

I am answering your question with very simple code, just for the sake of illustration. Please note that bigram estimation is a bit more complicated than you might have thought. It is done in stages: count the unigrams, count the bigrams, then combine the counts. It can be estimated with different models; the code below treats the text as a first-order Markov model, in which each word depends only on the word before it. Please note that the more data you have, the better the estimate. I tested the following code on the Brown Corpus.

def bigramEstimation(file):
    '''A very basic solution for the sake of illustration.
       It can be calculated in a more sophisticated way.
       '''

    lst = [] # This will contain the tokens
    unigrams = {} # for unigrams and their counts
    bigrams = {} # for bigrams and their counts

    # 1. Read the textfile, split it into a list
    with open(file, 'r') as f:  # the file handle is closed once it has been read
        text = f.read()
    lst = text.strip().split()
    print 'Read ', len(lst), ' tokens...'

    del text # No further need for text var



    # 2. Generate unigrams frequencies
    for l in lst:
        if not l in unigrams:
            unigrams[l] = 1
        else:
            unigrams[l] += 1

    print 'Generated ', len(unigrams), ' unigrams...'  

    # 3. Generate bigrams with frequencies
    for i in range(len(lst) - 1):
        temp = (lst[i], lst[i+1]) # Tuples are easier to reuse than nested lists
        if not temp in bigrams:
            bigrams[temp] = 1
        else:
            bigrams[temp] += 1

    print 'Generated ', len(bigrams), ' bigrams...'

    # Now the first-order Markov (bigram) model
    # bigramProb = (Count(bigram) / Count(first_word)) + (Count(first_word)/ total_words_in_corpus)
    # Note: Count(bigram) / Count(first_word) on its own is the estimate of
    # P(second_word | first_word). Adding the unigram term gives a combined
    # score that can exceed 1, so both values are written to the file.
    # A few things we need to keep in mind
    total_corpus = sum(unigrams.values())
    # You can add smoothed estimation if you want


    print 'Calculating bigram probabilities and saving to file...'

    # Open the output once; 'w' starts a fresh file on every run.
    with open("bigrams.txt", 'w') as out:
        # Comment the following line if you do not want the header in the file.
        out.write('Bigram' + '\t' + 'Bigram Count' + '\t' + 'Uni Count' + '\t' + 'Bigram Prob' + '\t' + 'Combined Score' + '\n')

        for k,v in bigrams.iteritems():
            # first_word = hello in ('hello', 'world')
            first_word = k[0]
            first_word_count = unigrams[first_word]
            # float() avoids Python 2 integer division, which truncates the ratios to whole numbers
            bi_prob = float(v) / first_word_count
            uni_prob = float(first_word_count) / total_corpus

            final_prob = bi_prob + uni_prob
            # Delete whatever you don't want to print into a file
            out.write(k[0] + ' ' + k[1] + '\t' + str(v) + '\t' + str(first_word_count) + '\t' + str(bi_prob) + '\t' + str(final_prob) + '\n')




# Example call
bigramEstimation('hello.txt')
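
As a compact cross-check of the plain estimate the question asks about, Count(first second) / Count(first), here is a minimal in-memory sketch. The function name bigram_mle and the sample text are only illustrative (they are not part of the code above), and float() is used so the division behaves the same on Python 2.7 and Python 3.

```python
def bigram_mle(tokens):
    """Return {(first, second): Count(first second) / Count(first)}.

    Illustrative helper, not part of the answer's function.
    """
    # Count every token.
    unigram_counts = {}
    for word in tokens:
        unigram_counts[word] = unigram_counts.get(word, 0) + 1

    # Count every pair of adjacent tokens.
    bigram_counts = {}
    for pair in zip(tokens, tokens[1:]):
        bigram_counts[pair] = bigram_counts.get(pair, 0) + 1

    # Divide each bigram count by the count of its first word.
    probs = {}
    for (first, second), count in bigram_counts.items():
        probs[(first, second)] = float(count) / unigram_counts[first]
    return probs


probs = bigram_mle('Hello Hello How'.split())
print(probs[('Hello', 'How')])    # 0.5
print(probs[('Hello', 'Hello')])  # 0.5
```

For real data, the token list can come from reading the file and calling split() on its contents, as the function above does.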

I hope this helps you!
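
The code comment above mentions smoothed estimation without naming a method. One common choice, assumed here purely for illustration, is add-one (Laplace) smoothing: (Count(first second) + 1) / (Count(first) + V), where V is the number of distinct words. The name bigram_add_one is hypothetical; this is a sketch, not the only way to smooth.

```python
def bigram_add_one(tokens):
    """Return a function prob(first, second) using add-one smoothing.

    Illustrative helper; the smoothing method is an assumption.
    """
    unigram_counts = {}
    for word in tokens:
        unigram_counts[word] = unigram_counts.get(word, 0) + 1

    bigram_counts = {}
    for pair in zip(tokens, tokens[1:]):
        bigram_counts[pair] = bigram_counts.get(pair, 0) + 1

    vocab_size = len(unigram_counts)  # V: number of distinct words

    def prob(first, second):
        # (Count(first second) + 1) / (Count(first) + V)
        numerator = bigram_counts.get((first, second), 0) + 1
        denominator = unigram_counts.get(first, 0) + vocab_size
        return float(numerator) / denominator

    return prob


p = bigram_add_one('Hello Hello How'.split())
print(p('Hello', 'How'))  # (1 + 1) / (2 + 2) = 0.5
print(p('How', 'Hello'))  # (0 + 1) / (1 + 2): an unseen bigram still gets a non-zero value
```

The point of the extra terms is that a bigram never seen in the text no longer gets probability zero.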

See also http://cs.nyu.edu/courses/spring17/CSCI-UA.0480-009/lecture3-and-half-n-grams.pdf — alvas, Oct 21 '17 at 08:06
Thanks for the response, but I think it is a little off. If I have the text "Hello Hello How", then for the bigram P(How | Hello) it should take the count of (Hello How), which is 1, divided by the count of (Hello), which is 2, giving a probability of 1/2. — Ash, Oct 22 '17 at 23:06
I am asking what score you are getting, not what you should get. — Mohammed, Nov 02 '17 at 04:55
